regions.py

Overview

The regions.py file provides a hierarchical mapping and utility functions to work with Chinese administrative regions and some international locations. It maintains a large dictionary (TBL) that maps region identifiers (IDs) to their corresponding names and parent regions, effectively defining a tree structure of administrative divisions. The file includes helper functions to:

- Retrieve the full hierarchical name chain of a region given its ID.
- Check whether a given string corresponds to a known region name (with some normalization).

This file is primarily used for managing and querying geographic region information, useful for applications involving location data, regional analysis, or address normalization.

Data Structures

`TBL` (dict)

Type: dict[str, dict[str, str]]
Description:
A dictionary mapping region IDs (as strings) to dictionaries containing:
- "name": The official name of the region (usually in Chinese characters).
- "parent": The ID of the parent region, forming a hierarchical structure.

Example Entry:

"2": {"name": "北京", "parent": "1"}  # Beijing is a child of region with ID "1"

Notes:
- The top-level parent "1" likely represents the country or root region.
- The dictionary includes provinces, cities, autonomous regions, special administrative regions, and some international entities.
- The structure supports multi-level hierarchy, e.g., province -> city -> district.
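To make the flat-dictionary-as-tree idea concrete, the sketch below inverts the "parent" links of a tiny excerpt to list each region's children. SAMPLE_TBL is a hypothetical stand-in for TBL: the "2" entry comes from the example above, "33" is assumed from the get_names examples in this document, and "34" is invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical excerpt; the real TBL is far larger and its IDs may differ.
SAMPLE_TBL = {
    "2": {"name": "北京", "parent": "1"},
    "33": {"name": "北京市", "parent": "2"},    # assumed from the get_names examples
    "34": {"name": "东城区", "parent": "33"},   # invented district entry
}

# Invert the child -> parent links into a parent -> children index.
children = defaultdict(list)
for region_id, entry in SAMPLE_TBL.items():
    children[entry["parent"]].append(region_id)

print(dict(children))  # {'1': ['2'], '2': ['33'], '33': ['34']}
```

Each key of the index is a parent ID and each value lists its direct children, which is the province -> city -> district shape described in the notes.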

`NM_SET` (set)

Type: set[str]
Description:
A set of all region names extracted from the TBL data for fast membership testing.

Initialization:

NM_SET = set([v["name"] for _, v in TBL.items()])
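As a small illustration of what this initialization produces, the sketch below builds the same kind of set from a hypothetical two-entry table and compares it with an equivalent set comprehension. SAMPLE_TBL and both variable names are assumptions made for this example only.

```python
# Hypothetical excerpt standing in for TBL.
SAMPLE_TBL = {
    "2": {"name": "北京", "parent": "1"},
    "33": {"name": "北京市", "parent": "2"},
}

# Form shown in this document: build a list, then convert it to a set.
names_from_list = set([v["name"] for _, v in SAMPLE_TBL.items()])

# Equivalent set comprehension that skips the intermediate list.
names_from_comprehension = {v["name"] for v in SAMPLE_TBL.values()}

print(names_from_list == names_from_comprehension)  # True
print("北京市" in names_from_list)                   # True
```

Both forms yield the same set; only names are kept, so the IDs and parent links cannot be recovered from it.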

Functions

`get_names(id)`

Retrieve the hierarchical name chain for a given region ID.

Parameters:
- id (str or any type convertible to str): The region ID to query.
Returns:
- list[str]: Region names starting from the queried region and continuing up through its ancestors. Returns an empty list if id is None, empty, "none", or a numeric ID not present in TBL; a non-numeric id is returned unchanged as a single-element list.
Behavior and Implementation Details:
1. If id is None, empty, or the string "none" (case-insensitive), returns an empty list.
2. Converts the id to a string.
3. Uses a regular expression to check if id contains only digits.
  - If not, returns a list containing the id itself (assumed to be a name).
4. Looks up id in TBL.
  - If not found, returns an empty list.
5. Otherwise, appends the "name" associated with the id to the result list.
6. Recursively calls get_names on the "parent" ID and extends the result list with the parent's names.
7. Returns the combined list.
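The steps above can be sketched as follows. This is a minimal reconstruction from the description, not the module's actual source: get_names_sketch, its table parameter, the exact regular expression, and SAMPLE_TBL are all assumptions introduced for illustration.

```python
import re

# Hypothetical excerpt standing in for TBL.
SAMPLE_TBL = {
    "2": {"name": "北京", "parent": "1"},
    "33": {"name": "北京市", "parent": "2"},
}

def get_names_sketch(region_id, table=SAMPLE_TBL):
    """Collect names from the given region up through its ancestors."""
    # Step 1: None, empty values, and the string "none" yield no names.
    if not region_id or str(region_id).lower() == "none":
        return []
    # Step 2: normalize to a string so integer IDs work too.
    region_id = str(region_id)
    # Step 3: anything that is not all digits is treated as a name already.
    if not re.fullmatch(r"[0-9]+", region_id):
        return [region_id]
    # Step 4: unknown numeric IDs yield no names.
    entry = table.get(region_id)
    if not entry:
        return []
    # Steps 5-7: this region's name, followed by its ancestors' names.
    return [entry["name"]] + get_names_sketch(entry["parent"], table)

print(get_names_sketch("33"))          # ['北京市', '北京']
print(get_names_sketch("invalid_id"))  # ['invalid_id']
```

With this sample table the chain ends at "北京" because "1" has no entry, so the recursive call on "1" returns an empty list.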

Example Usage:

>>> get_names("33")
['北京市', '北京']

>>> get_names(2)
['北京']

>>> get_names("invalid_id")  # non-numeric input is returned unchanged
['invalid_id']

>>> get_names(None)
[]

>>> get_names("北京")
['北京']

Notes:
- Returns names from the specific region upward to the root.
- It can handle IDs given as integers or strings.
- ID "1", the parent of the top-level entries, does not appear as a key in TBL, so the lookup for it returns an empty list and the recursion stops there.
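Because the returned list runs from the queried region upward, a caller that wants a conventional top-down path has to reverse it. The helper below is a hypothetical convenience, not part of regions.py; format_path and its sep parameter are names invented for this sketch.

```python
def format_path(names, sep=""):
    """Join a leaf-first name list into a root-first path string."""
    return sep.join(reversed(names))

# A leaf-first list of the kind get_names is described as returning.
leaf_first = ["北京市", "北京"]

print(format_path(leaf_first, sep=" > "))  # 北京 > 北京市
```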

`isName(nm)`

Check if a given string corresponds to a known region name, accounting for common suffix variations.

Parameters:
- nm (str): The name string to check.
Returns:
- bool: True if nm matches a known region name or a normalized variant, otherwise False.
Behavior and Implementation Details:
1. Checks if nm is directly in NM_SET.
2. Checks if nm + "市" (adding the suffix "市" meaning "city") is in NM_SET.
3. Uses a regular expression to remove suffixes such as "省" (province), "自治区" (autonomous region), or ethnic group specific autonomous region suffixes like "回族自治区", "壮族自治区", "维吾尔自治区" from the end of nm.
  - Checks if the resulting stripped string is in NM_SET.
4. Returns True if any of the above checks succeed; otherwise, False.
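The three checks can be sketched as below. This is an illustration built from the description, not the module's actual code: is_name_sketch, SAMPLE_NM_SET, and the specific suffix pattern are assumptions, and the real pattern may differ.

```python
import re

# Hypothetical name set standing in for NM_SET.
SAMPLE_NM_SET = {"北京", "北京市", "新疆", "广东", "深圳市"}

# One possible pattern for the suffixes listed above: "省", or "自治区"
# optionally preceded by an ethnic-group qualifier.
SUFFIX_RE = re.compile(r"(省|(回族|壮族|维吾尔)*自治区)$")

def is_name_sketch(nm, names=SAMPLE_NM_SET):
    """Return True if nm, nm + "市", or nm without its suffix is known."""
    if nm in names:                # 1. exact match
        return True
    if nm + "市" in names:         # 2. match after appending "市"
        return True
    # 3. match after stripping a province / autonomous-region suffix
    return SUFFIX_RE.sub("", nm) in names

print(is_name_sketch("深圳"))              # True, via "深圳市"
print(is_name_sketch("广东省"))            # True, via "广东"
print(is_name_sketch("新疆维吾尔自治区"))  # True, via "新疆"
print(is_name_sketch("不存在的地区"))      # False
```

The checks are ordered from cheapest to most expensive, so the regular expression only runs when both direct set lookups fail.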

Example Usage:

>>> isName("北京")
True
>>> isName("北京市")
True
>>> isName("新疆维吾尔自治区")
True
>>> isName("不存在的地区")
False

Notes:
- This function helps normalize and match user input or data that may include or omit typical regional suffixes.
- Useful for validating or filtering region names.

Implementation Details and Algorithms

The hierarchical data is stored in a flat dictionary TBL indexed by string IDs.
Parent-child relationships are encoded via the "parent" key.
The get_names function uses recursion to ascend the hierarchy and collect names.
Regular expressions are used in get_names to validate ID format and in isName to normalize region names by removing suffixes.
The NM_SET set provides O(1) average time complexity for membership checks in isName.
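The recursive ascent could equally be written as a loop. The variant below is a sketch of that alternative, not the approach this document describes; it adds a visited set so that malformed data with cyclic parent links cannot loop forever, and it leaves out the None and non-numeric handling. names_iterative and both sample tables are hypothetical.

```python
def names_iterative(region_id, table):
    """Walk the parent links with a loop, stopping at unknown or repeated IDs."""
    names = []
    seen = set()
    current = str(region_id)
    while current in table and current not in seen:
        seen.add(current)
        names.append(table[current]["name"])
        current = str(table[current]["parent"])
    return names

# Hypothetical well-formed excerpt.
sample = {
    "2": {"name": "北京", "parent": "1"},
    "33": {"name": "北京市", "parent": "2"},
}
# Hypothetical malformed table whose two entries point at each other.
cyclic = {
    "7": {"name": "A", "parent": "8"},
    "8": {"name": "B", "parent": "7"},
}

print(names_iterative("33", sample))  # ['北京市', '北京']
print(names_iterative("7", cyclic))   # ['A', 'B']
```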

Interaction with Other Parts of the System/Application

Input/Output:
- Functions expect region IDs or names as input and return names or boolean validation flags.
Dependencies:
- Only the Python re module is imported for regular expressions.
Potential Usage Scenarios:
- Address parsing and normalization.
- Geographic data processing or validation.
- User interface elements that allow selection of regions.
- Data analytics requiring region hierarchies.
No external dependencies or side-effects:
- The file is self-contained and purely functional.

Diagram: Class/Function Structure

Since this file is a utility module without classes, the following flowchart shows the relationships between main functions and data structures.

flowchart TD
    TBL["TBL (Region data dict)"]
    NM_SET["NM_SET (Set of region names)"]
    
    get_names["get_names(id)"]
    isName["isName(nm)"]
    
    subgraph Data
      TBL
      NM_SET
    end
    
    subgraph Functions
      get_names
      isName
    end
    
    get_names -->|Uses| TBL
    isName -->|Uses| NM_SET
    isName -->|Uses regex for suffix removal| re[/"re module"/]
    get_names -->|Uses regex for ID validation| re

Summary

regions.py provides a hierarchical dataset of Chinese and some international regions, together with two utility functions that query the hierarchy and validate region names. Name checks are set lookups and hierarchy lookups are dictionary accesses, which suits applications that need geographic region management or normalization.

Example Usage Summary

import regions

# Get full name hierarchy for a region ID
names = regions.get_names("33")
print(names)  # Output: ['北京市', '北京']

# Check if a string is a valid region name
print(regions.isName("新疆维吾尔自治区"))  # Output: True
print(regions.isName("不存在的地方"))      # Output: False
