regions.py
Overview
The regions.py file provides a hierarchical mapping and utility functions to work with Chinese administrative regions and some international locations. It maintains a large dictionary (TBL) that maps region identifiers (IDs) to their corresponding names and parent regions, effectively defining a tree structure of administrative divisions. The file includes helper functions to:
Retrieve the full hierarchical name chain of a region given its ID.
Check whether a given string corresponds to a known region name (with some normalization).
This file is primarily used for managing and querying geographic region information, useful for applications involving location data, regional analysis, or address normalization.
Data Structures
TBL (dict)
Type: dict[str, dict[str, str]]
Description: A dictionary mapping region IDs (as strings) to dictionaries containing:
"name": The official name of the region (usually in Chinese characters).
"parent": The ID of the parent region, forming a hierarchical structure.
Example Entry:
"2": {"name": "北京", "parent": "1"}  # Beijing is a child of the region with ID "1"
Notes:
The top-level parent "1" likely represents the country or root region.
The dictionary includes provinces, cities, autonomous regions, special administrative regions, and some international entities.
The structure supports multi-level hierarchy, e.g., province -> city -> district.
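To make the shape of TBL concrete, here is a minimal sketch of a three-entry excerpt and a one-step parent lookup. SAMPLE_TBL and parent_name are hypothetical names used only for illustration. The "2" and "33" entries follow the examples in this document, while the "378" entry is an invented placeholder and not a claim about the real table.

```python
from typing import Optional

# Hypothetical excerpt in the same shape as TBL: id -> {"name", "parent"}.
SAMPLE_TBL = {
    "2": {"name": "北京", "parent": "1"},
    "33": {"name": "北京市", "parent": "2"},
    "378": {"name": "海淀区", "parent": "33"},  # invented ID, for illustration only
}


def parent_name(region_id: str) -> Optional[str]:
    """Return the name of the parent region, or None if it cannot be resolved."""
    entry = SAMPLE_TBL.get(region_id)
    if entry is None:
        return None  # unknown region ID
    parent = SAMPLE_TBL.get(entry["parent"])
    # The parent ID may have no entry of its own (e.g. "1" in this excerpt).
    return parent["name"] if parent else None


print(parent_name("378"))  # 北京市
print(parent_name("2"))    # None, because "1" is not a key in this excerpt
```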
NM_SET (set)
Type: set[str]
Description: A set of all region names extracted from the TBL data for fast membership testing.
Initialization:
NM_SET = set([v["name"] for _, v in TBL.items()])
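The same construction can be written as a set comprehension. The sketch below uses a small hypothetical SAMPLE_TBL rather than the real table, and shows that duplicate names collapse into a single set member.

```python
# Hypothetical excerpt in the same shape as TBL; the "99" entry is invented
# to show that repeated names are deduplicated by the set.
SAMPLE_TBL = {
    "2": {"name": "北京", "parent": "1"},
    "33": {"name": "北京市", "parent": "2"},
    "99": {"name": "北京", "parent": "0"},
}

# Equivalent to set([v["name"] for _, v in SAMPLE_TBL.items()]).
SAMPLE_NM_SET = {entry["name"] for entry in SAMPLE_TBL.values()}

print("北京" in SAMPLE_NM_SET)  # True
print("上海" in SAMPLE_NM_SET)  # False
print(len(SAMPLE_NM_SET))       # 2
```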
Functions
get_names(id)
Retrieve the hierarchical name chain for a given region ID.
Parameters:
id (str or any type convertible to str): The region ID to query.
Returns:
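list[str]: List of region names starting from the queried region and continuing up through its ancestors. Returns an empty list if id is None, empty, "none", or a numeric ID that is not found in TBL. A non-numeric id is returned unchanged as a single-element list.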
Behavior and Implementation Details:
If id is None, empty, or the string "none" (case-insensitive), returns an empty list.
Converts the id to a string.
Uses a regular expression to check if id contains only digits.
If not, returns a list containing the id itself (assumed to be a name).
Looks up id in TBL.
If not found, returns an empty list.
Otherwise, appends the "name" associated with the id to the result list.
Recursively calls get_names on the "parent" ID and extends the result list with the parent's names.
Returns the combined list.
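The steps above can be sketched as follows. This is a reconstruction from the description, not the module's actual source: get_names_sketch and SAMPLE_TBL are hypothetical names, the two table entries follow the examples in this document, and the exact regular expression is an assumption.

```python
import re

# Hypothetical excerpt in the same shape as TBL.
SAMPLE_TBL = {
    "2": {"name": "北京", "parent": "1"},
    "33": {"name": "北京市", "parent": "2"},
}


def get_names_sketch(region_id):
    """Collect region names from region_id up through its ancestors."""
    # None, empty values, and the string "none" (any case) yield nothing.
    if not region_id or str(region_id).lower() == "none":
        return []
    region_id = str(region_id).strip()
    # Anything that is not purely digits is assumed to already be a name.
    if not re.fullmatch(r"[0-9]+", region_id):
        return [region_id]
    entry = SAMPLE_TBL.get(region_id)
    if entry is None:
        return []  # numeric ID with no entry: recursion stops here
    # Own name first, then the names of all ancestors.
    return [entry["name"]] + get_names_sketch(entry["parent"])


print(get_names_sketch("33"))          # ['北京市', '北京']
print(get_names_sketch(2))             # ['北京']
print(get_names_sketch("invalid_id"))  # ['invalid_id']
print(get_names_sketch(None))          # []
```

Note that in this sketch the numeric root ID "1" never shows up in the result: it has no entry in the sample table, so the lookup for it returns an empty list.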
Example Usage:
>>> get_names("33")
['北京市', '北京']
>>> get_names(2)
['北京']
>>> get_names("invalid_id")
['invalid_id']
>>> get_names("北京")
['北京']
Notes:
Returns names from the specific region upward to the root.
It can handle IDs given as integers or strings.
The top-level parent ID "1" does not appear as a key in TBL, so the lookup for it returns an empty list and the recursion stops there.
isName(nm)
Check if a given string corresponds to a known region name, accounting for common suffix variations.
Parameters:
nm (str): The name string to check.
Returns:
bool: True if nm matches a known region name or a normalized variant, otherwise False.
Behavior and Implementation Details:
Checks if nm is directly in NM_SET.
Checks if nm + "市" (adding the suffix "市", meaning "city") is in NM_SET.
Uses a regular expression to remove suffixes such as "省" (province), "自治区" (autonomous region), or ethnic-group-specific autonomous region suffixes like "回族自治区", "壮族自治区", "维吾尔自治区" from the end of nm.
Checks if the resulting stripped string is in NM_SET.
Returns True if any of the above checks succeed; otherwise, False.
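The three checks can be sketched as below. This is an illustration built from the description, not the module's source: is_name_sketch, SAMPLE_NM_SET, and the exact suffix pattern are assumptions, and the sample names were chosen so that each check is exercised.

```python
import re

# Hypothetical stand-in for NM_SET; the real set is derived from TBL.
SAMPLE_NM_SET = {"北京市", "上海市", "广东", "新疆"}

# Assumed pattern: strips a trailing "省", or a trailing "自治区" together
# with an optional ethnic-group prefix.
_SUFFIX_RE = re.compile(r"(省|(回族|壮族|维吾尔)?自治区)$")


def is_name_sketch(nm: str) -> bool:
    """Return True if nm, or a normalized variant of it, is a known name."""
    if nm in SAMPLE_NM_SET:            # 1. exact match
        return True
    if nm + "市" in SAMPLE_NM_SET:     # 2. match after adding the "city" suffix
        return True
    # 3. match after stripping a province / autonomous-region suffix
    return _SUFFIX_RE.sub("", nm) in SAMPLE_NM_SET


print(is_name_sketch("北京市"))            # True (exact)
print(is_name_sketch("上海"))              # True ("上海市" is known)
print(is_name_sketch("新疆维吾尔自治区"))  # True (stripped to "新疆")
print(is_name_sketch("不存在的地区"))      # False
```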
Example Usage:
>>> isName("北京")
True
>>> isName("北京市")
True
>>> isName("新疆维吾尔自治区")
True
>>> isName("不存在的地区")
False
Notes:
This function helps normalize and match user input or data that may include or omit typical regional suffixes.
Useful for validating or filtering region names.
Implementation Details and Algorithms
The hierarchical data is stored in a flat dictionary, TBL, indexed by string IDs.
Parent-child relationships are encoded via the "parent" key.
The get_names function uses recursion to ascend the hierarchy and collect names.
Regular expressions are used in get_names to validate ID format and in isName to normalize region names by removing suffixes.
The NM_SET set provides O(1) average time complexity for membership checks in isName.
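Because each entry stores only its parent, walking up the tree is a chain of dictionary lookups, while walking down requires an inverted index. The sketch below builds such an index from a hypothetical excerpt. build_children_index is not part of regions.py, and the "3" entry is an assumed value for illustration.

```python
from collections import defaultdict

# Hypothetical excerpt in the same shape as TBL.
SAMPLE_TBL = {
    "2": {"name": "北京", "parent": "1"},
    "3": {"name": "天津", "parent": "1"},  # assumed entry, for illustration
    "33": {"name": "北京市", "parent": "2"},
}


def build_children_index(tbl):
    """Invert the child -> parent links into parent -> [child IDs]."""
    children = defaultdict(list)
    for region_id, entry in tbl.items():
        children[entry["parent"]].append(region_id)
    return dict(children)


index = build_children_index(SAMPLE_TBL)
print(index["1"])  # ['2', '3']
print(index["2"])  # ['33']
```

Building the index is a single pass over the table, after which each "children of X" query is one dictionary lookup.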
Interaction with Other Parts of the System/Application
Input/Output:
Functions expect region IDs or names as input and return names or boolean validation flags.
Dependencies:
Only the Python re module is imported for regular expressions.
Potential Usage Scenarios:
Address parsing and normalization.
Geographic data processing or validation.
User interface elements that allow selection of regions.
Data analytics requiring region hierarchies.
No external dependencies or side-effects:
The file is self-contained and purely functional.
Diagram: Class/Function Structure
Since this file is a utility module without classes, the following flowchart shows the relationships between main functions and data structures.
flowchart TD
TBL["TBL (Region data dict)"]
NM_SET["NM_SET (Set of region names)"]
get_names["get_names(id)"]
isName["isName(nm)"]
subgraph Data
TBL
NM_SET
end
subgraph Functions
get_names
isName
end
get_names -->|Uses| TBL
isName -->|Uses| NM_SET
isName -->|Uses regex for suffix removal| re[/"re module"/]
get_names -->|Uses regex for ID validation| re
Summary
regions.py provides a comprehensive hierarchical dataset of Chinese and some international regions and utility functions to query and validate region names. It is optimized for quick membership tests and hierarchical lookups, supporting applications that require geographic region management or normalization.
Example Usage Summary
import regions
# Get full name hierarchy for a region ID
names = regions.get_names("33")
print(names) # Output: ['北京市', '北京']
# Check if a string is a valid region name
print(regions.isName("新疆维吾尔自治区")) # Output: True
print(regions.isName("不存在的地方")) # Output: False