surname.py

Overview

surname.py is a utility module that checks whether a given string is a recognized Chinese surname. It bundles a set of traditional Chinese surnames, covering both single-character and compound (multi-character) family names drawn from historical and modern usage.

The core functionality is provided by a single function isit(n), which checks membership of an input string within this predefined surname set. This module is intended for use in applications that require validation or recognition of Chinese surnames, such as natural language processing, user data validation, or cultural data analysis.

Global Variable:
- m: a set containing the Chinese surnames the module recognizes (single-character and compound).
Function:
- isit(n): checks whether the input string n is a recognized Chinese surname.

Detailed Explanation

Variable: `m`

Type: set[str]
Description:
m contains a collection of Chinese surnames. This includes:
- The 100 most common single-character surnames.
- Hundreds of less common single-character surnames.
- Traditional compound surnames consisting of two or more characters (e.g., "欧阳", "司马", "上官").
Implementation Detail:
The use of a set provides O(1) average time complexity for membership tests, making isit() efficient even with a large number of surnames.

Example content snippet:

m = set([
  "赵", "钱", "孙", "李", ..., "欧阳", "司马", "上官", "夏侯", ...
])
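
To illustrate the lookup cost described above, here is a minimal sketch comparing membership tests on a list and on a set. The names sample_surnames, padding, in_list, and in_set are hypothetical and not part of surname.py, and the padding entries are placeholders that only make the collection large enough for the difference to show.

```python
import timeit

# Hypothetical stand-in for a few entries of the module's set `m`.
sample_surnames = ["赵", "钱", "孙", "李", "欧阳", "司马", "上官", "夏侯"]
# Placeholder entries (not surnames) so the collection is large enough to measure.
padding = [f"placeholder{i}" for i in range(10_000)]

as_list = padding + sample_surnames
as_set = set(as_list)


def in_list(name):
    return name in as_list  # linear scan over the list


def in_set(name):
    return name in as_set  # hash lookup, O(1) on average


if __name__ == "__main__":
    for label, fn in (("list", in_list), ("set", in_set)):
        seconds = timeit.timeit(lambda: fn("夏侯"), number=1_000)
        print(f"{label}: {seconds:.4f}s for 1,000 lookups")
```

Both functions return the same answers; only the time per lookup differs, and the gap grows with the size of the collection.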

Function: `isit(n)`

def isit(n):
    return n.strip() in m

Purpose:
Determines if the input string n matches a known Chinese surname from the set m.
Parameters:
- n (str): A string representing a potential Chinese surname. The function strips leading and trailing whitespace before testing.
Returns:
- bool: True if n (after stripping) is in the surname set m, otherwise False.
Usage Example:

>>> isit("李")
True

>>> isit("  欧阳 ")
True

>>> isit("张三")  # '张三' is a full name, not a surname; '张' alone would match.
False

>>> isit("Smith")
False

Notes:
- The function only matches exact surname strings; it does not parse or extract surnames from longer names or phrases.
- Trimming whitespace ensures that accidental spaces do not affect the check.
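
Because isit() matches exact strings only, callers that need to pull a surname out of a full name have to do the splitting themselves. The following is a minimal sketch of one possible approach, a longest-prefix match. The helper split_surname and its max_len parameter are hypothetical and not part of surname.py, and the small set m below is a stand-in for the module's much larger one.

```python
# Hypothetical stand-in for the module's set `m`; the real set is far larger.
m = {"张", "李", "欧阳", "司马", "上官", "夏侯"}


def isit(n):
    return n.strip() in m


def split_surname(full_name, max_len=2):
    """Return (surname, given_name), or (None, name) if no prefix is a surname.

    max_len is an assumption: the longest surname, in characters, worth trying.
    """
    name = full_name.strip()
    # Try the longest candidate first so "欧阳" wins over a one-character match.
    for length in range(min(max_len, len(name)), 0, -1):
        prefix = name[:length]
        if isit(prefix):
            return prefix, name[length:]
    return None, name


print(split_surname("张三"))    # ('张', '三')
print(split_surname("欧阳修"))  # ('欧阳', '修')
print(split_surname("Smith"))   # (None, 'Smith')
```

Longest-prefix matching is a heuristic: a person with a single-character surname whose given name happens to complete a compound surname would be split incorrectly, so the result should be treated as a guess rather than a definitive parse.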

Implementation Details

The module relies on a statically defined set m which includes an extensive list of Chinese surnames gathered from historical and contemporary sources.
The choice of a set for m ensures fast lookups.
The function isit is minimalist and efficient, performing only a strip and membership check.
No external dependencies or complex algorithms are used.
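
Since isit() calls n.strip() directly, passing a non-string value such as None raises AttributeError. Where input types are not guaranteed, a thin wrapper can guard against that. The sketch below assumes a hypothetical helper named isit_safe, which is not part of surname.py, and uses a small stand-in for m.

```python
# Hypothetical stand-in for the module's set `m`.
m = {"李", "欧阳"}


def isit(n):
    return n.strip() in m


def isit_safe(n):
    """Like isit(), but returns False for non-string input instead of raising."""
    if not isinstance(n, str):
        return False
    return isit(n)


print(isit_safe(None))              # False
print(isit_safe("\u3000李\u3000"))  # True
```

The second call works because str.strip() with no arguments removes all Unicode whitespace, including the full-width ideographic space (U+3000) that often appears in Chinese text input.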

Integration and Interaction

Intended Use Case:
This module is suitable as a standalone surname validator or as a component in larger systems that process Chinese personal names, such as:
- Identity verification systems.
- Chinese NLP pipelines (e.g., named entity recognition).
- Databases requiring surname validation.
- User input sanitization for Chinese names.
Interaction with Other Modules:
The module is independent of other modules. Its intended public interface is isit(), although the set m is also reachable as a module-level name. Other parts of an application can import isit() to validate surnames before further processing or storage.
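
As an illustration of that kind of integration, the sketch below separates incoming records by whether their surname field is recognized. The function partition_by_surname and the record layout (a dict with a "surname" key) are assumptions made for this example. In a real application, isit would be imported from the surname module rather than redefined against a stand-in set as it is here.

```python
# Hypothetical stand-in for the module's set `m`; a real caller would
# import isit from the surname module instead of redefining it.
m = {"李", "张", "欧阳"}


def isit(n):
    return n.strip() in m


def partition_by_surname(records):
    """Split records into (valid, invalid) by their 'surname' field."""
    valid, invalid = [], []
    for record in records:
        target = valid if isit(record["surname"]) else invalid
        target.append(record)
    return valid, invalid


records = [
    {"surname": "李", "given": "白"},
    {"surname": " 欧阳 ", "given": "修"},
    {"surname": "Smith", "given": "John"},
]
valid, invalid = partition_by_surname(records)
print(len(valid), len(invalid))  # 2 1
```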

Diagram: Module Structure

flowchart TD
    A[Input String n] --> B[Strip whitespace]
    B --> C{Is stripped n in set m?}
    C -->|Yes| D[Return True]
    C -->|No| E[Return False]
    style B fill:#f9f,stroke:#333,stroke-width:1px
    style C fill:#bbf,stroke:#333,stroke-width:1px
    style D fill:#afa,stroke:#333,stroke-width:1px
    style E fill:#faa,stroke:#333,stroke-width:1px

Summary

surname.py provides a fast and simple way to check whether a string is a recognized Chinese surname. Its surname set covers both common and rare family names, including compound surnames, and its minimal design makes it easy to integrate into larger systems that need Chinese surname validation.
