corporations.py

Overview

The corporations.py file provides utilities for normalizing, tagging, and validating corporate names within the InfiniFlow system. It primarily focuses on cleaning up raw corporate name inputs, standardizing them into a normalized form, and associating them with metadata such as tags or quality indicators based on preloaded datasets. The file integrates natural language processing (NLP) techniques and region-based heuristics to handle variations in corporate names, especially those in Chinese.

Key functionalities include:

- Loading and managing auxiliary data resources related to corporations.
- Normalizing corporate names by removing noise, standardizing characters, and filtering irrelevant tokens.
- Determining if a corporate name is considered "good" (trusted or valid).
- Tagging corporate names with associated metadata.
- Providing length lookup for corporate entries from an external dataset.

Detailed Explanation of Classes, Functions, and Methods

This file contains only functions and global data structures; no classes are defined.

Global Data Structures

GOODS (pandas.DataFrame):
Loaded from res/corp_baike_len.csv and indexed by corporate ID (cid). Contains length-related metadata for corporations; missing values are filled with 0.
CORP_TKS (dict):
Token frequency dictionary loaded from res/corp.tks.freq.json, used as a filter to remove common or irrelevant tokens during normalization.
GOOD_CORP (set):
Set of "good" corporate names loaded from res/good_corp.json. Each entry is normalized with rmNoise and corpNorm at load time.
CORP_TAG (dict):
Dictionary mapping normalized corporate names to their associated tags, loaded from res/corp_tag.json and normalized similarly.

Functions

`baike(cid, default_v=0) -> int or float`

Returns the length value associated with a corporate ID from the GOODS dataset.

Parameters:
- cid (str or int): Corporate ID to look up.
- default_v (int or float, optional): Default return value if cid is not found in GOODS. Defaults to 0.
Returns:
- The length (int or float) associated with the corporate ID, or default_v if not found.
Usage Example:
```
length = baike("12345")
```
Details:
Uses pandas .loc to access the row by corporate ID and retrieve the "len" column. Returns default_v if the lookup fails.
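
The lookup-with-fallback behavior described above can be sketched as follows. This is a minimal illustration rather than the module's own code: the `goods` frame is a made-up stand-in for res/corp_baike_len.csv, `baike_sketch` is a hypothetical name, and casting cid to str is an assumption about how integer IDs are matched against the index.

```python
import pandas as pd

# Toy stand-in for res/corp_baike_len.csv; the IDs and values are made up.
goods = pd.DataFrame({"cid": ["1001", "1002"], "len": [42.0, None]}).fillna(0)
goods = goods.set_index("cid")


def baike_sketch(cid, default_v=0):
    """Return the "len" value for cid, or default_v when cid is unknown."""
    try:
        # Assumption: IDs are stored as strings, so integer input is cast.
        return goods.loc[str(cid), "len"]
    except KeyError:
        return default_v


known = baike_sketch("1001")                  # value stored for this ID
filled = baike_sketch(1002)                   # missing value was filled with 0
missing = baike_sketch("9999", default_v=-1)  # unknown ID falls back to default_v
```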

`corpNorm(nm, add_region=True) -> str`

Normalizes a corporate name string by cleaning, tokenizing, and removing irrelevant words or tokens.

Parameters:
- nm (str): Raw corporate name to normalize.
- add_region (bool, optional): Whether to append a detected region name (in parentheses) to the normalized string. Defaults to True.
Returns:
- Normalized corporate name string.

Usage Example:

```
normalized_name = corpNorm("北京市华为技术有限公司")
# Output might be: "华为(北京)"
```

Implementation Details:
- Converts traditional Chinese characters to simplified and converts full-width characters to half-width.
- Lowercases the string.
- Removes HTML escape sequences such as &amp;.
- Removes punctuation, corporate suffixes (e.g., Ltd., Inc.), and generic company descriptors (e.g., 技术, 有限公司).
- Uses a tokenizer (rag_tokenizer.tokenize) to split the name into tokens.
- Removes tokens that are region names or common corporate tokens (CORP_TKS).
- Attempts to detect region names and append them if add_region is True.
- Uses regex patterns to trim certain trailing or leading characters.
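
The character-level part of these steps can be sketched as below. It is a simplified illustration under stated assumptions: `to_half_width` and `clean_name_sketch` are hypothetical names, the suffix patterns are examples rather than the module's actual regexes, and traditional-to-simplified conversion and tokenization are left out because they depend on rag_tokenizer.

```python
import re


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:              # ideographic space -> ASCII space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:  # full-width forms sit 0xFEE0 above ASCII
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


# Illustrative suffix patterns only; the real module may use different ones.
EN_SUFFIX = re.compile(r"\s+(co\.|corp\.|inc\.|ltd\.?).*$")
ZH_SUFFIX = re.compile(r"(技术)?(有限公司|公司)$")


def clean_name_sketch(raw: str) -> str:
    name = to_half_width(raw).lower()
    name = re.sub(r"[()+*\t]+", " ", name)  # replace noisy punctuation with a space
    name = EN_SUFFIX.sub("", name)          # drop "ltd.", "inc.", ... and what follows
    name = ZH_SUFFIX.sub("", name)          # drop trailing generic descriptors
    return name.strip()
```

With these patterns, `clean_name_sketch("Ａｃｍｅ　Ｌｔｄ．")` yields `"acme"` and `clean_name_sketch("华为技术有限公司")` yields `"华为"`.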

`rmNoise(n) -> str`

Removes noise characters and substrings commonly found in corporate names.

Parameters:
- n (str): Corporate name string.
Returns:
- String with noise characters removed.
Behavior:
- Removes substrings within parentheses (both round () and Chinese （）).
- Removes characters like commas, periods, spaces, ampersands, and brackets.

Usage Example:

```
clean_name = rmNoise("华为（中国）有限公司")
# Output: "华为有限公司"
```
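
A two-pass regex is one plausible way to get the behavior listed above. The sketch below is an assumption about the approach, not the module's source: `rm_noise_sketch` is a hypothetical name, and "brackets" is read here as round and full-width parentheses only.

```python
import re


def rm_noise_sketch(name: str) -> str:
    # Pass 1: drop a parenthesized segment together with its delimiters,
    # accepting both ASCII () and full-width （） forms.
    name = re.sub(r"[\(（][^()（）]+[\)）]", "", name)
    # Pass 2: drop leftover commas, periods, spaces, ampersands, and
    # any stray parentheses.
    return re.sub(r"[,. &（）()]+", "", name)
```

For the usage example above, `rm_noise_sketch("华为（中国）有限公司")` returns `"华为有限公司"`, and an English input such as `"Acme, Inc. (US) & Co."` collapses to `"AcmeIncCo"`.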

`is_good(nm) -> bool`

Determines if a corporate name is considered "good" (trusted or valid) based on the GOOD_CORP set.

Parameters:
- nm (str): Corporate name string to check.
Returns:
- True if the name is in the good corporate set; otherwise False.

Usage Example:

```
is_valid = is_good("华为技术有限公司")
# True or False
```

Implementation Details:
- Returns False immediately for names containing the substring "外派" (dispatched or seconded staff).
- Removes noise and normalizes the name.
- Checks if the normalized name matches or contains any entry in GOOD_CORP.
- For alphanumeric-only entries in GOOD_CORP, an exact match is required. Otherwise, substring containment suffices.
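
The exact-versus-substring rule can be illustrated with a small self-contained function. The names here are hypothetical: `is_good_sketch` stands in for is_good, `good_names` for GOOD_CORP, and `normalize` for the rmNoise plus corpNorm chain (identity by default). Skipping empty entries is an added safeguard, not something the description above states.

```python
import re


def is_good_sketch(name, good_names, normalize=lambda s: s):
    if "外派" in name:
        return False
    norm = normalize(name)
    for good in good_names:
        if not good:
            continue  # added guard: an empty entry would match every name
        if re.fullmatch(r"[0-9a-zA-Z]+", good):
            # Alphanumeric-only entries must match the whole name.
            if good == norm:
                return True
        elif good in norm:
            # Other entries only need to occur somewhere in the name.
            return True
    return False


good = {"ibm", "华为"}
exact_hit = is_good_sketch("ibm", good)          # True: exact match
partial_ascii = is_good_sketch("ibmx", good)     # False: "ibm" needs an exact match
partial_zh = is_good_sketch("华为技术", good)     # True: substring is enough
dispatched = is_good_sketch("华为外派", good)     # False: contains "外派"
```

Requiring an exact match for alphanumeric entries avoids treating a short Latin name as a hit just because it appears inside a longer, unrelated word.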

`corp_tag(nm) -> list`

Retrieves tags associated with a corporate name based on the CORP_TAG dictionary.

Parameters:
- nm (str): Corporate name string.
Returns:
- List of tags if found, otherwise an empty list [].

Usage Example:

```
tags = corp_tag("华为技术有限公司")
# Output depends on res/corp_tag.json, e.g. ["tech", "telecom"]
```

Implementation Details:
- Removes noise and normalizes the name.
- Iterates through keys in CORP_TAG.
- Performs exact match for alphanumeric keys or substring match otherwise.
- Skips matches where the tag is very short and the normalized name is significantly longer (to avoid false positives).
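
A minimal sketch of this lookup follows, with several assumptions labeled: `corp_tag_sketch`, `tag_map`, and `normalize` are hypothetical stand-ins, the "very short" check is applied to the matched key, and the length thresholds (key shorter than 3, name longer than 5) are illustrative values chosen for the example.

```python
import re

SHORT_KEY = 3   # assumed threshold for a "very short" key
LONG_NAME = 5   # assumed threshold for a "significantly longer" name


def corp_tag_sketch(name, tag_map, normalize=lambda s: s):
    norm = normalize(name)
    for key, tags in tag_map.items():
        if re.fullmatch(r"[0-9a-zA-Z]+", key):
            if key == norm:          # alphanumeric keys: exact match only
                return tags
        elif key and key in norm:    # other keys: substring match
            if len(key) < SHORT_KEY and len(norm) > LONG_NAME:
                continue             # short key inside a long name: likely noise
            return tags
    return []


tag_map = {"ibm": ["it"], "华为": ["telecom"]}
short_name = corp_tag_sketch("华为技术", tag_map)          # ["telecom"]
long_name = corp_tag_sketch("深圳华为技术服务部", tag_map)  # []: key too short for this name
```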

Implementation Details and Algorithms

Normalization Pipeline:
1. Traditional to Simplified Chinese conversion and full-width to half-width character normalization using rag_tokenizer.
2. Removal of HTML entities and noisy punctuation.
3. Removal of common suffix patterns for corporate names (e.g., Inc., Ltd.).
4. Tokenization and filtering out tokens that represent region names or are part of a frequent corporate token list.
5. Region extraction from tokens and optional appending to normalized name.
6. Use of regex for additional trimming of trailing or leading character sequences.
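
Steps 4 and 5 can be sketched independently of the tokenizer by taking an already tokenized name as input. Everything named here is hypothetical: `filter_tokens_sketch` is not part of the module, `is_region` stands in for the regions module's check, and `common_tokens` stands in for CORP_TKS.

```python
def filter_tokens_sketch(tokens, is_region, common_tokens, add_region=True):
    # Step 5 (extraction): remember region tokens before they are filtered out.
    regions_found = [t for t in tokens if is_region(t)]
    # Step 4: keep tokens that are neither regions nor frequent corporate tokens.
    kept = [t for t in tokens if not is_region(t) and t not in common_tokens]
    # Tokens are concatenated without spaces, which suits Chinese names;
    # Latin tokens would need a separator.
    name = "".join(kept)
    # Step 5 (appending): add the first detected region in parentheses.
    if add_region and regions_found:
        name += "(%s)" % regions_found[0]
    return name


toy_regions = {"北京", "上海"}
toy_common = {"科技": 100}  # made-up frequency entry
with_region = filter_tokens_sketch(
    ["北京", "华为", "科技"], toy_regions.__contains__, toy_common
)  # "华为(北京)"
without_region = filter_tokens_sketch(
    ["北京", "华为", "科技"], toy_regions.__contains__, toy_common, add_region=False
)  # "华为"
```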
Data Loading:
- All auxiliary data files (CSV and JSON) are loaded once at module load time for efficiency.
- Normalization of these datasets is done upfront to ensure quick matching during runtime.
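
The load-once, normalize-upfront pattern might look like the sketch below. Inline JSON strings stand in for res/good_corp.json and res/corp_tag.json, the sample entries are invented, and `normalize` is a hypothetical, much simplified substitute for the rmNoise plus corpNorm chain.

```python
import json
import re


def normalize(name: str) -> str:
    """Simplified stand-in: strip parenthesized parts and separators, lowercase."""
    name = re.sub(r"[\(（][^()（）]+[\)）]", "", name)
    return re.sub(r"[,. &]+", "", name).lower()


# Made-up data in place of the resource files.
good_raw = json.loads('["华为（中国）", "I.B.M."]')
tag_raw = json.loads('{"华为（中国）": ["telecom"], "I.B.M.": ["it"]}')

# Normalize once at import time so later lookups compare like with like.
GOOD_CORP_SKETCH = {normalize(n) for n in good_raw}
CORP_TAG_SKETCH = {normalize(k): v for k, v in tag_raw.items()}
```

Because the stored keys and the query both pass through the same normalizer, a lookup only has to normalize the incoming name, not every stored entry.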
Matching Logic:
- Names are checked against the known good corporations and tag keys with exact matching for alphanumeric-only entries (detected by regex) and substring matching for all other entries.
- Uses heuristics to avoid false matches for very short tags.

Interaction with Other Parts of the System

rag.nlp.rag_tokenizer:
Used heavily for text normalization and tokenization functions.
regions module:
Used to verify if tokens represent valid region names, influencing normalization and tagging.
Resource files (in res/ directory):
- corp_baike_len.csv: Provides length metadata for corporate IDs.
- corp.tks.freq.json: Contains frequent corporate tokens to filter out during normalization.
- good_corp.json: List of trusted corporate names.
- corp_tag.json: Map of corporate names to tags.
Higher-level workflows:
This module is likely used in data ingestion or NLP pipelines where raw corporate names need to be cleaned, validated, and enriched with metadata before downstream processing such as entity recognition, knowledge graph construction, or analytics.

Visual Diagram: Class Diagram Representing Functions and Global Data

```mermaid
classDiagram
    class corporations {
        <<module>>
        +GOODS: pd.DataFrame
        +CORP_TKS: dict
        +GOOD_CORP: set
        +CORP_TAG: dict
        +baike(cid, default_v=0) int|float
        +corpNorm(nm, add_region=True) str
        +rmNoise(n) str
        +is_good(nm) bool
        +corp_tag(nm) list
    }
```

Summary

The corporations.py module is a focused utility for corporate name normalization, validation, and tagging leveraging NLP techniques, region heuristics, and preloaded datasets. It provides streamlined interfaces for other components in the InfiniFlow system to clean and enrich corporate names for downstream applications.