corporations.py
Overview
The corporations.py file provides utilities for normalizing, tagging, and validating corporate names within the InfiniFlow system. It primarily focuses on cleaning up raw corporate name inputs, standardizing them into a normalized form, and associating them with metadata such as tags or quality indicators based on preloaded datasets. The file integrates natural language processing (NLP) techniques and region-based heuristics to handle variations in corporate names, especially those in Chinese.
Key functionalities include:
Loading and managing auxiliary data resources related to corporations.
Normalizing corporate names by removing noise, standardizing characters, and filtering irrelevant tokens.
Determining if a corporate name is considered "good" (trusted or valid).
Tagging corporate names with associated metadata.
Providing length lookup for corporate entries from an external dataset.
Detailed Explanation of Classes, Functions, and Methods
This file contains only functions and global data structures; no classes are defined.
Global Data Structures
GOODS (pandas.DataFrame): Loaded from res/corp_baike_len.csv and indexed by corporate ID (cid); contains length-related metadata for corporations. Missing values are filled with 0.
CORP_TKS (dict): Token frequency dictionary loaded from res/corp.tks.freq.json, used as a filter to remove common or irrelevant tokens during normalization.
GOOD_CORP (set): Set of normalized "good" corporate names loaded from res/good_corp.json and normalized via corpNorm and rmNoise.
CORP_TAG (dict): Dictionary mapping normalized corporate names to their associated tags, loaded from res/corp_tag.json and normalized similarly.
Functions
baike(cid, default_v=0) -> int or float
Returns the length value associated with a corporate ID from the GOODS dataset.
Parameters:
cid (str or int): Corporate ID to look up.
default_v (int or float, optional): Default return value if cid is not found in GOODS. Defaults to 0.
Returns:
The length (int or float) associated with the corporate ID, or default_v if not found.
Usage Example:
length = baike("12345")
Details:
Uses pandas .loc to access the row by corporate ID and retrieve the "len" column. Returns the default if the lookup fails.
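The lookup-with-fallback pattern described above can be sketched without pandas. This is a minimal illustration, not the module's code: a plain dict stands in for the GOODS DataFrame, the sample row is hypothetical, and coercing cid to str is a choice made for this sketch only.

```python
# Hypothetical rows keyed by corporate ID. The real module reads these
# from res/corp_baike_len.csv into a pandas DataFrame instead.
LEN_BY_CID = {"12345": {"len": 87}}


def baike_sketch(cid, default_v=0):
    """Return the stored length for cid, or default_v when cid is unknown."""
    try:
        # Mirrors a row-then-column access such as GOODS.loc[cid, "len"].
        return LEN_BY_CID[str(cid)]["len"]
    except KeyError:
        # Unknown IDs fall back to the caller-supplied default.
        return default_v
```

Passing an explicit default_v lets callers tell "unknown ID" apart from a stored length of 0, which matters because missing values in the dataset are also filled with 0.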
corpNorm(nm, add_region=True) -> str
Normalizes a corporate name string by cleaning, tokenizing, and removing irrelevant words or tokens.
Parameters:
nm (str): Raw corporate name to normalize.
add_region (bool, optional): Whether to append a detected region name (in parentheses) to the normalized string. Defaults to True.
Returns:
Normalized corporate name string.
Usage Example:
normalized_name = corpNorm("北京市华为技术有限公司") # Output might be: "华为(北京)"
Implementation Details:
Converts traditional Chinese characters to simplified and converts full-width characters to half-width.
Lowercases the string.
Removes HTML escape sequences such as &amp;.
Removes punctuation, corporate suffixes (e.g., Ltd., Inc.), and generic company descriptors (e.g., 技术, 有限公司).
Uses a tokenizer (rag_tokenizer.tokenize) to split the name into tokens.
Removes tokens that are region names or common corporate tokens (CORP_TKS).
Attempts to detect region names and appends them if add_region is True.
Uses regex patterns to trim certain trailing or leading characters.
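The character-level steps can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in: it uses unicodedata's NFKC normalization in place of the rag_tokenizer helpers, omits traditional-to-simplified conversion (which needs a conversion table), and uses a short hypothetical suffix pattern rather than the module's actual regexes.

```python
import html
import re
import unicodedata

# Hypothetical suffix pattern for illustration; the module's real patterns may differ.
_SUFFIX_RE = re.compile(
    r"(技术有限公司|有限公司|公司|\s+co\..*|\s+inc\..*|\s+ltd.*)$",
    re.IGNORECASE,
)


def normalize_chars(name: str) -> str:
    """Approximate the character-level cleanup of a corporate name."""
    # NFKC folds full-width letters, digits, and punctuation to half-width.
    name = unicodedata.normalize("NFKC", name)
    # Decode HTML escape sequences such as &amp; and lowercase the result.
    name = html.unescape(name).lower()
    # Collapse bracket-like punctuation and whitespace runs into one space.
    name = re.sub(r"[()\[\]【】+*\s]+", " ", name).strip()
    # Strip one trailing corporate suffix or generic descriptor.
    return _SUFFIX_RE.sub("", name).strip()
```

For example, normalize_chars("华为技术有限公司") yields "华为", and a full-width "ＡＣＭＥ　Ｉｎｃ．" reduces to "acme".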
rmNoise(n) -> str
Removes noise characters and substrings commonly found in corporate names.
Parameters:
n (str): Corporate name string.
Returns:
String with noise characters removed.
Behavior:
Removes substrings within parentheses (both round () and Chinese （）).
Removes characters like commas, periods, spaces, ampersands, and brackets.
Usage Example:
clean_name = rmNoise("华为(中国)有限公司") # Output: "华为有限公司"
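A regex-based approximation of this behavior is shown below. It is a sketch built from the description above, not the module's source; the function name rm_noise_sketch and the exact character classes are assumptions.

```python
import re


def rm_noise_sketch(name: str) -> str:
    """Strip parenthesized spans and separator characters from a name."""
    # Drop any parenthesized span, in ASCII or full-width parentheses.
    name = re.sub(r"[(（][^()（）]*[)）]", "", name)
    # Drop leftover commas, periods, whitespace, ampersands, and stray parentheses.
    return re.sub(r"[,.\s&()（）]+", "", name)
```

Removing the parenthesized span first keeps a region qualifier such as (中国) from leaking into the cleaned name once the parentheses themselves are stripped.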
is_good(nm) -> bool
Determines if a corporate name is considered "good" (trusted or valid) based on the GOOD_CORP set.
Parameters:
nm (str): Corporate name string to check.
Returns:
True if the name is in the good corporate set; otherwise False.
Usage Example:
is_valid = is_good("华为技术有限公司") # True or False
Implementation Details:
Filters out names containing the substring "外派".
Removes noise and normalizes the name.
Checks if the normalized name matches or contains any entry in GOOD_CORP.
For alphanumeric-only entries in GOOD_CORP, an exact match is required. Otherwise, substring containment suffices.
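The matching rule can be sketched as follows. The sample set is toy data, and the function assumes its input has already been noise-stripped and normalized, so it illustrates only the exact-versus-substring decision rather than the full is_good call.

```python
import re

# Toy data for illustration only; the real set is loaded from res/good_corp.json.
GOOD_CORP_SAMPLE = {"华为", "ibm"}


def is_good_sketch(normalized_name: str, good=GOOD_CORP_SAMPLE) -> bool:
    """Apply the exact-or-substring rule to an already normalized name."""
    # Names mentioning 外派 are rejected outright.
    if "外派" in normalized_name:
        return False
    for entry in good:
        if re.fullmatch(r"[0-9a-zA-Z]+", entry):
            # Alphanumeric-only entries must match the whole name.
            if entry == normalized_name:
                return True
        elif entry in normalized_name:
            # Other entries match when they appear anywhere in the name.
            return True
    return False
```

Requiring an exact match for alphanumeric entries avoids accidental hits where a short Latin string such as "ibm" happens to appear inside an unrelated longer name.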
corp_tag(nm) -> list
Retrieves tags associated with a corporate name based on the CORP_TAG dictionary.
Parameters:
nm (str): Corporate name string.
Returns:
List of tags if found, otherwise an empty list [].
Usage Example:
tags = corp_tag("华为技术有限公司") # Example output: ["tech", "telecom"]
Implementation Details:
Removes noise and normalizes the name.
Iterates through keys in CORP_TAG.
Performs exact match for alphanumeric keys or substring match otherwise.
Skips matches where the tag is very short and the normalized name is significantly longer (to avoid false positives).
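A sketch of the tag lookup, including the short-key guard, is given below. The sample mapping is toy data, and the thresholds min_key_len and long_name_len are hypothetical parameters chosen for illustration, since this document does not state the actual cutoffs.

```python
import re

# Toy mapping for illustration only; the real one is loaded from res/corp_tag.json.
CORP_TAG_SAMPLE = {"华为": ["tech", "telecom"], "ibm": ["tech"]}


def corp_tag_sketch(normalized_name, tags=CORP_TAG_SAMPLE,
                    min_key_len=3, long_name_len=5):
    """Return the tags of the first matching key, or [] when nothing matches."""
    for key, value in tags.items():
        if re.fullmatch(r"[0-9a-zA-Z]+", key):
            # Alphanumeric keys require an exact match.
            if key == normalized_name:
                return value
        elif key in normalized_name:
            # Guard against false positives: ignore a very short key
            # found inside a much longer name (thresholds are assumed).
            if len(key) < min_key_len and len(normalized_name) > long_name_len:
                continue
            return value
    return []
```

With these assumed thresholds, the two-character key "华为" tags "华为终端" but is skipped for an eight-character name that merely contains it.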
Implementation Details and Algorithms
Normalization Pipeline:
Traditional to Simplified Chinese conversion and full-width to half-width character normalization using rag_tokenizer.
Removal of HTML entities and noisy punctuation.
Removal of common suffix patterns for corporate names (e.g., Inc., Ltd.).
Tokenization and filtering out tokens that represent region names or are part of a frequent corporate token list.
Region extraction from tokens and optional appending to normalized name.
Use of regex for additional trimming of trailing or leading character sequences.
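The token filtering and region steps of this pipeline can be illustrated on a pre-tokenized name. In the sketch below the token list is split by hand because rag_tokenizer is not available in a standalone snippet; REGION_NAMES and COMMON_TOKENS are hypothetical stand-ins for the regions module and CORP_TKS, and appending only the first detected region is an assumption.

```python
# Hypothetical stand-ins for the regions module and the CORP_TKS dictionary.
REGION_NAMES = {"北京", "上海", "深圳"}
COMMON_TOKENS = {"集团", "控股"}


def filter_tokens(tokens, add_region=True):
    """Drop region and common tokens; optionally append the first region found."""
    # Remember detected regions before they are filtered out.
    regions_found = [t for t in tokens if t in REGION_NAMES]
    # Keep only tokens that are neither regions nor common corporate words.
    kept = [t for t in tokens if t not in REGION_NAMES and t not in COMMON_TOKENS]
    name = "".join(kept)
    if add_region and regions_found:
        # Append the region in parentheses, e.g. "华为(北京)".
        name += "(%s)" % regions_found[0]
    return name
```

Calling filter_tokens(["北京", "华为", "集团"]) gives "华为(北京)", while add_region=False gives the bare "华为", which is the form suited to set and dictionary lookups.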
Data Loading:
All auxiliary data files (csv and json) are loaded once at module load time for efficiency.
Normalization of these datasets is done upfront to ensure quick matching during runtime.
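The load-once, normalize-upfront pattern might look like the sketch below. It parses an in-memory JSON string instead of reading from res/, and _toy_normalize is a hypothetical placeholder for the corpNorm and rmNoise combination, so only the overall shape is meant to carry over.

```python
import json
import re

# In-memory stand-in for a JSON resource file such as res/good_corp.json.
_RAW_GOOD_CORP = json.loads('["Acme, Inc.", "Globex (US)", "acme inc"]')


def _toy_normalize(name: str) -> str:
    """Hypothetical placeholder for the module's normalization functions."""
    # Remove parenthesized spans, then separators, then lowercase.
    name = re.sub(r"\([^()]*\)", "", name)
    return re.sub(r"[,.\s&]+", "", name).lower()


# Built once at import time, so each later lookup is a single set membership test.
GOOD_CORP_SKETCH = {_toy_normalize(n) for n in _RAW_GOOD_CORP}


def is_known(name: str) -> bool:
    """Normalize the query the same way as the stored entries, then look it up."""
    return _toy_normalize(name) in GOOD_CORP_SKETCH
```

Normalizing the stored entries with the same function used on queries also collapses spelling variants: the three raw entries above reduce to two distinct keys.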
Matching Logic:
Uses a combination of exact and substring matching with regex to verify corporate names against known good corporations and tags.
Uses heuristics to avoid false matches for very short tags.
Interaction with Other Parts of the System
rag.nlp.rag_tokenizer: Used heavily for text normalization and tokenization functions.
regions module: Used to verify if tokens represent valid region names, influencing normalization and tagging.
Resource files (in res/ directory):
corp_baike_len.csv: Provides length metadata for corporate IDs.
corp.tks.freq.json: Contains frequent corporate tokens to filter out during normalization.
good_corp.json: List of trusted corporate names.
corp_tag.json: Map of corporate names to tags.
Higher-level workflows:
This module is likely used in data ingestion or NLP pipelines where raw corporate names need to be cleaned, validated, and enriched with metadata before downstream processing such as entity recognition, knowledge graph construction, or analytics.
Visual Diagram: Class Diagram Representing Functions and Global Data
classDiagram
class corporations {
<<module>>
+GOODS: pd.DataFrame
+CORP_TKS: dict
+GOOD_CORP: set
+CORP_TAG: dict
+baike(cid, default_v=0) int|float
+corpNorm(nm, add_region=True) str
+rmNoise(n) str
+is_good(nm) bool
+corp_tag(nm) list
}
Summary
The corporations.py module is a focused utility for corporate name normalization, validation, and tagging. It combines NLP techniques, region heuristics, and preloaded datasets, and gives other components in the InfiniFlow system a small set of functions for cleaning and enriching corporate names for downstream applications.