corporations.py


Overview

The corporations.py file provides utilities for normalizing, tagging, and validating corporate names within the InfiniFlow system. It cleans raw corporate name inputs, reduces them to a normalized form, and associates them with metadata such as tags or quality indicators drawn from preloaded datasets. Natural language processing (NLP) techniques such as tokenization, together with region-based heuristics, handle variations in corporate names, especially Chinese ones.

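Key functionalities include corporate name normalization (corpNorm), noise removal (rmNoise), trusted-name checks (is_good), tag lookup (corp_tag), and per-ID length lookup (baike).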


Detailed Explanation of Classes, Functions, and Methods

This file contains only functions and global data structures; no classes are defined.

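Global Data Structures

The module diagram at the end of this document lists four module-level objects: GOODS (pd.DataFrame, the dataset read by baike), CORP_TKS (dict), GOOD_CORP (set, consulted by is_good), and CORP_TAG (dict, consulted by corp_tag).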


Functions

baike(cid, default_v=0) -> int or float

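Returns the length value associated with a corporate ID in the GOODS dataset. The default_v parameter (0 unless overridden) is the fallback value, presumably returned when the ID has no entry.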
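
The lookup-with-fallback pattern described above can be sketched as follows. This is a minimal illustration, not the module's code: GOODS_SKETCH and baike_sketch are hypothetical names, a plain dict stands in for the pandas DataFrame to keep the example dependency-free, and the sample values are made up.

```python
# Hypothetical stand-in for GOODS. The real module holds a pandas DataFrame;
# a plain dict keyed by corporate ID is assumed here instead.
GOODS_SKETCH = {"1001": 5230, "1002": 48.5}


def baike_sketch(cid, default_v=0):
    """Return the stored length for cid, or default_v when cid is unknown."""
    # IDs are coerced to str so that 1001 and "1001" hit the same entry
    # (an assumption about how IDs are keyed).
    return GOODS_SKETCH.get(str(cid), default_v)
```

Calling baike_sketch(1001) returns the stored value, while an unknown ID yields default_v, so callers never have to handle a missing-key error themselves.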


corpNorm(nm, add_region=True) -> str

Normalizes a corporate name string by cleaning it, tokenizing it, and removing irrelevant tokens. The add_region flag (True by default) controls whether region information is included in the result.
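
A rough sketch of such a clean, tokenize, and filter pipeline is shown below. Everything in it is an assumption made for illustration: the function name corp_norm_sketch, the REGION_NAMES and STOP_TOKENS tables, the legal-suffix pattern, whitespace tokenization, and the "name(region)" output format are not taken from the module.

```python
import re

# Hypothetical lookup tables; the real module loads its own datasets.
REGION_NAMES = {"beijing", "shanghai", "shenzhen"}
STOP_TOKENS = {"technology", "group", "holdings"}
# Assumed pattern for trailing legal forms such as "Co. Ltd" or "Inc.".
LEGAL_SUFFIX = re.compile(r"\s+(co\.|corp\.|inc\.|ltd\.?).*$", re.IGNORECASE)


def corp_norm_sketch(nm, add_region=True):
    """Lowercase, strip brackets and legal suffixes, drop stop and region tokens."""
    if not nm or not isinstance(nm, str):
        return ""
    nm = nm.lower()
    # Collapse brackets and whitespace runs into single spaces.
    nm = re.sub(r"[()\s]+", " ", nm).strip()
    # Cut the name at the first legal-form suffix.
    nm = LEGAL_SUFFIX.sub("", nm)
    tokens = nm.split()  # whitespace tokenization stands in for a real tokenizer
    regions = [t for t in tokens if t in REGION_NAMES]
    kept = [t for t in tokens if t not in REGION_NAMES and t not in STOP_TOKENS]
    core = " ".join(kept)
    # Optionally append the first detected region in parentheses.
    if add_region and regions:
        core += "(%s)" % regions[0]
    return core
```

With these tables, "Acme Technology (Beijing) Co. Ltd" reduces to "acme(beijing)", or to "acme" when add_region is False. Chinese names would need a proper word segmenter in place of str.split.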


rmNoise(n) -> str

Removes noise characters and substrings commonly found in corporate names.
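
As an illustration, noise removal of this kind can be done with two regular-expression passes. The sketch below assumes that "noise" means parenthesized asides plus punctuation and whitespace; the exact character set used by rmNoise is not given in this document, and rm_noise_sketch is a hypothetical name.

```python
import re


def rm_noise_sketch(n):
    """Strip parenthesized asides, then leftover punctuation and spaces."""
    # Pass 1: drop bracketed substrings, with ASCII or full-width brackets.
    n = re.sub(r"[(（][^()（）]+[)）]", "", n)
    # Pass 2: drop commas, periods, ampersands, whitespace, and stray brackets.
    n = re.sub(r"[,.&\s()（）]+", "", n)
    return n
```

Running the bracket pass first matters: if punctuation were stripped first, the brackets would be gone and the text inside them would be merged into the name instead of being discarded.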


is_good(nm) -> bool

Determines if a corporate name is considered "good" (trusted or valid) based on the GOOD_CORP set.
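
One plausible way to implement such a check is to normalize both the candidate and the reference names, then compare them. The sketch below is an assumption throughout: _normalize is a hypothetical stand-in for the rmNoise and corpNorm pipeline, GOOD_CORP_SKETCH holds made-up entries, and the matching policy (exact match for ASCII names, substring match otherwise) is a guess rather than something this document confirms.

```python
import re


def _normalize(nm):
    """Hypothetical stand-in for cleaning: lowercase, keep only a-z, 0-9, CJK."""
    return re.sub(r"[^0-9a-z\u4e00-\u9fff]+", "", nm.lower())


# Made-up reference names, normalized once at load time.
GOOD_CORP_SKETCH = {_normalize(c) for c in ["Acme", "华为"]}


def is_good_sketch(nm):
    """Return True when the normalized name matches a reference name."""
    nm = _normalize(nm)
    for good in GOOD_CORP_SKETCH:
        if re.fullmatch(r"[0-9a-z]+", good):
            # Short ASCII names are compared exactly to avoid accidental hits.
            if good == nm:
                return True
        elif good in nm:
            # Non-ASCII names match when they occur anywhere in the input.
            return True
    return False
```

Normalizing the reference set once, rather than on every call, keeps each check to a single pass over the set.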


corp_tag(nm) -> list

Retrieves tags associated with a corporate name based on the CORP_TAG dictionary.
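
A dictionary-backed tag lookup along these lines might look like the sketch below. The names corp_tag_sketch, CORP_TAG_SKETCH, and _normalize_key are hypothetical, the tag values are invented, and exact-key matching is assumed for simplicity; the real function may match names more loosely.

```python
import re


def _normalize_key(nm):
    """Hypothetical stand-in for cleaning: lowercase, keep only a-z, 0-9, CJK."""
    return re.sub(r"[^0-9a-z\u4e00-\u9fff]+", "", nm.lower())


# Made-up tag table, keyed by normalized corporate name.
CORP_TAG_SKETCH = {
    _normalize_key(name): tags
    for name, tags in {"Acme": ["fortune500"], "华为": ["tech", "hardware"]}.items()
}


def corp_tag_sketch(nm):
    """Return the tags for a corporate name, or an empty list when none exist."""
    # A copy is returned so callers cannot mutate the shared table.
    return list(CORP_TAG_SKETCH.get(_normalize_key(nm), []))
```

Returning an empty list instead of None lets callers iterate over the result without a separate missing-value check.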


Implementation Details and Algorithms


Interaction with Other Parts of the System


Visual Diagram: Class Diagram Representing Functions and Global Data

classDiagram
    class corporations {
        <<module>>
        +GOODS: pd.DataFrame
        +CORP_TKS: dict
        +GOOD_CORP: set
        +CORP_TAG: dict
        +baike(cid, default_v=0) int|float
        +corpNorm(nm, add_region=True) str
        +rmNoise(n) str
        +is_good(nm) bool
        +corp_tag(nm) list
    }

Summary

The corporations.py module is a focused utility for corporate name normalization, validation, and tagging that relies on NLP techniques, region heuristics, and preloaded datasets. It gives other components in the InfiniFlow system a small set of functions for cleaning and enriching corporate names for downstream applications.