term_weight.py

Overview

The term_weight.py file provides utilities for tokenizing, filtering, and weighting text terms for natural language processing (NLP) tasks within the InfiniFlow project. It targets Chinese and mixed-language input: it filters stop words, merges short adjacent tokens, and computes term weights from term frequency, named entity recognition (NER) tags, part-of-speech (POS) tags, and inverse document frequency (IDF).

This file implements a core class, Dealer, which encapsulates the logic for preprocessing text, managing stop words, handling NER data, and calculating term weights useful for downstream NLP applications such as information retrieval, question answering, or knowledge graph construction.

Classes and Methods

Class: `Dealer`

The Dealer class centralizes token processing and weight computation functionalities.

Initialization: `__init__(self)`

Loads stop words into a set to filter out common uninformative terms.
Loads NER (named entity recognition) data from a JSON file (ner.json).
Loads term frequency data from term.freq.
Logs a warning instead of raising an exception if either file is missing or unreadable.
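The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name `load_json_or_empty` and the example path are hypothetical, and returning an empty dictionary on failure is an assumption.

```python
import json
import logging
import os


def load_json_or_empty(path: str) -> dict:
    """Return the parsed JSON object at `path`, or {} if it cannot be read.

    Hypothetical helper: name and fallback value are assumptions.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except Exception:
        # A missing or malformed file is reported through logging, not raised.
        logging.warning("Failed to load %s", os.path.basename(path))
        return {}


# A path that does not exist yields an empty mapping plus a logged warning.
ne = load_json_or_empty(os.path.join("no", "such", "dir", "ner.json"))
print(ne)
```

With this pattern the object stays usable when a resource is absent; lookups against the empty mapping simply find nothing.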

Implementation details:

Stop words are hardcoded with a list of Chinese function words, question words, and symbols.
NER and frequency dictionaries are loaded from project resources located relative to the project base directory.
Frequency data can be either a dictionary of term frequencies or a set of terms.
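One way the "dictionary or set" behavior could arise is sketched below. The tab-separated `term<TAB>count` line format, the function name `parse_term_freq`, and the rule "fall back to a set when no counts are present" are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Union


def parse_term_freq(lines: Iterable[str]) -> Union[Dict[str, int], Set[str]]:
    """Parse 'term<TAB>count' lines (assumed format) into frequency data."""
    counts: Dict[str, int] = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if not parts[0]:
            continue  # skip blank lines
        # A line without a count column is recorded with a count of 0.
        counts[parts[0]] = int(parts[1]) if len(parts) > 1 else 0
    # With no usable counts at all, only term membership is meaningful,
    # so return a plain set of terms instead of a dictionary.
    if sum(counts.values()) == 0:
        return set(counts)
    return counts


with_counts = parse_term_freq(["算法\t120", "机器学习\t45"])
terms_only = parse_term_freq(["算法", "机器学习"])
print(type(with_counts).__name__, type(terms_only).__name__)
```

Callers then need to handle both shapes, for example by checking membership first and reading a count only when the data is a dictionary.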

Method: `pretoken(self, txt: str, num: bool=False, stpwd: bool=True) -> list`

Tokenizes the input text txt using the external rag_tokenizer, applies regex-based filtering to replace unwanted characters, removes stop words, and optionally filters out digits.

Parameters:

txt (str): The input text string to tokenize.
num (bool): Whether to keep numeric tokens (default False, which drops them).
stpwd (bool): Whether to remove stop words (True by default).

Returns:

List of tokens (strings) after filtering and tokenization.

Example usage:

from rag.nlp.term_weight import Dealer  # assumed module path within the project

dealer = Dealer()
tokens = dealer.pretoken("请问你是谁？", num=False, stpwd=True)
print(tokens)  # Output: filtered tokens without stop words and digits

Notes:

Characters matching a fixed set of punctuation marks and symbols are replaced by # and then dropped.
Tokens matching stop words or digits (when num=False) are excluded.
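The filtering rules in these notes can be sketched on an already tokenized list. This is an illustration under stated assumptions: `filter_tokens`, `STOP_WORDS`, and `PUNCT` are hypothetical names, the stop word and punctuation sets are small illustrative subsets, and treating any all-digit token as numeric is an assumption. The real method obtains its tokens from rag_tokenizer, which is not used here.

```python
import re
from typing import List, Set

# Illustrative subsets only; the module's real lists are longer.
STOP_WORDS: Set[str] = {"请问", "你", "是", "的", "吗"}
PUNCT = re.compile(r"[,.?!;:，。？！；：、()（）]")


def filter_tokens(tokens: List[str], num: bool = False, stpwd: bool = True) -> List[str]:
    """Drop stop words, numeric tokens (unless num=True), and punctuation."""
    kept: List[str] = []
    for tok in tokens:
        if stpwd and tok in STOP_WORDS:
            continue
        if not num and re.fullmatch(r"[0-9]+", tok):
            continue
        if PUNCT.match(tok):
            tok = "#"  # punctuation is mapped to a placeholder
        if tok and tok != "#":  # the placeholder is then discarded
            kept.append(tok)
    return kept


print(filter_tokens(["请问", "你", "是", "谁", "？"]))
print(filter_tokens(["型号", "3", "，", "电池"], num=True))
```

Mapping punctuation to a single placeholder before dropping it keeps the filter to one final check per token.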

Method: `tokenMerge(self, tks: list) -> list`

Merges adjacent tokens based on heuristic rules to form multi-word terms.

Parameters:

tks (list): List of tokens to merge.

Returns:

List of merged tokens.

Details:

Single-character or 1-2 character alphanumeric tokens may be merged if adjacent to certain patterns.
Merges sequences of tokens not in stop words and matching the oneTerm condition.
Caps merged token length to avoid overly long sequences.

Example:

merged_tokens = dealer.tokenMerge(["多", "工位"])
print(merged_tokens)  # Output: ['多 工位']
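The merging heuristics can be sketched as below. This is a hedged reading of the rules listed under Details, not the module's code: `is_short` is a stand-in for the oneTerm condition, `merge_tokens` and `max_run` are hypothetical names, the cap of four tokens with a fallback to pairs is an assumption, and the leading-token rule is an assumed interpretation of "adjacent to certain patterns" chosen so that the example above is reproduced.

```python
import re
from typing import List, Set


def is_short(tok: str) -> bool:
    """Stand-in for oneTerm: one character, or 1-2 lowercase alphanumerics."""
    return len(tok) == 1 or re.fullmatch(r"[0-9a-z]{1,2}", tok) is not None


def merge_tokens(tokens: List[str], stop_words: Set[str], max_run: int = 4) -> List[str]:
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Assumed leading rule: a short first token attaches to a longer
        # following token that does not start with an ASCII letter or digit.
        if (i == 0 and len(tokens) > 1 and is_short(tokens[0])
                and len(tokens[1]) > 1
                and not re.match(r"[0-9a-zA-Z]", tokens[1])):
            merged.append(" ".join(tokens[:2]))
            i = 2
            continue
        # Find the run of consecutive short, non-stop-word tokens starting at i.
        j = i
        while j < len(tokens) and tokens[j] not in stop_words and is_short(tokens[j]):
            j += 1
        run = j - i
        if run > 1:
            # Long runs are consumed two tokens at a time to cap term length.
            take = run if run <= max_run else 2
            merged.append(" ".join(tokens[i:i + take]))
            i += take
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_tokens(["多", "工位"], set()))
print(merge_tokens(["机器学习", "a", "b", "算法"], set()))
```

Joining with a space keeps the original tokens recoverable from a merged term.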

Method: `ner(self, t: str) -> str`

Retrieves the NER tag for a given token.

Parameters:

t (str): The token to query.

Returns:

The NER tag as a string if found, otherwise an empty string.
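The lookup amounts to a dictionary access with an empty-string default. A minimal sketch, assuming the NER data is a flat token-to-tag mapping; the function name `lookup_ner` and the sample entry (including the tag value "loca") are hypothetical.

```python
from typing import Dict


def lookup_ner(ne: Dict[str, str], token: str) -> str:
    """Return the tag stored for `token`, or "" when the token is unknown."""
    return ne.get(token, "")


sample_ne = {"北京": "loca"}  # hypothetical entry for illustration
print(repr(lookup_ner(sample_ne, "北京")))
print(repr(lookup_ner(sample_ne, "算法")))
```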

Method: `split(self, txt: str) -> list`

Splits input text into tokens with logic to merge adjacent alphabetic tokens unless they are tagged as "func" entities.

Parameters:

txt (str): Input text string.

Returns:

List of tokens after splitting and selective merging.
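The selective merging can be sketched as follows. Assumptions: the text is split on spaces and tabs, "alphabetic" means the token ends in an ASCII letter, and the NER data is a token-to-tag mapping passed in explicitly. The function name `split_text` is hypothetical.

```python
import re
from typing import Dict, List


def split_text(txt: str, ne: Dict[str, str]) -> List[str]:
    """Split on whitespace, gluing neighboring alphabetic tokens together
    unless either one is tagged "func"."""
    tokens: List[str] = []
    for tok in re.sub(r"[ \t]+", " ", txt).split():
        ends_alpha = re.search(r"[a-zA-Z]$", tok) is not None
        prev_ends_alpha = bool(tokens) and re.search(r"[a-zA-Z]$", tokens[-1]) is not None
        if (prev_ends_alpha and ends_alpha
                and ne.get(tok, "") != "func"
                and ne.get(tokens[-1], "") != "func"):
            tokens[-1] = tokens[-1] + " " + tok  # extend the previous phrase
        else:
            tokens.append(tok)
    return tokens


print(split_text("machine learning 算法", {}))
print(split_text("select count users", {"count": "func"}))
```

The effect is that English phrases such as "machine learning" stay together as one unit, while tokens tagged "func" remain standalone.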

Method: `weights(self, tks: list, preprocess: bool=True) -> list`

Computes normalized weights for a list of tokens based on their frequency, NER tags, POS tags, and IDF scores.

Parameters:

tks (list): List of tokens (strings) to weight.
preprocess (bool): Whether to preprocess tokens using pretoken and tokenMerge (default True).

Returns:

List of tuples (token, normalized_weight) where weights sum to 1.

Details:

Uses several internal helper functions to compute:
- ner(t): Weight multiplier based on NER tag.
- postag(t): POS tag weight multiplier using rag_tokenizer.tag.
- freq(t): Frequency of token in corpus or fine-grained tokenization.
- df(t): Document frequency from loaded data or heuristics.
- idf(s, N): Inverse document frequency calculation.
Weight calculation formula:
\[
\text{weight}(t) = \bigl(0.3 \times \text{idf}(\text{freq}(t)) + 0.7 \times \text{idf}(\text{df}(t))\bigr) \times \bigl(\text{ner}(t) \times \text{postag}(t)\bigr)
\]
When preprocess=True, each input is first passed through pretoken and tokenMerge, and the resulting tokens are weighted.

Example usage:

weights = dealer.weights(["机器学习", "算法"], preprocess=True)
for token, weight in weights:
    print(f"{token}: {weight:.4f}")
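The combination and normalization steps can be sketched in isolation. This is an illustration under stated assumptions, not the module's implementation: the smoothed form of `idf`, the corpus sizes `n_freq` and `n_df`, the function name `combine_weights`, and the toy statistics are all hypothetical. Only the 0.3/0.7 mix, the NER and POS multipliers, and the normalization to a sum of 1 come from the description above.

```python
import math
from typing import Callable, Dict, List, Tuple


def idf(s: float, n: float) -> float:
    """Smoothed IDF-like score (assumed form); rarer terms score higher."""
    return math.log10(10 + (n - s + 0.5) / (s + 0.5))


def combine_weights(
    tokens: List[str],
    freq: Callable[[str], float],
    df: Callable[[str], float],
    ner_mult: Callable[[str], float],
    pos_mult: Callable[[str], float],
    n_freq: float = 1e7,  # assumed corpus size for term frequency
    n_df: float = 1e9,    # assumed corpus size for document frequency
) -> List[Tuple[str, float]]:
    raw = [
        (0.3 * idf(freq(t), n_freq) + 0.7 * idf(df(t), n_df)) * ner_mult(t) * pos_mult(t)
        for t in tokens
    ]
    total = sum(raw)
    # Normalize so the returned weights sum to 1.
    return [(t, w / total) for t, w in zip(tokens, raw)]


# Toy statistics, invented for the example.
TOY_FREQ: Dict[str, float] = {"机器学习": 1_000, "算法": 100_000}
TOY_DF: Dict[str, float] = {"机器学习": 5_000, "算法": 500_000}

result = combine_weights(
    ["机器学习", "算法"],
    freq=TOY_FREQ.__getitem__,
    df=TOY_DF.__getitem__,
    ner_mult=lambda t: 1.0,
    pos_mult=lambda t: 1.0,
)
for token, weight in result:
    print(f"{token}: {weight:.4f}")
```

With equal multipliers, the rarer term receives the larger share of the total weight, which is the intended effect of the IDF-like terms.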

Important Implementation Details

The file relies on the external rag_tokenizer for tokenization and POS tagging of Chinese and mixed-language text.
NER and frequency data are loaded from external resource files relative to the project base directory; the stop word list is hardcoded in the class.
The weighting scheme integrates multiple linguistic signals: NER categories, POS tags, token frequency, and document frequency, combined with a custom IDF-like function to emphasize rarer terms.
Token merging heuristics are designed to handle short tokens and prevent fragmentation of meaningful multi-word terms.
The weights method normalizes the final weights so the sum equals 1, suitable for probabilistic or importance ranking applications.

Interaction with Other System Components

Depends on:
- rag_tokenizer module from the rag.nlp package for tokenization and POS tagging.
- api.utils.file_utils.get_project_base_directory for locating resource files.
- External resource files:
  - ner.json with NER tag mappings.
  - term.freq with term frequency statistics.
The weighted tokens produced by Dealer.weights can be used by downstream components such as:
- Information retrieval modules (for ranking query terms).
- Knowledge graph or entity linking pipelines (leveraging NER tags).
- Text summarization or keyword extraction services.
Logging is used for error reporting during resource loading.

Visual Diagram

classDiagram
    class Dealer {
        -stop_words: set
        -ne: dict
        -df: dict
        +__init__()
        +pretoken(txt: str, num: bool=False, stpwd: bool=True) list
        +tokenMerge(tks: list) list
        +ner(t: str) str
        +split(txt: str) list
        +weights(tks: list, preprocess: bool=True) list
    }

Summary

The term_weight.py file is a utility module for token processing and weighting, tailored to Chinese and mixed-language NLP tasks. Its Dealer class encapsulates stop word filtering, token merging, NER lookup, and a multifactor weighting scheme that combines frequency, POS, and entity information. Downstream NLP components in the InfiniFlow system consume the resulting weighted terms.
