term_weight.py
Overview
The term_weight.py file provides utilities for tokenizing, processing, and weighting text terms primarily for natural language processing (NLP) tasks within the InfiniFlow project. Its main functionality revolves around handling Chinese and mixed-language text inputs, filtering stop words, merging tokens intelligently, and computing term weights based on frequency, named entity recognition (NER) tags, part-of-speech (POS) tags, and inverse document frequency (IDF) metrics.
This file implements a core class, Dealer, which encapsulates the logic for preprocessing text, managing stop words, handling NER data, and calculating term weights useful for downstream NLP applications such as information retrieval, question answering, or knowledge graph construction.
Classes and Methods
Class: Dealer
The Dealer class centralizes token processing and weight computation functionalities.
Initialization: __init__(self)
Loads stop words into a set to filter out common uninformative terms.
Loads NER (named entity recognition) data from a JSON file (ner.json).
Loads term frequency data from term.freq.
Handles exceptions gracefully with logging warnings if files are missing or unreadable.
Implementation details:
Stop words are hardcoded as a list of Chinese function words, question words, and symbols.
NER and frequency dictionaries are loaded from project resources located relative to the project base directory.
Frequency data can be either a dictionary of term frequencies or a set of terms.
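The load-with-fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper name load_json_resource and the resources directory are hypothetical, and the real paths are derived from the project base directory.

```python
import json
import logging
import os


def load_json_resource(path):
    """Return the parsed JSON at `path`, or an empty dict if it cannot be read."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError) as exc:
        # Missing or malformed files are logged as warnings, not raised.
        logging.warning("Could not load %s: %s", path, exc)
        return {}


# Hypothetical location, used only for illustration.
resource_dir = os.path.join(os.getcwd(), "resources")
ner_table = load_json_resource(os.path.join(resource_dir, "ner.json"))
```

Falling back to an empty table keeps the class usable when a resource is absent, at the cost of less informative weights.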
Method: pretoken(self, txt: str, num: bool=False, stpwd: bool=True) -> list
Tokenizes the input text txt using the external rag_tokenizer, applies regex-based filtering to replace unwanted characters, removes stop words, and optionally filters out digits.
Parameters:
txt (str): The input text string to tokenize.
num (bool): Whether to allow numeric tokens (False by default).
stpwd (bool): Whether to remove stop words (True by default).
Returns:
List of tokens (strings) after filtering and tokenization.
Example usage:
dealer = Dealer()
tokens = dealer.pretoken("请问你是谁?", num=False, stpwd=True)
print(tokens) # Output: filtered tokens without stop words and digits
Notes:
Characters matching a specified set of punctuations and symbols are replaced by # and filtered out.
Tokens matching stop words or digits (when num=False) are excluded.
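The filtering rules in the notes above can be sketched on a list of already-split tokens. This is a simplified stand-in that does not call rag_tokenizer; the function name filter_tokens, the stop-word subset, the symbol set, and the "whole token is digits" test are all assumptions made for illustration.

```python
import re

# Illustrative subsets only; the real stop-word list and symbol set are larger.
STOP_WORDS = {"请问", "你", "是", "的"}
SYMBOLS = re.compile(r"[?？!！,，。:：;；]")


def filter_tokens(tokens, num=False, stpwd=True):
    """Apply the symbol, stop-word, and digit filters to pre-split tokens."""
    kept = []
    for tok in tokens:
        tok = SYMBOLS.sub("#", tok)      # unwanted symbols become '#'
        if not tok.strip("#"):           # only placeholders left: drop
            continue
        if stpwd and tok in STOP_WORDS:  # stop-word filter (optional)
            continue
        if not num and re.fullmatch(r"[0-9]+", tok):  # digit filter (optional)
            continue
        kept.append(tok)
    return kept


print(filter_tokens(["请问", "你", "是", "谁", "?"]))   # ['谁']
print(filter_tokens(["版本", "3", "发布"], num=True))  # ['版本', '3', '发布']
```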
Method: tokenMerge(self, tks: list) -> list
Merges adjacent tokens based on heuristic rules to form multi-word terms.
Parameters:
tks (list): List of tokens to merge.
Returns:
List of merged tokens.
Details:
Single-character or 1-2 character alphanumeric tokens may be merged if adjacent to certain patterns.
Merges sequences of tokens not in stop words and matching the oneTerm condition.
Caps merged token length to avoid overly long sequences.
Example:
merged_tokens = dealer.tokenMerge(["多", "工位"])
print(merged_tokens) # Output: ['多 工位']
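As a rough illustration of run-based merging with a length cap, the sketch below joins adjacent short tokens. It deliberately implements only one simplified rule and does not reproduce the full heuristics of tokenMerge (so it will not reproduce the example above); the names one_term and merge_short_runs, the short-token test, and the max_run cap are assumptions.

```python
import re

STOP_WORDS = {"的", "是"}  # illustrative subset


def one_term(tok):
    """Assumed short-token test: one character, or 1-2 ASCII letters/digits."""
    return len(tok) == 1 or re.fullmatch(r"[0-9a-z]{1,2}", tok) is not None


def merge_short_runs(tokens, max_run=4):
    """Join runs of adjacent short, non-stop-word tokens with spaces."""
    merged, i = [], 0
    while i < len(tokens):
        j = i
        # Extend the run while tokens stay short, are not stop words,
        # and the run has not reached the cap.
        while (j < len(tokens) and j - i < max_run
               and one_term(tokens[j]) and tokens[j] not in STOP_WORDS):
            j += 1
        if j - i > 1:
            merged.append(" ".join(tokens[i:j]))  # run of 2..max_run tokens
            i = j
        else:
            merged.append(tokens[i])  # nothing to merge at this position
            i += 1
    return merged


print(merge_short_runs(["a", "b", "机器学习", "c"]))  # ['a b', '机器学习', 'c']
```

Capping the run length keeps a long sequence of fragments from collapsing into a single oversized term.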
Method: ner(self, t: str) -> str
Retrieves the NER tag for a given token.
Parameters:
t (str): The token to query.
Returns:
The NER tag as a string if found, else empty string.
Method: split(self, txt: str) -> list
Splits input text into tokens with logic to merge adjacent alphabetic tokens unless they are tagged as "func" entities.
Parameters:
txt (str): Input text string.
Returns:
List of tokens after splitting and selective merging.
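The selective merging described for split can be sketched as below. This is an approximation under stated assumptions: "alphabetic" is taken to mean "ends in an ASCII letter", the function name split_and_merge is hypothetical, and the NER table is a made-up stand-in for the data loaded from ner.json.

```python
import re

# Hypothetical NER table; in the module it is loaded from ner.json.
NER = {"and": "func", "of": "func"}


def split_and_merge(txt, ner=NER):
    """Split on whitespace, joining adjacent tokens that both end in a letter."""
    tokens = []
    for tok in re.sub(r"[ \t]+", " ", txt).split():
        mergeable = (
            tokens
            and re.search(r"[a-zA-Z]$", tokens[-1])
            and re.search(r"[a-zA-Z]$", tok)
            and ner.get(tok, "") != "func"         # current token is not a function word
            and ner.get(tokens[-1], "") != "func"  # neither is the previous one
        )
        if mergeable:
            tokens[-1] = tokens[-1] + " " + tok
        else:
            tokens.append(tok)
    return tokens


print(split_and_merge("deep learning 模型"))  # ['deep learning', '模型']
print(split_and_merge("state of art"))        # ['state', 'of', 'art']
```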
Method: weights(self, tks: list, preprocess: bool=True) -> list
Computes normalized weights for a list of tokens based on their frequency, NER tags, POS tags, and IDF scores.
Parameters:
tks (list): List of tokens (strings) to weight.
preprocess (bool): Whether to preprocess tokens using pretoken and tokenMerge (default True).
Returns:
List of tuples (token, normalized_weight) where weights sum to 1.
Details:
Uses several internal helper functions to compute:
ner(t): Weight multiplier based on NER tag.
postag(t): POS tag weight multiplier using rag_tokenizer.tag.
freq(t): Frequency of token in corpus or fine-grained tokenization.
df(t): Document frequency from loaded data or heuristics.
idf(s, N): Inverse document frequency calculation.
Weight calculation formula:
\[
\text{weight} = \bigl(0.3 \times \mathrm{idf}(\mathrm{freq}) + 0.7 \times \mathrm{idf}(\mathrm{df})\bigr) \times \mathrm{ner} \times \mathrm{postag}
\]
When preprocess=True, tokens are first split and merged before weighting.
Example usage:
weights = dealer.weights(["机器学习", "算法"], preprocess=True)
for token, weight in weights:
print(f"{token}: {weight:.4f}")
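The weighting formula and the final normalization can be sketched with toy numbers. Everything concrete here is an assumption for illustration: the smoothed form of idf, the corpus-size constant TOTAL, the frequency tables, and the function name weigh. The NER and POS multipliers are passed in as plain dictionaries instead of being derived from ner.json and rag_tokenizer.

```python
import math

# Toy statistics, made up for illustration; the module reads real ones from term.freq.
FREQ = {"机器学习": 500, "算法": 20000}
DOC_FREQ = {"机器学习": 300, "算法": 9000}
TOTAL = 10_000_000  # assumed corpus-size constant


def idf(s, n):
    """Assumed smoothed IDF: larger for rarer terms, always positive."""
    return math.log10(10 + (n - s + 0.5) / (s + 0.5))


def weigh(tokens, ner_mult=None, pos_mult=None):
    """Return (token, weight) pairs whose weights sum to 1."""
    ner_mult = ner_mult or {}
    pos_mult = pos_mult or {}
    raw = []
    for tok in tokens:
        # 0.3 * idf(freq) + 0.7 * idf(df), scaled by the NER and POS multipliers.
        base = (0.3 * idf(FREQ.get(tok, 1), TOTAL)
                + 0.7 * idf(DOC_FREQ.get(tok, 1), TOTAL))
        raw.append(base * ner_mult.get(tok, 1.0) * pos_mult.get(tok, 1.0))
    total = sum(raw)
    return [(tok, w / total) for tok, w in zip(tokens, raw)]


for token, weight in weigh(["机器学习", "算法"]):
    print(f"{token}: {weight:.4f}")
```

With these toy statistics the rarer term receives the larger share, and dividing by the total makes the weights comparable across inputs of different lengths.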
Important Implementation Details
The file uses an external tokenizer, rag_tokenizer, for tokenization and POS tagging, which is assumed to be a high-quality tokenizer for Chinese and mixed-language text.
NER and frequency data are loaded from external resource files relative to the project base directory; the stop-word list is hardcoded.
The weighting scheme integrates multiple linguistic signals: NER categories, POS tags, token frequency, and document frequency, combined with a custom IDF-like function to emphasize rarer terms.
Token merging heuristics are designed to handle short tokens and prevent fragmentation of meaningful multi-word terms.
The weights method normalizes the final weights so the sum equals 1, suitable for probabilistic or importance ranking applications.
Interaction with Other System Components
Depends on:
rag_tokenizer module from the rag.nlp package for tokenization and POS tagging.
api.utils.file_utils.get_project_base_directory for locating resource files.
External resource files:
ner.json with NER tag mappings.
term.freq with term frequency statistics.
The weighted tokens produced by Dealer.weights can be used by downstream components such as:
Information retrieval modules (for ranking query terms).
Knowledge graph or entity linking pipelines (leveraging NER tags).
Text summarization or keyword extraction services.
Logging is used for error reporting during resource loading.
Visual Diagram
classDiagram
class Dealer {
-stop_words: set
-ne: dict
-df: dict
+__init__()
+pretoken(txt: str, num: bool=False, stpwd: bool=True) list
+tokenMerge(tks: list) list
+ner(t: str) str
+split(txt: str) list
+weights(tks: list, preprocess: bool=True) list
}
Summary
The term_weight.py file is a utility module focused on advanced token processing and weighting tailored for Chinese and mixed-language NLP tasks. Its Dealer class encapsulates the logic for stop word filtering, token merging, NER integration, and a multifactor weighting scheme that combines frequency, POS, and entity information. This component is critical for generating meaningful, weighted representations of text terms to enhance the performance of downstream NLP components in the InfiniFlow system.