term_weight.py


Overview

The term_weight.py file provides utilities for tokenizing, processing, and weighting text terms for natural language processing (NLP) tasks within the InfiniFlow project. It handles Chinese and mixed-language input, filters stop words, merges adjacent tokens heuristically, and computes term weights from frequency, named entity recognition (NER) tags, part-of-speech (POS) tags, and inverse document frequency (IDF).

This file implements a core class, Dealer, which encapsulates the logic for preprocessing text, managing stop words, handling NER data, and calculating term weights useful for downstream NLP applications such as information retrieval, question answering, or knowledge graph construction.


Classes and Methods

Class: Dealer

The Dealer class centralizes token processing and weight computation functionalities.

Initialization: __init__(self)

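Implementation details:

As the class diagram in this document shows, a Dealer instance holds three attributes: stop_words (set), ne (dict), and df (dict).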


Method: pretoken(self, txt: str, num: bool=False, stpwd: bool=True) -> list

Tokenizes the input text txt using the external rag_tokenizer, applies regex-based filtering to replace unwanted characters, removes stop words, and optionally filters out digits.

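Parameters:

txt (str): The input text to tokenize.
num (bool, default False): Whether to keep digit tokens. When False, they are filtered out.
stpwd (bool, default True): Whether to remove stop words.

Returns:

list: The tokens that remain after filtering.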

Example usage:

from rag.nlp.term_weight import Dealer  # import path assumed; adjust to your installation

dealer = Dealer()
tokens = dealer.pretoken("请问你是谁?", num=False, stpwd=True)
print(tokens)  # Output: filtered tokens without stop words and digits
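
The filtering steps described above can be sketched in isolation. This is a minimal illustration, not the Dealer implementation: simple_tokenize is a hypothetical whitespace-based stand-in for rag_tokenizer, STOP_WORDS and the punctuation pattern are made-up examples, and treating "digits" as purely numeric tokens is an assumption.

```python
import re

# Hypothetical stop-word set, for illustration only.
STOP_WORDS = {"the", "a", "的"}


def simple_tokenize(text):
    """Hypothetical stand-in for rag_tokenizer: split on whitespace."""
    return text.split()


def pretoken_sketch(txt, num=False, stpwd=True):
    """Tokenize, strip punctuation, then drop stop words and digit tokens."""
    kept = []
    for tk in simple_tokenize(txt):
        # Assumed "unwanted characters": common ASCII and CJK punctuation.
        tk = re.sub(r"[,.?!;:，。？！；：]+", "", tk)
        if not tk:
            continue  # token consisted only of punctuation
        if stpwd and tk in STOP_WORDS:
            continue  # stop-word filtering enabled
        if not num and re.fullmatch(r"[0-9]+", tk):
            continue  # digit tokens are dropped unless num=True
        kept.append(tk)
    return kept


print(pretoken_sketch("the cat has 9 lives !"))  # ['cat', 'has', 'lives']
```

Passing num=True keeps the "9" token, and stpwd=False keeps "the".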

Notes:


Method: tokenMerge(self, tks: list) -> list

Merges adjacent tokens based on heuristic rules to form multi-word terms.

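Parameters:

tks (list): The tokens to merge.

Returns:

list: The merged tokens. Tokens combined into one term are joined with a single space.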

Details:

Example:

merged_tokens = dealer.tokenMerge(["多", "工位"])
print(merged_tokens)  # Output: ['多 工位']
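
One way such merging heuristics can work is sketched here. The two rules are assumptions chosen to reproduce the example above (a single-character first token attaches to a longer second token, and runs of single-character tokens collapse into one term); the actual rules in Dealer may differ. The names token_merge_sketch and is_short are hypothetical.

```python
def is_short(tok):
    """Assumed notion of a 'short' token: exactly one character."""
    return len(tok) == 1


def token_merge_sketch(tks):
    res, i = [], 0
    while i < len(tks):
        # Rule 1 (assumed): a short first token attaches to a longer second token.
        if i == 0 and len(tks) > 1 and is_short(tks[0]) and len(tks[1]) > 1:
            res.append(" ".join(tks[:2]))
            i = 2
            continue
        # Rule 2 (assumed): a run of consecutive short tokens becomes one term.
        j = i
        while j < len(tks) and is_short(tks[j]):
            j += 1
        if j - i > 1:
            res.append(" ".join(tks[i:j]))
            i = j
        else:
            res.append(tks[i])
            i += 1
    return res


print(token_merge_sketch(["多", "工位"]))  # ['多 工位']
```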

Method: ner(self, t: str) -> str

Retrieves the NER tag for a given token.

Parameters:

t (str): The token to look up.

Returns:

str: The NER tag for the token.


Method: split(self, txt: str) -> list

Splits input text into tokens with logic to merge adjacent alphabetic tokens unless they are tagged as "func" entities.
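
The merge rule can be sketched as follows. This is an illustration under stated assumptions, not the Dealer code: the text is split on whitespace, a token counts as alphabetic when it ends in an ASCII letter, and NER_TAGS is a hypothetical tag table standing in for the class's NER data.

```python
import re

# Hypothetical NER table: these words are tagged as "func" entities.
NER_TAGS = {"of": "func", "the": "func"}


def split_sketch(txt, ner_tags=NER_TAGS):
    tks = []
    for t in txt.split():
        prev = tks[-1] if tks else None
        if (
            prev is not None
            and re.search(r"[a-zA-Z]$", prev)  # previous token ends in a letter
            and re.search(r"[a-zA-Z]$", t)  # current token ends in a letter
            and ner_tags.get(t) != "func"
            and ner_tags.get(prev) != "func"
        ):
            tks[-1] = prev + " " + t  # merge into the previous token
        else:
            tks.append(t)
    return tks


print(split_sketch("machine learning 算法 of data"))
# ['machine learning', '算法', 'of', 'data']
```

In this sketch "of" stays separate from "data" because "of" carries the "func" tag, while "machine" and "learning" merge into one term.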

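Parameters:

txt (str): The input text to split.

Returns:

list: The tokens, with adjacent alphabetic tokens merged where the rule above applies.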


Method: weights(self, tks: list, preprocess: bool=True) -> list

Computes normalized weights for a list of tokens based on their frequency, NER tags, POS tags, and IDF scores.

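Parameters:

tks (list): The tokens to weight.
preprocess (bool, default True): Whether to preprocess the tokens before computing weights.

Returns:

list: (token, weight) pairs, as the example usage shows.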

Details:

Example usage:

weights = dealer.weights(["机器学习", "算法"], preprocess=True)
for token, weight in weights:
    print(f"{token}: {weight:.4f}")
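
A multifactor weighting of this kind can be sketched as a product of per-token factors followed by normalization. Everything specific here is an assumption for illustration: the factor tables, the smoothed IDF formula, and normalizing so the weights sum to 1. A separate term-frequency factor is omitted for brevity. The formula Dealer actually uses is not given in this document.

```python
import math

# Hypothetical multipliers, for illustration only.
NER_FACTOR = {"corp": 3.0, "loca": 3.0, "func": 1.0}
POS_FACTOR = {"n": 2.0, "v": 1.0, "r": 0.3}


def smoothed_idf(doc_freq, n_docs):
    """A common smoothed IDF; always positive when doc_freq <= n_docs."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0


def weights_sketch(tokens, ner_of, pos_of, doc_freq, n_docs):
    raw = []
    for tok in tokens:
        score = smoothed_idf(doc_freq.get(tok, 0), n_docs)
        score *= NER_FACTOR.get(ner_of.get(tok, ""), 1.0)  # entity boost
        score *= POS_FACTOR.get(pos_of.get(tok, ""), 1.0)  # part-of-speech boost
        raw.append(score)
    total = sum(raw)
    if total == 0:
        return []  # only reached for an empty token list
    # Normalize so that the weights sum to 1.
    return [(tok, score / total) for tok, score in zip(tokens, raw)]


demo = weights_sketch(
    ["机器学习", "的"],
    ner_of={},
    pos_of={"机器学习": "n", "的": "r"},
    doc_freq={"机器学习": 10, "的": 900},
    n_docs=1000,
)
for token, weight in demo:
    print(f"{token}: {weight:.4f}")
```

With these made-up inputs, the rare noun receives most of the weight and the frequent particle very little, which is the qualitative behavior such a scheme aims for.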

Important Implementation Details


Interaction with Other System Components


Visual Diagram

classDiagram
    class Dealer {
        -stop_words: set
        -ne: dict
        -df: dict
        +__init__()
        +pretoken(txt: str, num: bool=False, stpwd: bool=True) list
        +tokenMerge(tks: list) list
        +ner(t: str) str
        +split(txt: str) list
        +weights(tks: list, preprocess: bool=True) list
    }

Summary

The term_weight.py file is a utility module for token processing and term weighting in Chinese and mixed-language NLP tasks. Its Dealer class handles stop word filtering, token merging, NER lookup, and a multifactor weighting scheme that combines frequency, POS, NER, and IDF information. The resulting weighted terms feed downstream NLP components in the InfiniFlow system.