rag_tokenizer.py
Overview
The rag_tokenizer.py file defines a sophisticated tokenizer class, RagTokenizer, designed primarily for processing multilingual text including Chinese (both simplified and traditional), English, and numeric/alphanumeric strings. It combines trie-based dictionary lookups, heuristic segmentation algorithms, and language-specific normalization techniques to perform fine-grained tokenization and tagging of text for natural language processing tasks.
This tokenizer is particularly tailored for Chinese text segmentation with fallback support for English stemming and lemmatization, making it suitable for applications such as information retrieval, text mining, machine learning preprocessing, or any system requiring accurate tokenization of mixed-language content.
Detailed Explanation of Components
Class: RagTokenizer
RagTokenizer encapsulates all logic related to loading token dictionaries, segmenting input text, and normalizing tokens. It utilizes a trie structure (datrie.Trie) for efficient prefix-based lookup of tokens and their associated frequency and tag information.
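To make the lookup structure concrete, here is a minimal pure-Python sketch of a trie that maps keys to (frequency, tag) pairs and answers prefix queries. It is an illustration only: SimpleTrie, its methods, and the stored values are hypothetical stand-ins and do not reproduce datrie's implementation.

```python
class _Node:
    """One trie node: child links plus an optional stored value."""
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}
        self.value = None


class SimpleTrie:
    """Hypothetical stand-in for a string-keyed trie (not datrie itself)."""

    def __init__(self):
        self._root = _Node()

    def _find(self, key):
        # Walk the trie one character at a time; None means "no such path".
        node = self._root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def __setitem__(self, key, value):
        node = self._root
        for ch in key:
            node = node.children.setdefault(ch, _Node())
        node.value = value

    def __contains__(self, key):
        node = self._find(key)
        return node is not None and node.value is not None

    def __getitem__(self, key):
        node = self._find(key)
        if node is None or node.value is None:
            raise KeyError(key)
        return node.value

    def has_prefix(self, prefix):
        """True if at least one stored key starts with `prefix`."""
        return self._find(prefix) is not None


trie = SimpleTrie()
trie["中文"] = (-3, "n")        # illustrative (log-frequency, tag) values
trie["中文文本"] = (-6, "n")
```

The prefix query is what lets a segmenter stop extending a candidate early: once no dictionary entry starts with the current substring, longer substrings cannot match either.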
Initialization: __init__(self, debug=False)
Purpose: Initialize the tokenizer instance by loading the token dictionary trie from disk or building it from a dictionary text file if missing.
Parameters:
debug (bool): Optional flag to enable debug logging.
Details:
Sets up stemming and lemmatization tools (PorterStemmer, WordNetLemmatizer) for English.
Defines regex patterns for splitting tokens by punctuation, alphanumeric groups, and special characters.
Attempts to load a cached trie file (huqie.txt.trie); otherwise builds the trie from the dictionary file huqie.txt.
Usage:
tokenizer = RagTokenizer(debug=True)
Method: key_(self, line)
Purpose: Generate a consistent trie key from a string by lowercasing and UTF-8 encoding.
Parameters:
line (str): Input token string.
Returns: (str) Encoded key string for trie lookup.
Usage:
key = tokenizer.key_("Example")
Method: rkey_(self, line)
Purpose: Generate a reversed and prefixed trie key, used for backward trie searches.
Parameters: Same as key_.
Returns: (str) Reversed encoded key string.
Usage:
reversed_key = tokenizer.rkey_("Example")
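The two key helpers can be sketched as below. This is one plausible reading of "lowercasing and UTF-8 encoding" that keeps the result a plain str; the function names and the "DD" prefix are assumptions for illustration and are not confirmed by this document.

```python
def make_key(token: str) -> str:
    """Lowercase, UTF-8 encode, and render the bytes as an ASCII-only string."""
    # str(b"abc") gives "b'abc'"; slicing [2:-1] drops the b'...' wrapper.
    return str(token.lower().encode("utf-8"))[2:-1]


def make_reversed_key(token: str, prefix: str = "DD") -> str:
    """Same encoding applied to the reversed token, behind an assumed prefix.

    The prefix keeps reversed keys in their own region of the trie so they
    cannot collide with forward keys.
    """
    return str((prefix + token[::-1].lower()).encode("utf-8"))[2:-1]


forward = make_key("Example")
backward = make_reversed_key("Example")
```

Non-ASCII characters end up as escaped byte sequences, so every key consists of ASCII characters only.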
Method: loadDict_(self, fnm)
Purpose: Load tokens from a dictionary file into the trie structure, calculating logarithmic frequency scores.
Parameters:
fnm (str): Path to the dictionary text file.
Details:
Reads lines, extracts token, frequency, and tag.
Converts frequency to a logarithmic scale for scoring.
Stores token and reversed token keys in the trie.
Saves the built trie to a cache file.
Usage:
tokenizer.loadDict_("/path/to/dictionary.txt")
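The frequency conversion described above can be sketched as follows. This assumes each dictionary line holds a token, a raw count, and a tag separated by whitespace, and that counts are normalized by a constant before taking the logarithm. The value of DENOMINATOR, the helper names, and the sample lines are all assumptions, and a plain dict stands in for the trie.

```python
import math

DENOMINATOR = 1_000_000  # assumed normalization constant


def parse_entry(line: str):
    """Split 'token count tag' and convert the count to an integer log score."""
    token, count, tag = line.split()[:3]
    log_freq = int(math.log(float(count) / DENOMINATOR) + 0.5)
    return token, log_freq, tag


def build_table(lines):
    """Keep, for each token, the entry with the highest log score."""
    table = {}
    for line in lines:
        if not line.strip():
            continue
        token, log_freq, tag = parse_entry(line)
        key = token.lower()
        if key not in table or table[key][0] < log_freq:
            table[key] = (log_freq, tag)
    return table


sample = ["中文 1000000 n", "的 10 u", "的 1 x"]  # made-up entries
table = build_table(sample)
```

Storing a small integer score instead of the raw count keeps the values compact and lets a segmentation be scored by summing per-token scores.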
Method: loadUserDict(self, fnm)
Purpose: Load a user-provided dictionary trie, or build it if missing.
Parameters:
fnm (str): Path to the user dictionary file (without the .trie suffix).
Usage:
tokenizer.loadUserDict("/user/dict/path")
Method: addUserDict(self, fnm)
Purpose: Add entries from an additional dictionary file into the existing trie.
Parameters:
fnm (str): Path to the dictionary text file.
Usage:
tokenizer.addUserDict("/additional/dict.txt")
Method: _strQ2B(self, ustring)
Purpose: Convert full-width (double-byte) characters to half-width (single-byte) characters, common in East Asian typography normalization.
Parameters:
ustring (str): Input string possibly containing full-width characters.
Returns: (str) Normalized string with half-width characters.
Usage:
half_width_str = tokenizer._strQ2B(full_width_str)
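A common way to do this conversion relies on the Unicode layout: the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII '!' to '~', and the ideographic space U+3000 maps to an ordinary space. The sketch below uses that approach; full_to_half is a hypothetical name, and the method's exact handling of edge cases may differ.

```python
def full_to_half(text: str) -> str:
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!' through '~'
            out.append(chr(code - 0xFEE0))
        else:                             # everything else is left untouched
            out.append(ch)
    return "".join(out)


converted = full_to_half("\uff21\uff22\uff23\uff11\uff12\uff13")
```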
Method: _tradi2simp(self, line)
Purpose: Convert traditional Chinese characters to simplified Chinese using HanziConv.
Parameters:
line (str): Input Chinese text.
Returns: (str) Simplified Chinese text.
Usage:
simplified_text = tokenizer._tradi2simp(traditional_text)
Method: dfs_(self, chars, s, preTks, tkslist, _depth=0, _memo=None)
Purpose: Perform a depth-first search segmentation of characters into token candidates using the trie dictionary.
Parameters:
chars (list of str): List of characters to segment.
s (int): Current start index.
preTks (list): Previously segmented tokens.
tkslist (list): Accumulator for possible token segmentations.
_depth (int): Current recursion depth (default 0).
_memo (dict): Memoization cache to avoid redundant computation.
Returns: (int) The farthest index reached during segmentation.
Details:
Handles repetitive character sequences specially.
Uses trie prefix checks to prune invalid segmentations.
Uses memoization to optimize repeated calls with the same state.
Usage:
tokens_list = []
tokenizer.dfs_(list("文本"), 0, [], tokens_list)
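The idea of dictionary-driven DFS with prefix pruning and memoization can be sketched as below. This is a simplified illustration and not the dfs_ method itself: it has a different signature, returns every segmentation instead of filling an accumulator, skips the special handling of repeated characters, and uses plain sets in place of the trie. All names and the sample dictionary are hypothetical.

```python
def dfs_segment(chars, dictionary, prefixes, start=0, memo=None):
    """Return every segmentation of chars[start:] into dictionary words.

    Single characters are always accepted so that a segmentation exists
    even for text the dictionary does not cover.
    """
    if memo is None:
        memo = {}
    if start == len(chars):
        return [[]]
    if start in memo:                      # reuse results for a seen suffix
        return memo[start]
    results = []
    for end in range(start + 1, len(chars) + 1):
        word = "".join(chars[start:end])
        if end - start > 1 and word not in prefixes:
            break                          # no dictionary word starts this way
        if end - start == 1 or word in dictionary:
            for rest in dfs_segment(chars, dictionary, prefixes, end, memo):
                results.append([word] + rest)
    memo[start] = results
    return results


dictionary = {"中文", "文本"}              # made-up entries
prefixes = {w[:i] for w in dictionary for i in range(1, len(w) + 1)}
candidates = dfs_segment(list("中文文本"), dictionary, prefixes)
```

Each candidate can then be scored, and the highest-scoring segmentation kept.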
Method: freq(self, tk)
Purpose: Retrieve the original frequency of a token from the trie.
Parameters:
tk (str): Token string.
Returns: (int) Frequency count or 0 if not found.
Usage:
frequency = tokenizer.freq("token")
Method: tag(self, tk)
Purpose: Retrieve the tag (e.g., part-of-speech or category) associated with a token.
Parameters: Same as freq.
Returns: (str) Tag string or empty string if not found.
Usage:
token_tag = tokenizer.tag("token")
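A sketch of both lookups follows. It assumes the stored score is an integer log of the count divided by a normalization constant, so the count is recovered by exponentiating. The constant, the table contents, and the function bodies are illustrative assumptions, with a plain dict in place of the trie.

```python
import math

DENOMINATOR = 1_000_000  # assumed normalization constant

# Illustrative table: key -> (integer log score, tag)
table = {"中文": (0, "n"), "的": (-11, "u")}


def freq(token: str) -> int:
    """Invert the log score back to an approximate raw count; 0 if unknown."""
    entry = table.get(token.lower())
    if entry is None:
        return 0
    return int(math.exp(entry[0]) * DENOMINATOR + 0.5)


def tag(token: str) -> str:
    """Return the stored tag, or an empty string if the token is unknown."""
    entry = table.get(token.lower())
    return entry[1] if entry else ""


common = freq("中文")
rare = freq("的")
```

Under these assumptions the recovered count is approximate, because the log score was rounded to an integer when it was stored.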
Method: score_(self, tfts)
Purpose: Score a token sequence based on frequency and length heuristics.
Parameters:
tfts (list of tuples): List of (token, (frequency, tag)) pairs.
Returns: Tuple (tokens, score), where tokens is a list of token strings and score is a float indicating segmentation quality.
Usage:
tokens, score = tokenizer.score_([(token, (freq, tag)), ...])
Method: sortTks_(self, tkslist)
Purpose: Sort a list of token sequences by their computed score in descending order.
Parameters:
tkslist (list): List of token sequences with frequency/tag info.
Returns: Sorted list of (tokens, score) tuples.
Usage:
sorted_tokens = tokenizer.sortTks_(token_sequences)
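The scoring and sorting steps can be sketched together. The heuristic below rewards segmentations with fewer tokens, a larger share of multi-character tokens, and a higher summed log frequency. The constant 30 and the way the three terms are combined are assumptions for illustration, not something this document specifies, and the input values are made up.

```python
def score(tfts, base=30):
    """Score one candidate: list of (token, (log_freq, tag)) pairs, non-empty."""
    total_freq, long_count, tokens = 0, 0, []
    for token, (log_freq, _tag) in tfts:
        total_freq += log_freq
        long_count += 1 if len(token) >= 2 else 0
        tokens.append(token)
    n = len(tokens)
    # Fewer tokens -> larger base / n; more multi-character tokens -> larger ratio.
    return tokens, base / n + long_count / n + total_freq


def sort_candidates(candidates):
    """Score every candidate and order them best first."""
    return sorted((score(c) for c in candidates),
                  key=lambda pair: pair[1], reverse=True)


candidates = [
    [("中", (-2, "n")), ("文", (-2, "n")), ("文", (-2, "n")), ("本", (-2, "n"))],
    [("中文", (-3, "n")), ("文本", (-4, "n"))],
]
ranked = sort_candidates(candidates)
```

With these made-up values the two-word segmentation ranks first, since it has fewer tokens and both are multi-character.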
Method: merge_(self, tks)
Purpose: Merge tokens that include split characters into combined tokens if they exist in the dictionary.
Parameters:
tks (str): String of tokens separated by spaces.
Returns: (str) Merged token string.
Usage:
merged = tokenizer.merge_("token1 token2 token3")
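One way to picture the merge step: slide over the token list, join a short window of adjacent tokens, and keep the longest join that contains a split character and is itself a dictionary entry. The sketch below follows that reading. The split-character pattern, the window size, and the dictionary are assumptions chosen for illustration.

```python
import re

SPLIT = re.compile(r"[-+./_]")  # assumed set of split characters


def merge_tokens(tks: str, dictionary, max_window: int = 5) -> str:
    """Re-join adjacent tokens when the joined form is a known entry."""
    parts = tks.split()
    merged, s = [], 0
    while s < len(parts):
        end = s + 1                                   # default: keep token as is
        upper = min(len(parts), s + max_window)
        for e in range(s + 2, upper + 1):
            candidate = "".join(parts[s:e])
            if SPLIT.search(candidate) and candidate in dictionary:
                end = e                               # remember the longest match
        merged.append("".join(parts[s:end]))
        s = end
    return " ".join(merged)


result = merge_tokens("e - mail 地址", {"e-mail"})
```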
Method: maxForward_(self, line)
Purpose: Perform a maximum matching forward segmentation using the trie.
Parameters:
line (str): Input string.
Returns: Tuple (tokens, score) from score_.
Usage:
tokens, score = tokenizer.maxForward_("文本")
Method: maxBackward_(self, line)
Purpose: Perform a maximum matching backward segmentation using the reversed trie keys.
Parameters: Same as maxForward_.
Returns: Similar tuple with tokens and score.
Usage:
tokens, score = tokenizer.maxBackward_("文本")
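The textbook form of both algorithms is sketched below with a plain set as the dictionary and a fixed maximum word length. This is the classic technique rather than the class's trie-driven implementation: the function names, max_len, and the sample dictionary are assumptions, and scoring is left out.

```python
def max_forward(text, dictionary, max_len=4):
    """Greedy left-to-right: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:   # single chars always accepted
                tokens.append(word)
                i += size
                break
    return tokens


def max_backward(text, dictionary, max_len=4):
    """Greedy right-to-left: take the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            word = text[j - size:j]
            if size == 1 or word in dictionary:
                tokens.append(word)
                j -= size
                break
    tokens.reverse()                              # restore reading order
    return tokens


dictionary = {"研究", "研究生", "生命", "起源"}    # made-up entries
fwd = max_forward("研究生命起源", dictionary)
bwd = max_backward("研究生命起源", dictionary)
```

Here the two directions disagree (研究生 / 命 / 起源 versus 研究 / 生命 / 起源), which is the kind of ambiguity that a scoring or search step is needed to resolve.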
Method: english_normalize_(self, tks)
Purpose: Normalize English tokens by applying lemmatization and stemming.
Parameters:
tks (list of str): List of tokens.
Returns: List of normalized tokens.
Usage:
normalized = tokenizer.english_normalize_(["running", "dogs"])
Method: _split_by_lang(self, line)
Purpose: Split a line of text into segments of homogeneous language (Chinese vs non-Chinese).
Parameters:
line (str): Input text line.
Returns: List of tuples (segment, is_chinese) indicating language of each segment.
Usage:
segments = tokenizer._split_by_lang("some 中文 text")
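A compact way to get such runs is to group consecutive characters by whether they fall in a CJK range. The sketch below does only that: it assumes the range U+4E00 to U+9FA5 and does not pre-split on punctuation, which the method may well do first, so whitespace stays attached to the non-Chinese runs. Both function names are hypothetical.

```python
from itertools import groupby


def _is_cjk(ch: str) -> bool:
    """Assumed check: character lies in the CJK range U+4E00..U+9FA5."""
    return "\u4e00" <= ch <= "\u9fa5"


def split_by_lang(line: str):
    """Group consecutive characters into (segment, is_chinese) runs."""
    return [("".join(group), flag) for flag, group in groupby(line, key=_is_cjk)]


segments = split_by_lang("some 中文 text")
```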
Method: tokenize(self, line)
Purpose: Main entry point to tokenize an input text line into a sequence of tokens.
Parameters:
line (str): Raw input text.
Returns: (str) Tokenized and merged string.
Details:
Normalizes text by converting full-width to half-width and traditional to simplified Chinese.
Splits text by language segments.
Uses forward and backward max matching with DFS refinement for Chinese.
Uses English tokenization and normalization for non-Chinese segments.
Usage:
tokens = tokenizer.tokenize("Some input text 中文文本")
Method: fine_grained_tokenize(self, tks)
Purpose: Further split tokens into finer-grained units, especially useful when tokens contain subcomponents separated by / or other delimiters.
Parameters:
tks (str): Tokenized string.
Returns: (str) Fine-grained tokenized string.
Usage:
fine_tokens = tokenizer.fine_grained_tokenize("token1 token2")
Utility Functions
These functions operate outside of the class for character-type checks and simple token joining.
is_chinese(s): Returns True if character s is a Chinese character (Unicode range).
is_number(s): Returns True if character s is a numeric digit.
is_alphabet(s): Returns True if character s is an English alphabet letter.
naiveQie(txt): Naively inserts spaces between alphabetic tokens when adjacent tokens end and start with letters.
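These helpers amount to simple range comparisons on a single character. The sketch below shows one plausible form: the Unicode range used for Chinese, the explicit boolean return values, and the name naive_qie are assumptions, and naive_qie returns a list in which a " " element marks each inserted space.

```python
def is_chinese(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fa5"      # assumed CJK range


def is_number(ch: str) -> bool:
    return "0" <= ch <= "9"


def is_alphabet(ch: str) -> bool:
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


def naive_qie(txt: str):
    """Split on whitespace; put " " between neighbours that meet letter-to-letter."""
    out = []
    for tok in txt.split():
        if out and is_alphabet(out[-1][-1]) and is_alphabet(tok[0]):
            out.append(" ")
        out.append(tok)
    return out


pieces = naive_qie("hello world 中文")
```

Joining the pieces keeps English words apart while Chinese text stays contiguous.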
Module-level Objects and Functions
An instance of RagTokenizer is created as tokenizer.
Wrapper functions are exposed for convenience:
tokenize = tokenizer.tokenize
fine_grained_tokenize = tokenizer.fine_grained_tokenize
tag = tokenizer.tag
freq = tokenizer.freq
loadUserDict = tokenizer.loadUserDict
addUserDict = tokenizer.addUserDict
tradi2simp = tokenizer._tradi2simp
strQ2B = tokenizer._strQ2B
Important Implementation Details and Algorithms
Trie-based dictionary: Uses datrie.Trie for efficient prefix search of tokens, enabling fast lookup and segmentation decisions.
Forward and backward maximum matching: The tokenizer segments text using both forward and backward maximum matching algorithms (maxForward_ and maxBackward_) and compares results to improve segmentation accuracy.
Depth-first search refinement: When forward and backward segmentations differ, a DFS approach (dfs_) explores possible segmentations to find the best scoring tokenization.
Frequency scoring: Token frequencies from dictionary files are converted to logarithmic scale for scoring segmentation quality.
Full-width to half-width conversion: Handles normalization of East Asian character widths.
Traditional to simplified Chinese conversion: Uses the hanziconv library for Chinese character normalization.
English token normalization: Applies stemming and lemmatization to English tokens using NLTK.
Language segmentation: Splits mixed-language text into homogeneous segments for language-appropriate tokenization.
Interaction with Other Parts of the System
Dependency on external libraries:
datrie for trie implementation.
hanziconv for Chinese character conversion.
nltk for English tokenization, stemming, and lemmatization.
File utilities: Uses get_project_base_directory from api.utils.file_utils to resolve resource paths.
Resource files:
Dictionary text files (huqie.txt) and cached tries (huqie.txt.trie) located in rag/res/huqie.
Usage scenario:
Can be integrated into larger NLP pipelines requiring tokenization of Chinese-English mixed texts.
Supports user dictionaries to customize token sets.
Usage Example
from rag_tokenizer import RagTokenizer
tokenizer = RagTokenizer(debug=True)
text = "公开征求意见稿提出,境外投资者可使用自有人民币或外汇投资。"
tokens = tokenizer.tokenize(text)
print(tokens)
fine_tokens = tokenizer.fine_grained_tokenize(tokens)
print(fine_tokens)
Mermaid Class Diagram of RagTokenizer
classDiagram
class RagTokenizer {
-DEBUG: bool
-DENOMINATOR: int
-DIR_: str
-stemmer: PorterStemmer
-lemmatizer: WordNetLemmatizer
-SPLIT_CHAR: str
-trie_: datrie.Trie
+__init__(debug=False)
+key_(line) string
+rkey_(line) string
+loadDict_(fnm)
+loadUserDict(fnm)
+addUserDict(fnm)
+_strQ2B(ustring) string
+_tradi2simp(line) string
+dfs_(chars, s, preTks, tkslist, _depth=0, _memo=None) int
+freq(tk) int
+tag(tk) string
+score_(tfts) (list, float)
+sortTks_(tkslist) list
+merge_(tks) string
+maxForward_(line) (list, float)
+maxBackward_(line) (list, float)
+english_normalize_(tks) list
+_split_by_lang(line) list
+tokenize(line) string
+fine_grained_tokenize(tks) string
}
Summary
The rag_tokenizer.py file provides a powerful, dictionary-driven tokenizer specialized for Chinese text with support for English and numeric tokens. It leverages a trie data structure for efficient token lookup, combines forward and backward maximum matching algorithms, and applies linguistic normalization techniques. The class is extensible via user dictionaries and is designed to integrate into larger NLP systems for precise token segmentation and tagging.