rag_tokenizer.py

Overview

The rag_tokenizer.py file defines RagTokenizer, a tokenizer class for multilingual text: Chinese (both simplified and traditional), English, and numeric or alphanumeric strings. It combines trie-based dictionary lookups, heuristic segmentation algorithms, and language-specific normalization to tokenize and tag text at a fine granularity for natural language processing tasks.

This tokenizer is particularly tailored for Chinese text segmentation with fallback support for English stemming and lemmatization, making it suitable for applications such as information retrieval, text mining, machine learning preprocessing, or any system requiring accurate tokenization of mixed-language content.

Detailed Explanation of Components

Class: `RagTokenizer`

RagTokenizer encapsulates all logic related to loading token dictionaries, segmenting input text, and normalizing tokens. It utilizes a trie structure (datrie.Trie) for efficient prefix-based lookup of tokens and their associated frequency and tag information.

Initialization: `__init__(self, debug=False)`

Purpose: Initialize the tokenizer instance by loading the token dictionary trie from disk or building it from a dictionary text file if missing.
Parameters:
- debug (bool): Optional flag to enable debug logging.
Details:
- Sets up stemming and lemmatization tools (PorterStemmer, WordNetLemmatizer) for English.
- Defines regex patterns for splitting tokens by punctuation, alphanumeric groups, and special characters.
- Attempts to load a cached trie file (huqie.txt.trie), otherwise builds the trie from dictionary file huqie.txt.
Usage:
```
tokenizer = RagTokenizer(debug=True)
```
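The load-or-build behaviour described in the details above can be sketched without datrie. This is a minimal illustration rather than the class's actual code: it substitutes a plain dict and a JSON cache for the trie file, and the name `load_or_build`, the `.cache.json` suffix, and the three-column `token frequency tag` line format are all assumptions.

```python
import json
import os
import tempfile


def load_or_build(dict_path):
    """Return (table, loaded_from_cache) for a dictionary text file."""
    cache_path = dict_path + ".cache.json"  # hypothetical cache file name
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            return json.load(fh), True

    table = {}
    with open(dict_path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 3:  # skip malformed lines
                continue
            token, freq, tag = parts
            table[token.lower()] = [int(freq), tag]

    # Persist the parsed table so the next start-up can skip parsing.
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(table, fh, ensure_ascii=False)
    return table, False


# Demo on a throwaway file: the first call builds, the second hits the cache.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_dict.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("中国 1000 ns\n文本 200 n\n")
    first, first_cached = load_or_build(path)
    second, second_cached = load_or_build(path)
```

The point of the cache is start-up cost: parsing a large dictionary is slow, while reloading a serialized structure is comparatively cheap.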

Method: `key_(self, line)`

Purpose: Generate a consistent trie key from a string by lowercasing and UTF-8 encoding.
Parameters:
- line (str): Input token string.
Returns: (str) Encoded key string for trie lookup.
Usage:
```
key = tokenizer.key_("Example")
```

Method: `rkey_(self, line)`

Purpose: Generate a reversed and prefixed trie key, used for backward trie searches.
Parameters: Same as key_.
Returns: (str) Reversed encoded key string.

Usage:
```
reversed_key = tokenizer.rkey_("Example")
```
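To make the two key forms concrete, here is a small standalone sketch. It assumes the encoded bytes are turned back into a `str` by taking their printable representation, and it assumes `"DD"` as the marker prefix for reversed keys; neither detail is stated in this document, so treat both as illustrative.

```python
def key(line):
    """Lowercase, UTF-8 encode, and keep the bytes repr without the b'...' wrapper."""
    return str(line.lower().encode("utf-8"))[2:-1]


def rkey(line, prefix="DD"):  # "DD" is an assumed marker, not confirmed here
    """Same as key(), but built from the reversed string plus a marker prefix."""
    return str((prefix + line[::-1].lower()).encode("utf-8"))[2:-1]


forward_key = key("Example")
backward_key = rkey("Example")
chinese_key = key("中")  # non-ASCII bytes show up as escaped \x.. sequences
```

Storing both forms in one trie lets forward matching walk prefixes of the text while backward matching walks prefixes of the reversed text; the marker keeps the two key spaces from colliding.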

Method: `loadDict_(self, fnm)`

Purpose: Load tokens from a dictionary file into the trie structure, calculating logarithmic frequency scores.
Parameters:
- fnm (str): Path to the dictionary text file.
Details:
- Reads lines, extracts token, frequency, and tag.
- Converts frequency to a logarithmic scale for scoring.
- Stores token and reversed token keys in the trie.
- Saves the built trie to a cache file.

Usage:
```
tokenizer.loadDict_("/path/to/dictionary.txt")
```
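The logarithmic frequency conversion can be illustrated in isolation. The sketch below assumes a whitespace-separated `token frequency tag` line, a scaling constant of 1,000,000, and rounding by adding 0.5 before truncation; these specifics are assumptions for illustration, and the helper names are hypothetical.

```python
import math

DENOMINATOR = 1_000_000  # assumed scaling constant


def parse_entry(line):
    """Turn 'token freq tag' into (key, log-scaled frequency, tag)."""
    token, freq, tag = line.split()
    log_freq = int(math.log(float(freq) / DENOMINATOR) + 0.5)
    return token.lower(), log_freq, tag


def approx_freq(log_freq):
    """Invert the log scaling; the result is only approximate."""
    return int(math.exp(log_freq) * DENOMINATOR + 0.5)


common = parse_entry("中国 1000000 ns")
rare = parse_entry("罕见 3 a")
recovered = approx_freq(rare[1])
```

Because the stored value is an integer logarithm, the round trip is lossy: a raw count of 3 comes back as 6 in this sketch. That is acceptable when the value is only used to rank competing segmentations.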

Method: `loadUserDict(self, fnm)`

Purpose: Load a user-provided dictionary trie, or build it if missing.
Parameters:
- fnm (str): Path to the user dictionary file (without .trie suffix).

Usage:
```
tokenizer.loadUserDict("/user/dict/path")
```

Method: `addUserDict(self, fnm)`

Purpose: Add entries from an additional dictionary file into the existing trie.
Parameters:
- fnm (str): Path to the dictionary text file.

Usage:
```
tokenizer.addUserDict("/additional/dict.txt")
```

Method: `_strQ2B(self, ustring)`

Purpose: Convert full-width (double-byte) characters to half-width (single-byte) characters, common in East Asian typography normalization.
Parameters:
- ustring (str): Input string possibly containing full-width characters.
Returns: (str) Normalized string with half-width characters.

Usage:
```
half_width_str = tokenizer._strQ2B(full_width_str)
```
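Full-width to half-width conversion relies on a fixed Unicode offset. The following is a minimal sketch of the standard mapping (the function name `full_to_half` is hypothetical): the ideographic space U+3000 maps to an ASCII space, and U+FF01 through U+FF5E map to their ASCII counterparts by subtracting 0xFEE0.

```python
def full_to_half(text):
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:              # ideographic (full-width) space
            code = 0x0020
        elif 0xFF01 <= code <= 0xFF5E:  # full-width '!' through '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


# Full-width "ABC123", an ideographic space, full-width "x" and "!".
converted = full_to_half("\uff21\uff22\uff23\uff11\uff12\uff13\u3000\uff58\uff01")
untouched = full_to_half("中文")
```

Characters outside those ranges, including Chinese ideographs, pass through unchanged.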

Method: `_tradi2simp(self, line)`

Purpose: Convert traditional Chinese characters to simplified Chinese using HanziConv.
Parameters:
- line (str): Input Chinese text.
Returns: (str) Simplified Chinese text.

Usage:
```
simplified_text = tokenizer._tradi2simp(traditional_text)
```

Method: `dfs_(self, chars, s, preTks, tkslist, _depth=0, _memo=None)`

Purpose: Perform a depth-first search segmentation of characters into token candidates using the trie dictionary.
Parameters:
- chars (list of str): List of characters to segment.
- s (int): Current start index.
- preTks (list): Previously segmented tokens.
- tkslist (list): Accumulator for possible token segmentations.
- _depth (int): Current recursion depth (default 0).
- _memo (dict): Memoization cache to avoid redundant computation.
Returns: (int) The farthest index reached during segmentation.
Details:
- Handles repetitive character sequences specially.
- Uses trie prefix checks to prune invalid segmentations.
- Uses memoization to optimize repeated calls with the same state.

Usage:
```
tokens_list = []
tokenizer.dfs_(list("文本"), 0, [], tokens_list)
```
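The idea behind DFS segmentation can be shown with a much smaller stand-in. The sketch below enumerates every way to cut a string into dictionary words, falling back to a single character when nothing matches. It uses a plain `set` instead of a trie and omits the memoization and repeated-character handling mentioned above; the function name and the `max_len` parameter are assumptions.

```python
def enumerate_segmentations(text, vocab, max_len=4):
    """Collect every segmentation of text into vocab words (or lone characters)."""
    results = []

    def dfs(start, prefix):
        if start == len(text):
            results.append(list(prefix))
            return
        matched = False
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            piece = text[start:end]
            if piece in vocab:
                matched = True
                prefix.append(piece)
                dfs(end, prefix)
                prefix.pop()
        if not matched:
            # No dictionary word starts here: emit one character and move on.
            prefix.append(text[start])
            dfs(start + 1, prefix)
            prefix.pop()

    dfs(0, [])
    return results


candidates = enumerate_segmentations("abc", {"a", "b", "c", "ab", "bc"})
```

The number of candidates can grow exponentially with input length, which is why pruning by trie prefix and caching repeated states matter in practice.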

Method: `freq(self, tk)`

Purpose: Retrieve the original frequency of a token from the trie.
Parameters:
- tk (str): Token string.
Returns: (int) Frequency count or 0 if not found.
Usage:
```
frequency = tokenizer.freq("token")
```

Method: `tag(self, tk)`

Purpose: Retrieve the tag (e.g., part-of-speech or category) associated with a token.
Parameters: Same as freq.
Returns: (str) Tag string or empty string if not found.
Usage:
```
token_tag = tokenizer.tag("token")
```

Method: `score_(self, tfts)`

Purpose: Score a token sequence based on frequency and length heuristics.
Parameters:
- tfts (list of tuples): List of (token, (frequency, tag)) pairs.
Returns: Tuple (tokens, score), where tokens is a list of token strings and score a float score indicating segmentation quality.

Usage:
```
tokens, score = tokenizer.score_([(token, (freq, tag)), ...])
```

Method: `sortTks_(self, tkslist)`

Purpose: Sort a list of token sequences by their computed score in descending order.
Parameters:
- tkslist (list): List of token sequences with frequency/tag info.
Returns: Sorted list of (tokens, score) tuples.

Usage:
```
sorted_tokens = tokenizer.sortTks_(token_sequences)
```
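One plausible shape for the scoring and sorting steps is sketched below. The formula is an assumption chosen to match the description (frequency plus length heuristics): it sums the log-scaled frequencies, adds the share of multi-character tokens, and adds a bonus that shrinks as the token count grows. The bonus value of 30 and the function names are illustrative, not taken from this document.

```python
def score(pairs, bonus=30):
    """Score a list of (token, (log_freq, tag)) pairs; higher is better."""
    tokens = [tok for tok, _ in pairs]
    freq_sum = sum(freq for _, (freq, _tag) in pairs)
    multi_char_ratio = sum(len(tok) >= 2 for tok in tokens) / len(tokens)
    # Fewer tokens earn a larger bonus, which favours longer dictionary words.
    return tokens, bonus / len(tokens) + multi_char_ratio + freq_sum


def sort_candidates(candidates):
    """Score every candidate segmentation and sort best-first."""
    return sorted((score(c) for c in candidates), key=lambda r: r[1], reverse=True)


ranked = sort_candidates([
    [("研究生", (-9, "n")), ("命", (-11, "n")), ("起源", (-10, "n"))],
    [("研究", (-8, "n")), ("生命", (-9, "n")), ("起源", (-10, "n"))],
])
```

In this example the second candidate wins because all of its tokens are multi-character and their summed log frequency is higher.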

Method: `merge_(self, tks)`

Purpose: Merge tokens that include split characters into combined tokens if they exist in the dictionary.
Parameters:
- tks (str): String of tokens separated by spaces.
Returns: (str) Merged token string.

Usage:
```
merged = tokenizer.merge_("token1 token2 token3")
```
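A rough sketch of the merging idea follows. It assumes that "split characters" means punctuation-like characters, that at most five neighbouring tokens are considered at once, and that a plain `set` stands in for the dictionary; the name `merge_tokens` and both parameters are hypothetical.

```python
import re


def merge_tokens(tks, vocab, window=5):
    """Re-join neighbouring tokens whose concatenation is a known entry."""
    tokens = tks.split()
    out, i = [], 0
    while i < len(tokens):
        end = i + 1
        # Try the longest run first so the largest known entry wins.
        for j in range(min(len(tokens), i + window), i + 1, -1):
            joined = "".join(tokens[i:j])
            # Only merge runs that contain a punctuation-like character.
            if joined in vocab and re.search(r"[^\w\s]", joined):
                end = j
                break
        out.append("".join(tokens[i:end]))
        i = end
    return " ".join(out)


merged = merge_tokens("c + + code", {"c++"})
unchanged = merge_tokens("ab cd", {"abcd"})
```

This repairs entries such as `c++` that an earlier punctuation split would otherwise break apart, while leaving ordinary word boundaries alone.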

Method: `maxForward_(self, line)`

Purpose: Perform a maximum matching forward segmentation using the trie.
Parameters:
- line (str): Input string.
Returns: Tuple (tokens, score) from score_.

Usage:
```
tokens, score = tokenizer.maxForward_("文本")
```

Method: `maxBackward_(self, line)`

Purpose: Perform a maximum matching backward segmentation using the reversed trie keys.
Parameters: Same as maxForward_.
Returns: Similar tuple with tokens and score.

Usage:
```
tokens, score = tokenizer.maxBackward_("文本")
```
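Forward and backward maximum matching are easiest to compare side by side. The sketch below is a textbook version that uses a `set` vocabulary and a fixed maximum word length instead of trie prefix checks, and it returns bare token lists rather than scored tuples; the function names and `max_len` are assumptions.

```python
def max_forward(text, vocab, max_len=4):
    """Greedy longest match, scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # a single character always matches
                tokens.append(piece)
                i += size
                break
    return tokens


def max_backward(text, vocab, max_len=4):
    """Greedy longest match, scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


vocab = {"研究", "研究生", "生命", "起源"}
forward = max_forward("研究生命起源", vocab)
backward = max_backward("研究生命起源", vocab)
```

On this classic ambiguous phrase the forward pass greedily takes 研究生 and strands 命, while the backward pass recovers 研究 / 生命 / 起源. Disagreements like this are what make running both directions worthwhile.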

Method: `english_normalize_(self, tks)`

Purpose: Normalize English tokens by applying lemmatization and stemming.
Parameters:
- tks (list of str): List of tokens.
Returns: List of normalized tokens.

Usage:
```
normalized = tokenizer.english_normalize_(["running", "dogs"])
```

Method: `_split_by_lang(self, line)`

Purpose: Split a line of text into segments of homogeneous language (Chinese vs non-Chinese).
Parameters:
- line (str): Input text line.
Returns: List of tuples (segment, is_chinese) indicating language of each segment.

Usage:
```
segments = tokenizer._split_by_lang("some 中文 text")
```
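Splitting by language can be sketched with `itertools.groupby`. This version assumes that "Chinese" means the code point range U+4E00 to U+9FA5 and that whitespace stays attached to the non-Chinese runs; the real method may pre-split on punctuation or trim segments differently.

```python
from itertools import groupby


def is_chinese(ch):
    """True for characters in the assumed range U+4E00..U+9FA5."""
    return "\u4e00" <= ch <= "\u9fa5"


def split_by_lang(line):
    """Group consecutive characters into (segment, is_chinese) runs."""
    return [("".join(group), flag) for flag, group in groupby(line, key=is_chinese)]


segments = split_by_lang("some 中文 text")
```

Each run can then be routed to the matching strategy: dictionary segmentation for Chinese runs, word tokenization and normalization for the rest.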

Method: `tokenize(self, line)`

Purpose: Main entry point to tokenize an input text line into a sequence of tokens.
Parameters:
- line (str): Raw input text.
Returns: (str) Tokenized and merged string.
Details:
- Normalizes text by converting full-width to half-width and traditional to simplified Chinese.
- Splits text by language segments.
- Uses forward and backward max matching with DFS refinement for Chinese.
- Uses English tokenization and normalization for non-Chinese segments.

Usage:
```
tokens = tokenizer.tokenize("Some input text 中文文本")
```
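When forward and backward matching disagree, only the disputed stretch of text needs the more expensive DFS pass. The sketch below shows one way to locate those stretches by comparing token boundaries; it is an illustrative technique with hypothetical helper names, not necessarily how `tokenize` walks the two results.

```python
def cut_points(tokens):
    """Character offsets at which a segmentation places a boundary."""
    cuts, pos = {0}, 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def ambiguous_spans(forward, backward):
    """Return (start, end) offsets where the two segmentations disagree."""
    f_cuts, b_cuts = cut_points(forward), cut_points(backward)
    shared = sorted(f_cuts & b_cuts)
    spans = []
    for start, end in zip(shared, shared[1:]):
        # A cut strictly inside a shared interval belongs to only one side.
        if any(start < c < end for c in f_cuts | b_cuts):
            spans.append((start, end))
    return spans


spans = ambiguous_spans(["研究生", "命", "起源"], ["研究", "生命", "起源"])
```

Here only the first four characters are disputed, so a refinement step would re-segment `text[0:4]` and keep 起源 as it is.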

Method: `fine_grained_tokenize(self, tks)`

Purpose: Further splits tokens into finer-grained units, especially useful if tokens contain subcomponents separated by / or other delimiters.
Parameters:
- tks (str): Tokenized string.
Returns: (str) Fine-grained tokenized string.

Usage:
```
fine_tokens = tokenizer.fine_grained_tokenize("token1 token2")
```

Utility Functions

These functions operate outside of the class for character-type checks and simple token joining.

- is_chinese(s): Returns True if character s is a Chinese character (Unicode range).
- is_number(s): Returns True if character s is a numeric digit.
- is_alphabet(s): Returns True if character s is an English alphabet letter.
- naiveQie(txt): Naively inserts spaces between alphabetic tokens when adjacent tokens end and start with letters.
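The helpers above are simple range checks. A minimal sketch follows; it assumes U+4E00 to U+9FA5 for Chinese, ASCII digits for numbers, and ASCII letters for the alphabet, and `naive_join` is a hypothetical stand-in for naiveQie that returns a list.

```python
def is_chinese(ch):
    return "\u4e00" <= ch <= "\u9fa5"  # assumed range


def is_number(ch):
    return "0" <= ch <= "9"


def is_alphabet(ch):
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


def naive_join(text):
    """Keep an explicit space between neighbouring alphabetic tokens."""
    tokens = []
    for tok in text.split():
        if tokens and is_alphabet(tokens[-1][-1]) and is_alphabet(tok[0]):
            tokens.append(" ")
        tokens.append(tok)
    return tokens


joined = naive_join("hello world 中文")
```

The explicit space keeps English words apart when the list is later concatenated, whereas Chinese tokens can sit next to each other without a separator.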

Module-level Objects and Functions

An instance of RagTokenizer is created as tokenizer.
Wrapper functions are exposed for convenience:
- tokenize = tokenizer.tokenize
- fine_grained_tokenize = tokenizer.fine_grained_tokenize
- tag = tokenizer.tag
- freq = tokenizer.freq
- loadUserDict = tokenizer.loadUserDict
- addUserDict = tokenizer.addUserDict
- tradi2simp = tokenizer._tradi2simp
- strQ2B = tokenizer._strQ2B

Important Implementation Details and Algorithms

- Trie-based dictionary: Uses datrie.Trie for efficient prefix search of tokens, enabling fast lookup and segmentation decisions.
- Forward and backward maximum matching: The tokenizer segments text using both forward and backward maximum matching algorithms (maxForward_ and maxBackward_) and compares results to improve segmentation accuracy.
- Depth-first search refinement: When forward and backward segmentations differ, a DFS approach (dfs_) explores possible segmentations to find the best scoring tokenization.
- Frequency scoring: Token frequencies from dictionary files are converted to logarithmic scale for scoring segmentation quality.
- Full-width to half-width conversion: Handles normalization of East Asian character widths.
- Traditional to simplified Chinese conversion: Uses hanziconv library for Chinese character normalization.
- English token normalization: Applies stemming and lemmatization to English tokens using NLTK.
- Language segmentation: Splits mixed-language text into homogeneous segments for language-appropriate tokenization.

Interaction with Other Parts of the System

Dependency on external libraries:
- datrie for trie implementation.
- hanziconv for Chinese character conversion.
- nltk for English tokenization, stemming, and lemmatization.
File utilities: Uses get_project_base_directory from api.utils.file_utils to resolve resource paths.
Resource files:
- Dictionary text files (huqie.txt) and cached tries (huqie.txt.trie) located in rag/res/huqie.
Usage scenario:
- Can be integrated into larger NLP pipelines requiring tokenization of Chinese-English mixed texts.
- Supports user dictionaries to customize token sets.

Usage Example

```
from rag_tokenizer import RagTokenizer

tokenizer = RagTokenizer(debug=True)
text = "公开征求意见稿提出，境外投资者可使用自有人民币或外汇投资。"
tokens = tokenizer.tokenize(text)
print(tokens)

fine_tokens = tokenizer.fine_grained_tokenize(tokens)
print(fine_tokens)
```

Mermaid Class Diagram of `RagTokenizer`

```mermaid
classDiagram
    class RagTokenizer {
        -DEBUG: bool
        -DENOMINATOR: int
        -DIR_: str
        -stemmer: PorterStemmer
        -lemmatizer: WordNetLemmatizer
        -SPLIT_CHAR: str
        -trie_: datrie.Trie
        +__init__(debug=False)
        +key_(line) string
        +rkey_(line) string
        +loadDict_(fnm)
        +loadUserDict(fnm)
        +addUserDict(fnm)
        +_strQ2B(ustring) string
        +_tradi2simp(line) string
        +dfs_(chars, s, preTks, tkslist, _depth=0, _memo=None) int
        +freq(tk) int
        +tag(tk) string
        +score_(tfts) (list, float)
        +sortTks_(tkslist) list
        +merge_(tks) string
        +maxForward_(line) (list, float)
        +maxBackward_(line) (list, float)
        +english_normalize_(tks) list
        +_split_by_lang(line) list
        +tokenize(line) string
        +fine_grained_tokenize(tks) string
    }
```

Summary

The rag_tokenizer.py file provides a dictionary-driven tokenizer specialized for Chinese text, with support for English and numeric tokens. It uses a trie for efficient token lookup, combines forward and backward maximum matching with DFS refinement where the two disagree, and applies linguistic normalization. The class is extensible via user dictionaries and is designed to integrate into larger NLP systems for precise token segmentation and tagging.
