__init__.py
Overview
This __init__.py file is part of the InfiniFlow project and serves as a utility module of helper functions used throughout the codebase. It provides token counting and truncation, a singleton decorator, whitespace cleanup in strings, Markdown fence removal, safe float conversion, and helpers that read the maximum value from a file. It also points the tiktoken library at a cache directory by setting an environment variable.
The file contains no classes; it consists of standalone functions and one decorator. It depends on the tiktoken library for token operations and imports the utility function get_project_base_directory() from the project's API utilities.
Detailed Description of Functions and Decorators
1. singleton(cls, *args, **kw)
A decorator to enforce the singleton pattern on a class, ensuring only one instance of the class exists per process.
Parameters:
cls: The class to be made a singleton.
*args, **kw: Positional and keyword arguments passed to the class constructor when the instance is first created.
Returns:
A function _singleton that returns the single instance of the class.
Usage Example:
    @singleton
    class MyClass:
        def __init__(self):
            self.val = 10

    a = MyClass()
    b = MyClass()
    assert a is b  # both names refer to the same instance
Implementation Detail:
Uses a dictionary instances keyed by a combination of the class name and the current process ID (os.getpid()), so singleton behavior is scoped to each process.
Because the process ID is part of the key, a forked child process creates its own instance instead of reusing one inherited from its parent.
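The per-process keying described above can be sketched as follows. This is a minimal illustration, not the module's code: it assumes the returned _singleton function takes no arguments of its own and that constructor arguments are bound when singleton is applied. The Registry class is hypothetical.

```python
import os


def singleton(cls, *args, **kw):
    # One cache per decorated class; each entry is keyed by class and process ID.
    instances = {}

    def _singleton():
        # Assumption: the key combines the class's string form with the PID.
        key = str(cls) + str(os.getpid())
        if key not in instances:
            instances[key] = cls(*args, **kw)
        return instances[key]

    return _singleton


@singleton
class Registry:  # hypothetical example class
    def __init__(self):
        self.items = []


first = Registry()
second = Registry()
first.items.append("x")  # visible through both names, since they share one object
```

Under this sketch the name Registry is rebound to the _singleton function, so isinstance checks against Registry no longer work; that trade-off is typical of function-based singleton decorators.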
2. rmSpace(txt: str) -> str
Removes excessive spaces around certain characters in a string, improving formatting by eliminating unwanted whitespace.
Parameters:
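txt (str): The input text string.
Returns:
A string with spaces removed in specific contexts.
Implementation Detail:
Applies two regular expressions in sequence. The first removes spaces that follow any character other than a letter, a digit, or one of . , ) >. The second removes spaces that precede any character other than a letter, a digit, or one of . , ( <.
Both replacements are case-insensitive.
Usage Example:
    input_text = "Hello , world > this is a test"
    cleaned_text = rmSpace(input_text)
    print(cleaned_text)  # Output: "Hello , world> this is a test"
The space before the comma is kept because the comma is among the protected characters; the space before > is removed.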
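The two-pass substitution can be sketched as below. The patterns are reconstructed from the description above and are an assumption; the module's actual expressions may differ in detail. The name rm_space_sketch is hypothetical.

```python
import re


def rm_space_sketch(txt: str) -> str:
    # Pass 1 (assumed pattern): drop spaces that FOLLOW a character
    # other than a letter, digit, or one of . , ) >
    txt = re.sub(r"([^a-z0-9.,)>]) +([^ ])", r"\1\2", txt, flags=re.IGNORECASE)
    # Pass 2 (assumed pattern): drop spaces that PRECEDE a character
    # other than a letter, digit, or one of . , ( <
    return re.sub(r"([^ ]) +([^a-z0-9.,(<])", r"\1\2", txt, flags=re.IGNORECASE)


bracketed = rm_space_sketch("( hello )")
sentence = rm_space_sketch("Hello , world > this is a test")

print(bracketed)  # (hello)
print(sentence)   # Hello , world> this is a test
```

Spaces between ordinary words are untouched because letters and digits are excluded from both character classes.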
3. findMaxDt(fnm: str) -> str
Reads a file line by line and returns the maximum date-time string found, assuming the file contains date-time entries in string format.
Parameters:
fnm (str): Path to the file to read.
Returns:
The maximum date-time string found in the file.
Defaults to "1970-01-01 00:00:00" if the file is empty or on error.
Implementation Detail:
Skips lines that are 'nan'.
Compares lines lexicographically, which matches chronological order only when every line uses the same fixed-width, zero-padded format (for example ISO 8601 style YYYY-MM-DD HH:MM:SS).
Ignores exceptions silently (failsafe).
Usage Example:
    max_date = findMaxDt("dates.txt")
    print(max_date)  # e.g., "2024-06-01 12:30:45"
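The behavior described above (skip 'nan', compare strings, fall back to a default) can be sketched as follows. This is an illustration under assumptions, not the module's code: find_max_dt_sketch and the sample file contents are hypothetical.

```python
import os
import tempfile

DEFAULT_DT = "1970-01-01 00:00:00"


def find_max_dt_sketch(path: str) -> str:
    latest = DEFAULT_DT
    try:
        with open(path, "r") as f:
            for line in f:
                line = line.rstrip("\n")
                # 'nan' must be skipped explicitly: as a string it sorts
                # after any timestamp that starts with a digit.
                if line == "nan":
                    continue
                if line > latest:  # plain string comparison
                    latest = line
    except Exception:
        # Fail silently, as the description above states.
        pass
    return latest


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dates.txt")
    with open(path, "w") as f:
        f.write("2024-01-15 08:00:00\nnan\n2024-06-01 12:30:45\n2023-12-31 23:59:59\n")
    latest = find_max_dt_sketch(path)
    missing = find_max_dt_sketch(os.path.join(tmp, "no_such_file.txt"))

print(latest)   # 2024-06-01 12:30:45
print(missing)  # 1970-01-01 00:00:00
```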
4. findMaxTm(fnm: str) -> int
Reads a file line by line and returns the maximum integer value found, ignoring 'nan' lines.
Parameters:
fnm (str): Path to the file to read.
Returns:
The maximum integer found.
Defaults to 0 if the file is empty or on error.
Implementation Detail:
Converts lines to integers.
Ignores exceptions silently.
Usage Example:
    max_time = findMaxTm("times.txt")
    print(max_time)  # e.g., 34567
5. num_tokens_from_string(string: str) -> int
Calculates the number of tokens in a given string using the tiktoken tokenizer.
Parameters:
string (str): Input string to tokenize.
Returns:
Integer count of tokens.
Returns 0 if encoding fails.
Usage Example:
    text = "Hello world"
    count = num_tokens_from_string(text)
    print(count)  # e.g., 2
Implementation Detail:
Uses a pre-configured cl100k_base encoder from the tiktoken package.
Exceptions during encoding result in a zero return value.
6. truncate(string: str, max_len: int) -> str
Truncates a string to a maximum number of tokens, preserving token boundaries.
Parameters:
string (str): The input string.
max_len (int): Maximum number of tokens allowed.
Returns:
A string truncated to max_len tokens.
Usage Example:
    long_text = "This is a very long text that might need truncation."
    truncated_text = truncate(long_text, 5)
    print(truncated_text)  # the first 5 tokens, decoded back to a string
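Both tokenizer helpers, num_tokens_from_string and truncate, follow an encode-then-measure or encode-slice-decode pattern. The sketch below shows that pattern with a hypothetical WordEncoder that treats each whitespace-separated word as one token, so it runs without tiktoken; it is a stand-in, not the cl100k_base encoding, and real token counts will differ. With tiktoken installed, tiktoken.get_encoding("cl100k_base") returns an object with the same encode and decode methods.

```python
class WordEncoder:
    """Hypothetical stand-in for a tiktoken encoding: one token per word."""

    def __init__(self):
        self._ids = {}    # word -> token id
        self._words = []  # token id -> word

    def encode(self, text: str) -> list:
        ids = []
        for word in text.split():
            if word not in self._ids:
                self._ids[word] = len(self._words)
                self._words.append(word)
            ids.append(self._ids[word])
        return ids

    def decode(self, ids: list) -> str:
        return " ".join(self._words[i] for i in ids)


encoder = WordEncoder()


def num_tokens_sketch(string: str) -> int:
    # Count tokens; any encoding failure yields 0, as described above.
    try:
        return len(encoder.encode(string))
    except Exception:
        return 0


def truncate_sketch(string: str, max_len: int) -> str:
    # Encode, keep the first max_len token ids, decode back to text.
    return encoder.decode(encoder.encode(string)[:max_len])


text = "This is a very long text that might need truncation."
count = num_tokens_sketch(text)
short = truncate_sketch(text, 5)
failed = num_tokens_sketch(None)  # encode raises on None, so the fallback applies

print(count)   # 10
print(short)   # This is a very long
print(failed)  # 0
```

Slicing the token list rather than the character string is what keeps the cut on a token boundary.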
7. clean_markdown_block(text: str) -> str
Cleans a fenced Markdown code block that uses the markdown language tag by removing the opening and closing backticks and stripping whitespace.
Parameters:
text (str): The input markdown block string.
Returns:
Cleaned string without the ```markdown fences.
Usage Example:
    md_block = "```markdown\n# Title\nSome *markdown* content\n```"
    cleaned = clean_markdown_block(md_block)
    print(cleaned)  # Output: "# Title\nSome *markdown* content"
Implementation Detail:
Uses regular expressions to remove opening and closing fenced code block markers.
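The fence removal can be sketched with two anchored substitutions. The patterns below are an assumption based on the description above, and clean_markdown_block_sketch is a hypothetical name; the fence string is built programmatically only so that the example itself contains no literal fence.

```python
import re

FENCE = "`" * 3  # three backticks, assembled to avoid a literal fence in this example


def clean_markdown_block_sketch(text: str) -> str:
    # Remove an opening fence tagged "markdown" at the very start (assumed pattern).
    text = re.sub(r"^\s*" + FENCE + r"markdown\s*\n?", "", text)
    # Remove a closing fence at the very end (assumed pattern).
    text = re.sub(r"\n?\s*" + FENCE + r"\s*$", "", text)
    return text.strip()


md_block = FENCE + "markdown\n# Title\nSome *markdown* content\n" + FENCE
cleaned = clean_markdown_block_sketch(md_block)
untouched = clean_markdown_block_sketch("plain text")

print(repr(cleaned))    # '# Title\nSome *markdown* content'
print(repr(untouched))  # 'plain text'
```

Text without fences passes through unchanged apart from the final strip.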
8. get_float(v) -> float
Safely converts a value to float, returning negative infinity if conversion fails or value is None.
Parameters:
v: Value to convert.
Returns:
Float value or float('-inf') on failure.
Usage Example:
    print(get_float("3.14"))  # 3.14
    print(get_float(None))    # -inf
    print(get_float("abc"))   # -inf
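The source does not say why negative infinity is the fallback. One plausible use, shown here as an assumption, is as a sort or max key: unparseable values then rank below every real number instead of raising. The function get_float_sketch and the scores list are hypothetical.

```python
def get_float_sketch(v):
    # None and anything float() rejects both map to negative infinity.
    if v is None:
        return float("-inf")
    try:
        return float(v)
    except Exception:
        return float("-inf")


scores = ["0.72", None, "abc", "0.91", 0.5]  # hypothetical mixed input

best = max(scores, key=get_float_sketch)
ranked = sorted(scores, key=get_float_sketch, reverse=True)

print(best)    # 0.91
print(ranked)  # ['0.91', '0.72', 0.5, None, 'abc']
```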
Important Implementation Details
The singleton decorator uses process IDs (os.getpid()) in the key, ensuring singleton instances are unique per process, which is especially useful in multiprocessing environments.
The file sets the TIKTOKEN_CACHE_DIR environment variable to the project's base directory, so tiktoken's encoding cache is kept in one consistent location.
The tokenizer used is cl100k_base, the encoding used by OpenAI models such as gpt-3.5-turbo and gpt-4.
The findMaxDt and findMaxTm functions are designed to safely handle files that may contain 'nan' entries or be malformed, failing silently and returning default values.
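The cache configuration amounts to setting an environment variable before the encoding is first loaded, since tiktoken consults TIKTOKEN_CACHE_DIR when it loads an encoding. A minimal sketch, in which project_base_directory is a hypothetical stand-in for get_project_base_directory() and simply returns the current working directory:

```python
import os


def project_base_directory() -> str:
    """Hypothetical stand-in for get_project_base_directory()."""
    return os.getcwd()


cache_dir = project_base_directory()
# Must be set before the encoding is first loaded.
os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir

# With tiktoken installed, the encoder would be created after this point:
#   import tiktoken
#   encoder = tiktoken.get_encoding("cl100k_base")
```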
Interaction with Other Parts of the System
tiktoken library: This file initializes and configures the tokenizer for counting and truncating text tokens. It affects any feature in the system that requires token-based text processing, such as prompt preparation for language models.
api.utils.file_utils.get_project_base_directory(): Used to retrieve the base directory of the project to configure cache paths for tokenizers.
Other modules can import these utility functions and the singleton decorator to enforce design patterns and consistent text/token handling across the application.
Mermaid Flowchart
Because this file defines no classes, a flowchart of its functions and their relationships is used in place of a class diagram.
flowchart TD
    A[Module Initialization]
    A --> B[Set TIKTOKEN_CACHE_DIR env var]
    A --> C[Initialize tiktoken encoder]
    subgraph Decorator
        D["singleton(cls, *args, **kw)"]
    end
    subgraph Text Utilities
        E["rmSpace(txt)"]
        F["clean_markdown_block(text)"]
        G["get_float(v)"]
    end
    subgraph File Utilities
        H["findMaxDt(fnm)"]
        I["findMaxTm(fnm)"]
    end
    subgraph Tokenizer Functions
        J["num_tokens_from_string(string)"]
        K["truncate(string, max_len)"]
    end
    B --> C
    C --> J
    C --> K
    D -->|Decorator| A
    E -->|Input Text| A
    F -->|Input Text| A
    G -->|Input Value| A
    H -->|Input File| A
    I -->|Input File| A
    J -->|Uses encoder| C
    K -->|Uses encoder| C
Summary
This __init__.py utility module provides foundational helpers for token processing, singleton pattern implementation, string formatting fixes, and file content analysis used across the InfiniFlow codebase. It configures the tokenization environment and exposes small functions that wrap common operations and fall back to default values on failure, promoting code reuse and consistency.