`__init__.py`

Overview

This `__init__.py` file is part of the InfiniFlow project. It is a utility module that supplies helper functions and one decorator used throughout the codebase: token counting and truncation, singleton enforcement, whitespace cleanup in strings, Markdown fence stripping, safe float conversion, and reading the maximum value from a file. It also points the tiktoken library at a cache directory so that token encodings are cached in a known location.

The file contains no classes but includes multiple standalone functions and one decorator function. It interacts with other parts of the system primarily through the tiktoken library for token operations and through importing a utility function get_project_base_directory() from the project's API utilities.

Detailed Description of Functions and Decorators

1. `singleton(cls, *args, **kw)`

A decorator to enforce the singleton pattern on a class, ensuring only one instance of the class exists per process.

Parameters:
- cls: The class to be made singleton.
- *args, **kw: Arguments and keyword arguments to be passed to the class constructor.
Returns:
- A function _singleton that returns the single instance of the class.

Usage Example:

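@singleton
class MyClass:
    # With a bare @singleton, no constructor arguments are supplied
    # to singleton(), so __init__ takes none in this example.
    def __init__(self):
        self.val = 0

a = MyClass()
b = MyClass()
assert a is b  # both names refer to the same instance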

Implementation Detail:
- Uses a dictionary instances keyed by a combination of the class name and the current process ID (os.getpid()), enabling singleton behavior scoped to each process.
- Because the process ID is part of the key, each process creates and keeps its own instance; a child process does not reuse an instance created in its parent.
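
The per-process keying described above can be sketched as follows. This is a minimal illustration, not the project's source: the closure layout, the use of `str(cls)` in the key, and the `Registry` class are assumptions made for the example.

```python
import os


def singleton(cls, *args, **kw):
    """Sketch of a per-process singleton decorator (assumed layout)."""
    instances = {}  # maps "<class><pid>" to the one instance for that process

    def _singleton():
        # Including the PID means a forked child builds its own instance
        # instead of reusing the one it inherited from the parent.
        key = str(cls) + str(os.getpid())
        if key not in instances:
            instances[key] = cls(*args, **kw)
        return instances[key]

    return _singleton


@singleton
class Registry:  # hypothetical class, for illustration only
    def __init__(self):
        self.items = []


first = Registry()
second = Registry()
first.items.append("x")
print(first is second)  # True
print(second.items)     # ['x']
```

In this sketch `_singleton` takes no call-time arguments, so any constructor arguments would have to be passed to `singleton` itself when the decorator is applied.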

2. `rmSpace(txt: str) -> str`

Removes excessive spaces around certain characters in a string, improving formatting by eliminating unwanted whitespace.

Parameters:
- txt (str): The input text string.
Returns:
- A string with spaces removed in specific contexts.
Implementation Detail:
- Uses two regular expressions. The first removes spaces that follow a character other than a letter, a digit, or one of . , ) >. The second removes spaces that precede a character other than a letter, a digit, or one of . , ( <.
- Matching is case-insensitive.

Usage Example:

input_text = "Hello ,  world > this  is a test"
cleaned_text = rmSpace(input_text)
print(cleaned_text)
# Output: "Hello , world> this is a test"
# (the space before the comma stays, because ',' is one of the exempt characters)
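
One way to express the two substitutions described under Implementation Detail is sketched below. The exact patterns are an assumption reconstructed from that description, and `rm_space_sketch` is a hypothetical name, so treat the results as illustrative.

```python
import re


def rm_space_sketch(txt: str) -> str:
    # Pass 1: drop spaces that FOLLOW a character other than a letter,
    # digit, '.', ',', ')' or '>'. A space itself counts as such a
    # character, so a run of spaces collapses to a single space.
    txt = re.sub(r"([^a-z0-9.,\)>]) +([^ ])", r"\1\2", txt, flags=re.IGNORECASE)
    # Pass 2: drop spaces that PRECEDE a character other than a letter,
    # digit, '.', ',', '(' or '<'.
    return re.sub(r"([^ ]) +([^a-z0-9.,\(<])", r"\1\2", txt, flags=re.IGNORECASE)


print(rm_space_sketch("f ( x )"))                           # f (x)
print(rm_space_sketch("Hello ,  world > this  is a test"))  # Hello , world> this is a test
```

Spaces between ordinary words survive both passes, since letters and digits are exempt on both sides.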

3. `findMaxDt(fnm: str) -> str`

Reads a file line by line and returns the maximum date-time string found, assuming the file contains date-time entries in string format.

Parameters:
- fnm (str): Path to the file to read.
Returns:
- The maximum date-time string found in the file.
- Defaults to "1970-01-01 00:00:00" if file is empty or on error.
Implementation Detail:
- Skips lines that are 'nan'.
- Compares lines lexicographically, which yields the chronological maximum only when every line uses the same zero-padded format, such as YYYY-MM-DD HH:MM:SS.
- Ignores exceptions silently (failsafe).

Usage Example:

max_date = findMaxDt("dates.txt")
print(max_date)  # e.g., "2024-06-01 12:30:45"
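
Lexicographic comparison works here because a zero-padded `YYYY-MM-DD HH:MM:SS` string sorts in the same order as the instant it represents. The sketch below illustrates the reading loop under that assumption. `find_max_dt_sketch` and the temporary file are hypothetical, and the error handling is one plausible reading of "ignores exceptions silently".

```python
import os
import tempfile

DEFAULT_DT = "1970-01-01 00:00:00"


def find_max_dt_sketch(path: str) -> str:
    latest = DEFAULT_DT
    try:
        with open(path, "r") as f:
            for line in f:
                line = line.strip("\n")
                if line == "nan":  # placeholder rows carry no timestamp
                    continue
                if line > latest:  # plain string comparison
                    latest = line
    except Exception:
        pass  # unreadable or missing file: keep whatever was found so far
    return latest


with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("2024-05-31 23:59:59\nnan\n2024-06-01 12:30:45\n")

result = find_max_dt_sketch(tmp.name)
missing = find_max_dt_sketch(tmp.name + ".does-not-exist")
os.remove(tmp.name)

print(result)   # 2024-06-01 12:30:45
print(missing)  # 1970-01-01 00:00:00
```

Skipping 'nan' matters for correctness as well as tidiness: as a string, "nan" compares greater than any line that starts with a digit, so it would otherwise win the comparison.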

4. `findMaxTm(fnm: str) -> int`

Reads a file line by line and returns the maximum integer value found, ignoring 'nan' lines.

Parameters:
- fnm (str): Path to the file to read.
Returns:
- The maximum integer found.
- Defaults to 0 if file is empty or on error.
Implementation Detail:
- Converts lines to integers.
- Ignores exceptions silently.

Usage Example:

max_time = findMaxTm("times.txt")
print(max_time)  # e.g., 34567

5. `num_tokens_from_string(string: str) -> int`

Calculates the number of tokens in a given string using the tiktoken tokenizer.

Parameters:
- string (str): Input string to tokenize.
Returns:
- Integer count of tokens.
- Returns 0 if encoding fails.

Usage Example:

text = "Hello world"
count = num_tokens_from_string(text)
print(count)  # e.g., 2

Implementation Detail:
- Uses a pre-configured encoder cl100k_base from the tiktoken package.
- Exceptions during encoding result in a zero return value.

6. `truncate(string: str, max_len: int) -> str`

Truncates a string to a maximum number of tokens, preserving token boundaries.

Parameters:
- string (str): The input string.
- max_len (int): Maximum number of tokens allowed.
Returns:
- A string truncated to max_len tokens.

Usage Example:

long_text = "This is a very long text that might need truncation."
truncated_text = truncate(long_text, 5)
print(truncated_text)  # Output will contain first 5 tokens decoded back to string
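
Both token helpers follow the same pattern: encode the text, then either measure or slice the token list. The sketch below shows that pattern with a deliberately simple stand-in encoder so it runs without downloading an encoding. `ToyEncoder`, `count_tokens`, and `truncate_tokens` are hypothetical names, and the toy encoder splits on whitespace, whereas the module uses tiktoken's cl100k_base, whose token boundaries differ.

```python
class ToyEncoder:
    """Hypothetical stand-in for a tiktoken encoder: one token per word."""

    def __init__(self):
        self._ids = {}    # word -> token id
        self._words = []  # token id -> word

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            if word not in self._ids:
                self._ids[word] = len(self._words)
                self._words.append(word)
            ids.append(self._ids[word])
        return ids

    def decode(self, ids: list[int]) -> str:
        return " ".join(self._words[i] for i in ids)


encoder = ToyEncoder()


def count_tokens(text: str) -> int:
    try:
        return len(encoder.encode(text))
    except Exception:
        return 0  # mirrors the documented "0 on failure" behaviour


def truncate_tokens(text: str, max_len: int) -> str:
    # Slice the token list, then decode, so the cut never lands mid-token.
    return encoder.decode(encoder.encode(text)[:max_len])


text = "This is a very long text that might need truncation."
print(count_tokens(text))        # 10
print(truncate_tokens(text, 5))  # This is a very long
```

With tiktoken, `encoder` would instead come from `tiktoken.get_encoding("cl100k_base")`, which offers `encode` and `decode` methods of the same shape.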

7. `clean_markdown_block(text: str) -> str`

Cleans a fenced Markdown code block that uses the markdown language tag by removing the opening and closing backticks and stripping whitespace.

Parameters:
- text (str): The input markdown block string.
Returns:
- Cleaned string without the ```markdown fences.

Usage Example:

md_block = "```markdown\n# Title\nSome *markdown* content\n```"
cleaned = clean_markdown_block(md_block)
print(cleaned)  # Output: "# Title\nSome *markdown* content"

Implementation Detail:
- Uses regular expressions to remove opening and closing fenced code block markers.
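
A minimal sketch of the fence stripping, assuming one anchored pattern for the opening fence and one for the closing fence. The patterns and the name `clean_markdown_block_sketch` are illustrative rather than taken from the source.

```python
import re


def clean_markdown_block_sketch(text: str) -> str:
    # Remove a leading "```markdown" fence, allowing whitespace around it.
    text = re.sub(r"^\s*```markdown\s*\n?", "", text)
    # Remove a trailing "```" fence.
    text = re.sub(r"\n?\s*```\s*$", "", text)
    return text.strip()


block = "```markdown\n# Title\nSome *markdown* content\n```"
cleaned = clean_markdown_block_sketch(block)
untouched = clean_markdown_block_sketch("no fences here")
print(repr(cleaned))    # '# Title\nSome *markdown* content'
print(repr(untouched))  # 'no fences here'
```

Anchoring the patterns to the start and end of the string leaves any fences inside the body alone.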

8. `get_float(v) -> float`

Safely converts a value to float, returning negative infinity if conversion fails or value is None.

Parameters:
- v: Value to convert.
Returns:
- Float value or float('-inf') on failure.

Usage Example:

print(get_float("3.14"))  # 3.14
print(get_float(None))    # -inf
print(get_float("abc"))   # -inf
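
The conversion can be sketched as below. One likely reason for choosing negative infinity as the fallback is that unparseable values then rank below every real number when used as a sort or max key. That motivation is an assumption, as are the name `get_float_sketch` and the `scores` data.

```python
def get_float_sketch(v) -> float:
    if v is None:
        return float("-inf")
    try:
        return float(v)
    except Exception:
        return float("-inf")  # unparseable input ranks below everything


# Hypothetical data: pick the key with the highest usable score.
scores = {"a": "0.7", "b": None, "c": "n/a", "d": 0.9}
best = max(scores, key=lambda k: get_float_sketch(scores[k]))
print(best)  # d
```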

Important Implementation Details

- The singleton decorator includes the process ID (os.getpid()) in its key, so each process holds its own instance. This matters in multiprocessing setups.
- The module sets the TIKTOKEN_CACHE_DIR environment variable to the project's base directory, so tiktoken caches its encoding files in one known location.
- The tokenizer is cl100k_base, the tiktoken encoding used by OpenAI models such as gpt-3.5-turbo and gpt-4.
- findMaxDt and findMaxTm tolerate files that contain 'nan' entries or are malformed: they swallow exceptions and fall back to their default values.

Interaction with Other Parts of the System

- tiktoken library: This file initializes and configures the tokenizer used for counting and truncating text tokens. It affects any feature that needs token-based text processing, such as prompt preparation for language models.
- api.utils.file_utils.get_project_base_directory(): Used to retrieve the project's base directory, which becomes the tokenizer cache path.
- Other modules can import these utility functions and the singleton decorator to get consistent text and token handling across the application.

Mermaid Flowchart

Because this file defines no classes, only functions and a decorator, the diagram is a flowchart of module initialization and function groupings rather than a class diagram.

flowchart TD
    A[Module Initialization]
    A --> B[Set TIKTOKEN_CACHE_DIR env var]
    B --> C[Initialize tiktoken encoder]

    subgraph Decorator
        D["singleton(cls, *args, **kw)"]
    end

    subgraph TextUtils["Text Utilities"]
        E["rmSpace(txt)"]
        F["clean_markdown_block(text)"]
        G["get_float(v)"]
    end

    subgraph FileUtils["File Utilities"]
        H["findMaxDt(fnm)"]
        I["findMaxTm(fnm)"]
    end

    subgraph TokenFns["Tokenizer Functions"]
        J["num_tokens_from_string(string)"]
        K["truncate(string, max_len)"]
    end

    C -->|Encoder used by| J
    C -->|Encoder used by| K

Summary

This `__init__.py` utility module provides helpers for token counting and truncation, singleton enforcement, string cleanup, safe float conversion, and reading maximum values from files, for use across the InfiniFlow codebase. It also configures the tiktoken cache directory and encoder during module initialization, so callers share one tokenizer setup.
