utils.py
Overview
The utils.py file provides a utility function for reading text within the InfiniFlow project. It returns content as a Unicode string, taken either from binary input or from a file path on disk. For binary input it handles encoding detection and decoding, so callers do not need to know the source encoding in advance.
Functions
get_text(fnm: str, binary=None) -> str
Reads and returns the content of a text file or a binary text source as a string.
Parameters:
fnm (str): The filename or file path to read from. This parameter is used only if binary is not provided.
binary (bytes, optional): If provided, this binary data is decoded and returned as text instead of reading from a file.
Returns:
str: The decoded text content from the file or binary input.
Description:
If the binary parameter is given, the function uses find_codec() from the rag.nlp module to detect the encoding of the binary data, then decodes it to a string, ignoring any decoding errors.
If binary is not provided, the function opens the file specified by fnm in text mode ("r"), reads its contents line by line, and concatenates them into a single string.
The file is read with readline() inside a while True loop until EOF.
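The two branches described above can be sketched as follows. This is an illustrative re-creation based only on the description, not the project's actual source. The names get_text_sketch and find_codec_stub are hypothetical. The stub merely checks a few byte-order marks and stands in for rag.nlp.find_codec, whose real detection logic is not shown in this document. Testing binary with "is not None" is also an assumption.

```python
import codecs
import os
import tempfile


def find_codec_stub(blob: bytes) -> str:
    """Hypothetical stand-in for rag.nlp.find_codec (real logic not shown here)."""
    # Deliberately simplistic: look for a byte-order mark, else assume UTF-8.
    if blob.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if blob.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"


def get_text_sketch(fnm: str, binary=None) -> str:
    """Sketch of the described behavior; not the project's implementation."""
    # Assumption: "binary provided" is tested with `is not None`.
    if binary is not None:
        encoding = find_codec_stub(binary)
        # Undecodable bytes are dropped silently.
        return binary.decode(encoding, errors="ignore")

    txt = ""
    with open(fnm, "r") as f:
        while True:
            line = f.readline()
            if not line:  # readline() returns "" only at EOF
                break
            txt += line
    return txt


# Usage: exercise both branches with a temporary file and a UTF-16 byte string.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    with open(path, "w") as f:
        f.write("one\ntwo\n")
    from_file = get_text_sketch(path)

from_bytes = get_text_sketch("ignored.txt", binary=b"\xff\xfeH\x00e\x00l\x00l\x00o\x00")
print(repr(from_file), repr(from_bytes))
```

In this sketch, fnm is never touched when binary is supplied, which matches the parameter description above.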
Usage Example:
from utils import get_text
# Read text from a file path
content = get_text("example.txt")
print(content)
# Read text from binary data
binary_data = b'\xff\xfeH\x00e\x00l\x00l\x00o\x00'
text = get_text("ignored_filename.txt", binary=binary_data)
print(text) # Output depends on detected encoding
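The sample bytes in the example above are the UTF-16 little-endian encoding of "Hello", preceded by a byte-order mark. The short check below uses only Python's built-in codecs and does not call find_codec. It shows why the result depends on the detected encoding: the same bytes give different text under a different codec.

```python
binary_data = b'\xff\xfeH\x00e\x00l\x00l\x00o\x00'

# Decoded as UTF-16, the BOM is consumed and the text is recovered.
as_utf16 = binary_data.decode("utf-16", errors="ignore")

# Decoded as UTF-8, the two BOM bytes are invalid and dropped,
# and every NUL byte survives as a character.
as_utf8 = binary_data.decode("utf-8", errors="ignore")

print(repr(as_utf16))
print(repr(as_utf8))
```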
Implementation Details
Encoding Detection: The function uses find_codec() from the rag.nlp module to detect the encoding of binary data automatically. This avoids issues with unknown or varying encodings when processing binary text inputs.
Error Handling: When decoding binary data, it passes errors="ignore" so that undecodable bytes are skipped silently. This prevents exceptions on malformed or partial encodings, at the cost of dropping those bytes without any warning.
File Reading: For file input, the function reads line by line using readline() inside a loop until no more lines are available. Because every line is appended to a single string, the complete file content is still held in memory by the time the function returns.
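To make the error-handling trade-off concrete, the snippet below compares Python's standard decoding error handlers on a byte string containing one invalid byte. It uses only built-in str and bytes methods. The sample data is made up for illustration.

```python
# "café" in UTF-8, followed by a byte that is never valid in UTF-8, then "!".
data = "café".encode("utf-8") + b"\xff" + b"!"

# errors="ignore" drops the bad byte without any signal to the caller.
ignored = data.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible U+FFFD marker where the bad byte was.
replaced = data.decode("utf-8", errors="replace")

# The default (strict) handler raises instead.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

With "ignore", a caller cannot tell from the returned string that any bytes were lost, which is worth keeping in mind when the detected encoding may be wrong.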
Interaction with Other Modules
rag.nlp.find_codec: This function is imported and used to determine the text encoding of binary input data. This interaction is critical for correctly interpreting binary text sources.
File I/O: The function interacts with the file system by opening and reading text files when binary input is not provided.
This utility function is likely used by other modules in the InfiniFlow project wherever robust text loading from either files or binary blobs is needed.
Diagram: Function Flowchart
flowchart TD
A["get_text(fnm: str, binary=None)"]
A -->|binary provided| B["find_codec(binary) -> encoding"]
B --> C["binary.decode(encoding, errors='ignore')"]
C --> D[Return decoded text]
A -->|binary not provided| E[Open file fnm in read mode]
E --> F[Read lines one by one]
F --> G[Concatenate lines into txt]
G --> H[Return txt]
Summary
utils.py provides a simple yet flexible utility function get_text to abstract text data reading from files or binary input. Its integration with encoding detection and error-tolerant decoding makes it suitable for diverse text data sources. This file serves as a foundational utility for handling text input consistently across the InfiniFlow system.