json_parser.py

Overview

The json_parser.py file provides functionality to parse, process, and split JSON and JSON Lines (JSONL) formatted data into manageable chunks. It is primarily focused on handling potentially large JSON objects by splitting them into smaller JSON fragments of configurable maximum and minimum sizes. This is useful in scenarios such as Natural Language Processing (NLP) pipelines, data ingestion, or any system that requires chunking large JSON data for efficient processing or storage.

The core component in this file is the RAGFlowJsonParser class, which offers methods to:

- Detect if input text is in JSONL format.
- Parse JSON or JSONL content from binary input.
- Split complex nested JSON structures into smaller JSON chunks while preserving their structure.
- Optionally convert lists within JSON data to dictionaries keyed by list indices, aiding structured splitting.
- Return the resulting chunks either as JSON objects or serialized JSON strings.

This parser is adapted from LangChain's JSON text splitter, with modifications tailored to the InfiniFlow project.

Classes and Methods

`class RAGFlowJsonParser`

Parser class for handling JSON and JSONL input, splitting large JSON objects into smaller chunks while preserving the nested structure.

`__init__(self, max_chunk_size: int = 2000, min_chunk_size: int | None = None)`

Initializes the parser with chunk size limits.

Parameters:
- max_chunk_size (int): Desired maximum chunk size (in characters). Internally doubled to add buffer.
- min_chunk_size (Optional[int]): Desired minimum chunk size. If not provided, defaults to max(max_chunk_size - 200, 50).
Behavior:
- Sets chunk size limits for splitting JSON data.
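
The limits described above can be sketched as a standalone function. This is a minimal illustration rather than the class's actual constructor: the name `resolve_chunk_limits` is hypothetical, and the sketch assumes the default minimum is derived from the caller's original max_chunk_size, not from the doubled value.

```python
from typing import Optional, Tuple


def resolve_chunk_limits(
    max_chunk_size: int = 2000, min_chunk_size: Optional[int] = None
) -> Tuple[int, int]:
    """Return (effective_max, effective_min) for the given settings."""
    # The documented behavior: the requested maximum is doubled as a buffer.
    effective_max = max_chunk_size * 2
    if min_chunk_size is None:
        # Assumption: the default uses the undoubled max_chunk_size.
        effective_min = max(max_chunk_size - 200, 50)
    else:
        effective_min = min_chunk_size
    return effective_max, effective_min
```

Under these assumptions, the default settings would give an effective maximum of 4000 characters and a minimum of 1800.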

`__call__(self, binary) -> list[str]`

Parse input binary data by detecting encoding and format, and split into JSON chunks.

Parameters:
- binary (bytes): Binary input containing JSON or JSONL text.
Returns:
- list[str]: List of JSON chunk strings.

Usage Example:

parser = RAGFlowJsonParser()
with open("data.json", "rb") as f:
    chunks = parser(f.read())

Details:
- Decodes binary using encoding detected by find_codec.
- Checks if input is JSONL or JSON.
- Uses _parse_jsonl or _parse_json accordingly.

`is_jsonl_format(self, txt: str, sample_limit: int = 10, threshold: float = 0.8) -> bool`

Detects whether the input text is in JSON Lines (JSONL) format.

Parameters:
- txt (str): Input text to check.
- sample_limit (int): Number of lines to sample for detection.
- threshold (float): Minimum ratio of valid JSON lines for detection (0-1).
Returns:
- bool: True if text is likely JSONL, False otherwise.
Details:
- Attempts to parse whole text as JSON; if successful, returns False.
- Samples lines and counts how many parse as valid JSON objects.
- Returns True if the ratio of valid JSON lines among the sampled lines reaches threshold.
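
The detection steps listed above might look roughly like the following. This is a sketch under assumptions, not the parser's actual code: the function name `looks_like_jsonl` is hypothetical, blank lines are assumed to be ignored, the threshold comparison is assumed to be inclusive, and any valid JSON value on a line is counted as valid.

```python
import json


def looks_like_jsonl(txt: str, sample_limit: int = 10, threshold: float = 0.8) -> bool:
    """Heuristically decide whether txt is JSON Lines rather than plain JSON."""
    # Assumption: blank lines carry no information and are dropped.
    lines = [line.strip() for line in txt.splitlines() if line.strip()]
    if not lines:
        return False

    # If the whole text parses as one JSON document, it is not JSONL.
    try:
        json.loads(txt)
        return False
    except json.JSONDecodeError:
        pass

    # Sample the first few lines and count how many parse on their own.
    sample = lines[:sample_limit]
    valid = 0
    for line in sample:
        try:
            json.loads(line)
            valid += 1
        except json.JSONDecodeError:
            continue

    # Assumption: a ratio equal to the threshold counts as JSONL.
    return valid / len(sample) >= threshold
```

Note that a file containing a single JSON object on one line parses as a whole document, so this sketch treats it as plain JSON.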

`_is_valid_json(self, line: str) -> bool`

Helper to check if a single line is valid JSON.

Parameters:
- line (str): Line of text.
Returns:
- bool: True if valid JSON, False otherwise.

`_parse_json(self, content: str) -> list[str]`

Parse a JSON string and split into chunks.

Parameters:
- content (str): JSON text.
Returns:
- list[str]: List of JSON chunk strings.
Details:
- Loads JSON string.
- Uses split_json with convert_lists=True.
- Serializes chunks to strings.

`_parse_jsonl(self, content: str) -> list[str]`

Parse JSON Lines (JSONL) formatted string and split into chunks.

Parameters:
- content (str): JSONL text.
Returns:
- list[str]: List of JSON chunk strings.
Details:
- Splits input by lines.
- Parses each line as JSON.
- Splits each JSON object into chunks with convert_lists=True.
- Collects all chunks from all lines.
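
A rough sketch of this line-by-line flow follows. The function name `parse_jsonl_lines` and its `split` parameter are hypothetical stand-ins: `split` plays the role of split_json, and serializing with ensure_ascii=False is an assumption. Malformed lines are skipped, in line with the error handling described for this parser.

```python
import json
from typing import Any, Callable, List


def parse_jsonl_lines(content: str, split: Callable[[Any], List[Any]]) -> List[str]:
    """Parse each line as JSON, split it, and collect serialized chunks."""
    chunks: List[str] = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole input
        for piece in split(data):
            if piece:  # drop empty chunks
                # Assumption: non-ASCII characters are kept unescaped.
                chunks.append(json.dumps(piece, ensure_ascii=False))
    return chunks


# Usage with a trivial stand-in splitter that keeps each object whole.
result = parse_jsonl_lines('{"a": 1}\nnot json\n{"b": 2}', split=lambda obj: [obj])
```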

`split_json(self, json_data, convert_lists: bool = False) -> list[dict]`

Splits a JSON object into smaller JSON dictionaries (chunks).

Parameters:
- json_data (dict or list): JSON object to split.
- convert_lists (bool): Whether to convert lists to dicts keyed by indices before splitting.
Returns:
- list[dict]: List of JSON chunks as dictionaries.
Details:
- If convert_lists=True, converts all lists recursively to dicts for easier splitting.
- Calls _json_split to recursively split JSON while preserving structure.
- Removes empty last chunk if any.

`split_text(self, json_data: dict[str, Any], convert_lists: bool = False, ensure_ascii: bool = True) -> list[str]`

Splits JSON data into JSON formatted strings.

Parameters:
- json_data (dict): JSON object to split.
- convert_lists (bool): Whether to convert lists to dicts.
- ensure_ascii (bool): Whether to escape non-ASCII characters.
Returns:
- list[str]: List of JSON chunk strings.

Usage Example:

parser = RAGFlowJsonParser()
# my_large_json is a dict loaded beforehand, e.g. with json.load()
chunks = parser.split_text(json_data=my_large_json)
for chunk in chunks:
    print(chunk)

`_json_size(data: dict) -> int` (staticmethod)

Calculate the serialized size of a JSON object.

Parameters:
- data (dict): JSON object.
Returns:
- int: Number of characters in serialized JSON string.
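
The size measure can be illustrated with the standard json module. The function name `json_size` and its `ensure_ascii` parameter are hypothetical. Whether the parser escapes non-ASCII characters when measuring is not stated here, so the flag is exposed only to show how it changes the count.

```python
import json


def json_size(data: dict, ensure_ascii: bool = True) -> int:
    """Return the number of characters in the serialized form of data."""
    return len(json.dumps(data, ensure_ascii=ensure_ascii))


ascii_size = json_size({"k": "é"})                     # "é" is escaped as \u00e9
raw_size = json_size({"k": "é"}, ensure_ascii=False)   # "é" stays one character
```

Escaped output is longer, so the choice affects how much non-ASCII data fits within a chunk limit.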

`_set_nested_dict(d: dict, path: list[str], value: Any) -> None` (staticmethod)

Set a value in a nested dictionary at the given path, creating intermediate dictionaries as needed.

Parameters:
- d (dict): Dictionary to modify.
- path (list[str]): List of keys representing the nested path.
- value (Any): Value to set.
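
One common way to write such a helper is shown below. This is a sketch with a hypothetical name (`set_nested`), and it assumes the path is non-empty.

```python
from typing import Any, List


def set_nested(d: dict, path: List[str], value: Any) -> None:
    """Set d[path[0]][path[1]]...[path[-1]] = value, creating dicts on the way."""
    for key in path[:-1]:
        # setdefault returns the existing child dict or inserts a new one.
        d = d.setdefault(key, {})
    d[path[-1]] = value


example: dict = {}
set_nested(example, ["user", "address", "city"], "Paris")
set_nested(example, ["user", "name"], "Ada")
```

After these two calls, `example` holds both values under a shared "user" key.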

`_list_to_dict_preprocessing(self, data: Any) -> Any`

Recursively converts lists in JSON data to dictionaries keyed by string indices.

Parameters:
- data (Any): JSON data (dict, list, or primitive).
Returns:
- Any: JSON data with lists converted to dicts.
Details:
- This step helps to preserve list order while enabling recursive splitting by keys.
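
The conversion can be sketched as a short recursive function. The name `lists_to_dicts` is hypothetical. The sketch only assumes what is described above: lists become dicts keyed by their string indices.

```python
from typing import Any


def lists_to_dicts(data: Any) -> Any:
    """Recursively replace every list with a dict keyed by string indices."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        # Index order is kept because dicts preserve insertion order.
        return {str(i): lists_to_dicts(item) for i, item in enumerate(data)}
    return data  # primitives are returned unchanged


converted = lists_to_dicts({"tags": ["a", "b"], "rows": [{"ids": [1, 2]}]})
```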

`_json_split(self, data, current_path: list[str] | None, chunks: list[dict] | None) -> list[dict]`

Recursively splits a JSON object into chunks of limited size, preserving the JSON structure.

Parameters:
- data (dict or primitive): JSON data to split.
- current_path (list[str] or None): Current nested path within JSON.
- chunks (list[dict] or None): Accumulating list of chunks.
Returns:
- list[dict]: List of JSON chunks (dictionaries).
Algorithm Details:
- Iterates over keys in dictionaries.
- Measures size of adding each key-value pair to current chunk.
- If size exceeds max_chunk_size, starts a new chunk if current chunk >= min_chunk_size.
- Recurses into nested values.
- For primitives, sets value at the current path in the last chunk.
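
The algorithm above can be sketched as a standalone function. This is an illustration under assumptions, not the method itself: it follows the general approach of the LangChain splitter this parser is adapted from, the names `json_split`, `_size`, and `_set_nested` are hypothetical, and the top-level input is assumed to be a dict.

```python
import json
from typing import Any, List, Optional


def _size(obj: Any) -> int:
    """Number of characters in the serialized JSON."""
    return len(json.dumps(obj))


def _set_nested(d: dict, path: List[str], value: Any) -> None:
    """Set value at a nested path, creating intermediate dicts."""
    for key in path[:-1]:
        d = d.setdefault(key, {})
    d[path[-1]] = value


def json_split(
    data: Any,
    max_size: int,
    min_size: int,
    path: Optional[List[str]] = None,
    chunks: Optional[List[dict]] = None,
) -> List[dict]:
    """Split data into nested dict chunks of roughly bounded serialized size."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict):
        for key, value in data.items():
            new_path = path + [key]
            current = _size(chunks[-1])
            added = _size({key: value})
            if added < max_size - current:
                # The pair fits: store it at its full path in the current chunk.
                _set_nested(chunks[-1], new_path, value)
            else:
                # Start a new chunk only if the current one is large enough.
                if current >= min_size:
                    chunks.append({})
                # Recurse so an oversized value is split further.
                json_split(value, max_size, min_size, new_path, chunks)
    else:
        # Primitive: write it at the current path in the last chunk.
        _set_nested(chunks[-1], path, data)
    return chunks


data = {"a": "x" * 30, "b": "y" * 30, "c": {"d": "z" * 30, "e": 1}}
chunks = json_split(data, max_size=60, min_size=10)
```

Because each value is written at its full path, every chunk keeps the nesting of the original object, which is how the hierarchy is preserved across chunks.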

Important Implementation Details

- Chunk size doubling: The max_chunk_size provided is internally doubled (self.max_chunk_size = max_chunk_size * 2) to leave buffer space during chunking.
- List to dict conversion: Lists are converted to dicts with string indices so that they can be split recursively by key. This preserves order and structure.
- Preserving JSON structure: The splitting algorithm retains the nested structure of the input JSON, avoiding flattening or losing hierarchy.
- JSON Lines format detection: The parser detects JSONL heuristically by sampling lines and checking whether each one parses as valid JSON.
- Graceful error handling: JSON decode errors raised while parsing individual lines or the whole content are caught and the offending input is skipped, allowing partial processing.

Interaction with Other Components

- Imports find_codec from the rag.nlp module, which presumably detects the character encoding of the input binary data.
- Designed as a utility/parser component that can be used in data ingestion or NLP pipelines.
- Can be invoked directly with binary data or via its methods to split JSON objects.
- Likely used in conjunction with other system components that process or index JSON data in chunks.

Usage Example

from json_parser import RAGFlowJsonParser

parser = RAGFlowJsonParser(max_chunk_size=1000)

with open("large_data.json", "rb") as f:
    chunks = parser(f.read())

for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx}: {chunk}")

Mermaid Class Diagram

classDiagram
    class RAGFlowJsonParser {
        -int max_chunk_size
        -int min_chunk_size
        +__init__(max_chunk_size: int = 2000, min_chunk_size: int | None = None)
        +__call__(binary) list~str~
        +is_jsonl_format(txt: str, sample_limit: int = 10, threshold: float = 0.8) bool
        +split_json(json_data, convert_lists: bool = False) list~dict~
        +split_text(json_data: dict, convert_lists: bool = False, ensure_ascii: bool = True) list~str~
        -_parse_json(content: str) list~str~
        -_parse_jsonl(content: str) list~str~
        -_json_split(data, current_path: list~str~ | None, chunks: list~dict~ | None) list~dict~
        -_list_to_dict_preprocessing(data: Any) Any
        -_json_size(data: dict) int$
        -_set_nested_dict(d: dict, path: list~str~, value: Any) None$
        -_is_valid_json(line: str) bool
    }
