json_parser.py
Overview
The json_parser.py file provides functionality to parse, process, and split JSON and JSON Lines (JSONL) formatted data into manageable chunks. It is primarily focused on handling potentially large JSON objects by splitting them into smaller JSON fragments of configurable maximum and minimum sizes. This is useful in scenarios such as Natural Language Processing (NLP) pipelines, data ingestion, or any system that requires chunking large JSON data for efficient processing or storage.
The core component in this file is the RAGFlowJsonParser class, which offers methods to:
Detect if input text is in JSONL format.
Parse JSON or JSONL content from binary input.
Split complex nested JSON structures into smaller JSON chunks while preserving their structure.
Optionally convert lists within JSON data to dictionaries keyed by list indices, aiding structured splitting.
Return the resulting chunks either as JSON objects or serialized JSON strings.
This parser is adapted from LangChain's JSON text splitter, with modifications tailored to the InfiniFlow project.
Classes and Methods
class RAGFlowJsonParser
Parser class for handling JSON and JSONL input, splitting large JSON objects into smaller chunks while preserving the nested structure.
__init__(self, max_chunk_size: int = 2000, min_chunk_size: int | None = None)
Initializes the parser with chunk size limits.
Parameters:
max_chunk_size (int): Desired maximum chunk size, in characters. Internally doubled to leave buffer space.
min_chunk_size (Optional[int]): Desired minimum chunk size. If not provided, defaults to max(max_chunk_size - 200, 50).
Behavior:
Sets chunk size limits for splitting JSON data.
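The effect of these parameters can be illustrated with a small standalone sketch. The function name effective_limits is hypothetical and not part of the class; it only mirrors the documented doubling and the documented default, and it assumes the default minimum is computed from the value passed in, before doubling.

```python
from __future__ import annotations


def effective_limits(max_chunk_size: int = 2000, min_chunk_size: int | None = None) -> tuple[int, int]:
    """Return (effective_max, effective_min) as the documentation describes them."""
    # The requested maximum is doubled internally to leave buffer space.
    effective_max = max_chunk_size * 2
    # Documented default: the larger of (max_chunk_size - 200) and 50.
    # Assumption: this uses the value passed in, not the doubled one.
    if min_chunk_size is None:
        min_chunk_size = max(max_chunk_size - 200, 50)
    return effective_max, min_chunk_size


print(effective_limits())     # (4000, 1800)
print(effective_limits(100))  # (200, 50)
```

With the defaults, a requested maximum of 2000 characters therefore behaves like a limit of 4000 during splitting.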
__call__(self, binary) -> list[str]
Parse input binary data by detecting encoding and format, and split into JSON chunks.
Parameters:
binary (bytes): Binary input containing JSON or JSONL text.
Returns:
list[str]: List of JSON chunk strings.
Usage Example:
parser = RAGFlowJsonParser()
with open("data.json", "rb") as f:
    chunks = parser(f.read())
Details:
Decodes the binary input using the encoding detected by find_codec.
Checks whether the input is JSONL or JSON.
Uses _parse_jsonl or _parse_json accordingly.
is_jsonl_format(self, txt: str, sample_limit: int = 10, threshold: float = 0.8) -> bool
Detects whether the input text is in JSON Lines (JSONL) format.
Parameters:
txt (str): Input text to check.
sample_limit (int): Number of lines to sample for detection.
threshold (float): Minimum ratio of valid JSON lines for detection (0-1).
Returns:
bool: True if text is likely JSONL, False otherwise.
Details:
Attempts to parse the whole text as JSON; if that succeeds, the input is a single JSON document and the method returns False.
Samples up to sample_limit lines and counts how many parse as valid JSON.
Returns True if the ratio of valid lines in the sample reaches threshold.
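The three detection steps above can be sketched as a standalone function. This is a minimal illustration, not the class's actual code: the name looks_like_jsonl is hypothetical, sampling the first non-empty lines is an assumption, and so is the use of ">=" for the threshold comparison.

```python
import json


def looks_like_jsonl(txt: str, sample_limit: int = 10, threshold: float = 0.8) -> bool:
    """Heuristic JSONL check following the documented steps."""
    lines = [line.strip() for line in txt.splitlines() if line.strip()]
    if not lines:
        return False
    # Step 1: text that parses as a whole is a single JSON document, not JSONL.
    try:
        json.loads(txt)
        return False
    except json.JSONDecodeError:
        pass
    # Step 2: sample the first non-empty lines and count the ones that parse.
    sample = lines[:sample_limit]
    valid = 0
    for line in sample:
        try:
            json.loads(line)
            valid += 1
        except json.JSONDecodeError:
            continue
    # Step 3: compare the ratio of valid lines with the threshold.
    # Assumption: ">=" is used, since threshold is described as a minimum ratio.
    return valid / len(sample) >= threshold
```

Note that under this heuristic a file containing exactly one JSON object on one line is treated as plain JSON, because the whole-text parse succeeds first.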
_is_valid_json(self, line: str) -> bool
Helper to check if a single line is valid JSON.
Parameters:
line (str): Line of text.
Returns:
bool: True if valid JSON, False otherwise.
_parse_json(self, content: str) -> list[str]
Parse a JSON string and split into chunks.
Parameters:
content (str): JSON text.
Returns:
list[str]: List of JSON chunk strings.
Details:
Loads JSON string.
Uses split_json with convert_lists=True.
Serializes chunks to strings.
_parse_jsonl(self, content: str) -> list[str]
Parse JSON Lines (JSONL) formatted string and split into chunks.
Parameters:
content (str): JSONL text.
Returns:
list[str]: List of JSON chunk strings.
Details:
Splits input by lines.
Parses each line as JSON.
Splits each JSON object into chunks with convert_lists=True.
Collects all chunks from all lines.
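The line-by-line parsing step can be sketched as follows. The helper name parse_jsonl_lines is hypothetical, and skipping malformed or empty lines is an assumption based on the error-handling behavior this document describes; the chunk splitting that follows in the real method is left out.

```python
import json


def parse_jsonl_lines(content: str) -> list:
    """Parse each non-empty line as JSON, skipping lines that fail to decode."""
    records = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # Assumed behavior: a malformed line is skipped instead of aborting.
            continue
    return records


records = parse_jsonl_lines('{"a": 1}\nnot json\n\n{"b": [1, 2]}')
print(records)
```

Each parsed record would then be split into chunks independently, so one bad line does not prevent the remaining lines from being processed.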
split_json(self, json_data, convert_lists: bool = False) -> list[dict]
Splits a JSON object into smaller JSON dictionaries (chunks).
Parameters:
json_data (dict or list): JSON object to split.
convert_lists (bool): Whether to convert lists to dicts keyed by indices before splitting.
Returns:
list[dict]: List of JSON chunks as dictionaries.
Details:
If convert_lists=True, recursively converts all lists to dicts for easier splitting.
Calls _json_split to recursively split the JSON while preserving structure.
Removes the empty last chunk, if any.
split_text(self, json_data: dict[str, Any], convert_lists: bool = False, ensure_ascii: bool = True) -> list[str]
Splits JSON data into JSON formatted strings.
Parameters:
json_data (dict): JSON object to split.
convert_lists (bool): Whether to convert lists to dicts.
ensure_ascii (bool): Whether to escape non-ASCII characters.
Returns:
list[str]: List of JSON chunk strings.
Usage Example:
parser = RAGFlowJsonParser()
chunks = parser.split_text(json_data=my_large_json)
for chunk in chunks:
    print(chunk)
_json_size(data: dict) -> int (staticmethod)
Calculate the serialized size of a JSON object.
Parameters:
data (dict): JSON object.
Returns:
int: Number of characters in serialized JSON string.
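A minimal sketch of such a size measurement, assuming it is simply the length of the json.dumps output. The name json_size and its ensure_ascii parameter are hypothetical; the example shows that the measured size depends on how non-ASCII characters are serialized.

```python
import json


def json_size(data: dict, ensure_ascii: bool = False) -> int:
    """Return the number of characters in the serialized form of data."""
    # Assumption: size is measured on the default json.dumps layout
    # (", " and ": " separators, no indentation).
    return len(json.dumps(data, ensure_ascii=ensure_ascii))


print(json_size({"a": 1}))                       # 8
print(json_size({"k": "é"}))                     # 10
print(json_size({"k": "é"}, ensure_ascii=True))  # 15
```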
_set_nested_dict(d: dict, path: list[str], value: Any) -> None (staticmethod)
Set a value in a nested dictionary at the given path, creating intermediate dictionaries as needed.
Parameters:
d (dict): Dictionary to modify.
path (list[str]): List of keys representing the nested path.
value (Any): Value to set.
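The behavior described here can be sketched in a few lines. The function name set_nested_dict is used without the leading underscore to mark it as an illustration rather than the class's own method, and a non-empty path is assumed.

```python
from __future__ import annotations

from typing import Any


def set_nested_dict(d: dict, path: list[str], value: Any) -> None:
    """Set value at the nested path, creating intermediate dicts as needed."""
    # Walk every key except the last, creating missing levels on the way.
    for key in path[:-1]:
        d = d.setdefault(key, {})
    # Assumes path is non-empty; an empty path would raise IndexError here.
    d[path[-1]] = value


doc: dict = {}
set_nested_dict(doc, ["a", "b", "c"], 1)
set_nested_dict(doc, ["a", "d"], 2)
print(doc)  # {'a': {'b': {'c': 1}, 'd': 2}}
```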
_list_to_dict_preprocessing(self, data: Any) -> Any
Recursively converts lists in JSON data to dictionaries keyed by string indices.
Parameters:
data (Any): JSON data (dict, list, or primitive).
Returns:
Any: JSON data with lists converted to dicts.
Details:
This step helps to preserve list order while enabling recursive splitting by keys.
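As an illustration of this preprocessing, the following sketch converts every list into a dict keyed by its string indices. The name list_to_dict is hypothetical; the recursion shown is one straightforward way to obtain the described result.

```python
from typing import Any


def list_to_dict(data: Any) -> Any:
    """Recursively replace lists with dicts keyed by string indices."""
    if isinstance(data, dict):
        # Keep dict keys as they are and convert the values.
        return {key: list_to_dict(value) for key, value in data.items()}
    if isinstance(data, list):
        # "0", "1", ... keep the original order of the list items.
        return {str(i): list_to_dict(item) for i, item in enumerate(data)}
    # Primitives (str, int, float, bool, None) are returned unchanged.
    return data


print(list_to_dict({"tags": ["x", {"y": [1, 2]}]}))
# {'tags': {'0': 'x', '1': {'y': {'0': 1, '1': 2}}}}
```

After this step every container in the data is a dict, so the splitter only needs to handle one container type.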
_json_split(self, data, current_path: list[str] | None, chunks: list[dict] | None) -> list[dict]
Recursively splits a JSON object into chunks of limited size, preserving the JSON structure.
Parameters:
data (dict or primitive): JSON data to split.
current_path (list[str] or None): Current nested path within the JSON.
chunks (list[dict] or None): Accumulating list of chunks.
Returns:
list[dict]: List of JSON chunks (dictionaries).
Algorithm Details:
Iterates over keys in dictionaries.
Measures size of adding each key-value pair to current chunk.
If adding the pair would exceed max_chunk_size, starts a new chunk, provided the current chunk has already reached min_chunk_size.
Recurses into nested values.
For primitives, sets value at the current path in the last chunk.
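The steps above can be sketched as a standalone recursive function. This is a simplified illustration modeled on the description, not the parser's actual code: split_sketch, json_size, and set_nested are hypothetical names, the size limits are passed in directly, and the exact fit test (strictly less than the remaining space) is an assumption.

```python
from __future__ import annotations

import json
from typing import Any


def json_size(data: dict) -> int:
    """Number of characters in the serialized form of data."""
    return len(json.dumps(data))


def set_nested(d: dict, path: list[str], value: Any) -> None:
    """Set value at path, creating intermediate dicts as needed."""
    for key in path[:-1]:
        d = d.setdefault(key, {})
    d[path[-1]] = value


def split_sketch(data: Any, max_size: int, min_size: int,
                 path: list[str] | None = None,
                 chunks: list[dict] | None = None) -> list[dict]:
    """Split a dict into chunks; the top-level input is expected to be a dict."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict):
        for key, value in data.items():
            new_path = path + [key]
            current = json_size(chunks[-1])
            needed = json_size({key: value})
            if needed < max_size - current:
                # The pair fits: place it at its full path in the current chunk.
                set_nested(chunks[-1], new_path, value)
            else:
                # Start a new chunk only if the current one is large enough.
                if current >= min_size:
                    chunks.append({})
                # Recurse so that an oversized value is split by its own keys.
                split_sketch(value, max_size, min_size, new_path, chunks)
    else:
        # Primitive: store it at the current path in the last chunk.
        set_nested(chunks[-1], path, data)
    return chunks


data = {"a": "x" * 30, "b": "y" * 30, "c": {"d": "z" * 30, "e": 1}}
for chunk in split_sketch(data, max_size=60, min_size=10):
    print(json_size(chunk), chunk)
```

Because values are always written at their full path, each chunk repeats the parent keys of the values it holds, which is how the nested structure survives the split. A single primitive larger than the limit is still placed whole, since it cannot be divided by keys.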
Important Implementation Details
Chunk size doubling: The max_chunk_size provided is internally doubled (self.max_chunk_size = max_chunk_size * 2) to allow for some buffer space during chunking.
List to dict conversion: Lists are converted to dicts with string indices to facilitate recursive splitting by keys. This preserves order and structure for splitting.
Preserving JSON structure: The splitting algorithm retains the nested structure of the input JSON, avoiding flattening or losing hierarchy.
JSON Lines format detection: The parser uses heuristics to detect JSONL format by sampling lines and checking validity.
Graceful error handling: JSON decode errors during parsing lines or overall content are caught and skipped, allowing partial processing.
Interaction with Other Components
Imports find_codec from the rag.nlp module, which presumably detects the character encoding of the input binary data.
Designed as a utility/parser component that can be used in data ingestion or NLP pipelines.
Can be invoked directly with binary data or via its methods to split JSON objects.
Likely used in conjunction with other system components that process or index JSON data in chunks.
Usage Example
from json_parser import RAGFlowJsonParser

parser = RAGFlowJsonParser(max_chunk_size=1000)
with open("large_data.json", "rb") as f:
    chunks = parser(f.read())
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx}: {chunk}")
Mermaid Class Diagram
classDiagram
class RAGFlowJsonParser {
-int max_chunk_size
-int min_chunk_size
+__init__(max_chunk_size: int = 2000, min_chunk_size: int | None = None)
+__call__(binary) list~str~
+is_jsonl_format(txt: str, sample_limit: int = 10, threshold: float = 0.8) bool
+split_json(json_data, convert_lists: bool = False) list~dict~
+split_text(json_data: dict, convert_lists: bool = False, ensure_ascii: bool = True) list~str~
-_parse_json(content: str) list~str~
-_parse_jsonl(content: str) list~str~
-_json_split(data, current_path: list~str~ | None, chunks: list~dict~ | None) list~dict~
-_list_to_dict_preprocessing(data: Any) Any
-_json_size(data: dict) int$
-_set_nested_dict(d: dict, path: list~str~, value: Any) None$
-_is_valid_json(line: str) bool
}