json_parser.py


Overview

The json_parser.py file provides functionality to parse, process, and split JSON and JSON Lines (JSONL) formatted data into manageable chunks. It is primarily focused on handling potentially large JSON objects by splitting them into smaller JSON fragments of configurable maximum and minimum sizes. This is useful in scenarios such as Natural Language Processing (NLP) pipelines, data ingestion, or any system that requires chunking large JSON data for efficient processing or storage.

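The core component in this file is the RAGFlowJsonParser class, which offers methods to detect whether input is JSON or JSONL, parse either format, and split the parsed data into size-limited chunks (as dictionaries or as JSON strings).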

This parser is adapted from LangChain's JSON text splitter, with modifications tailored to the InfiniFlow project.


Classes and Methods

class RAGFlowJsonParser

Parser class for handling JSON and JSONL input, splitting large JSON objects into smaller chunks while preserving the nested structure.


__init__(self, max_chunk_size: int = 2000, min_chunk_size: int | None = None)

Initializes the parser with chunk size limits.


__call__(self, binary) -> list[str]

Parse input binary data by detecting encoding and format, and split into JSON chunks.
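
The decoding step can be illustrated with a minimal sketch. This documentation does not say how the parser detects the encoding, so the helper below is hypothetical: the name decode_bytes and the "try UTF-8, then fall back to Latin-1" strategy are assumptions for illustration only.

```python
def decode_bytes(binary: bytes) -> str:
    """Hypothetical helper: decode raw bytes to text before JSON parsing.

    Assumption: UTF-8 is tried first; Latin-1 is the fallback because it
    maps every byte to a character and therefore never raises.
    """
    try:
        return binary.decode("utf-8")
    except UnicodeDecodeError:
        return binary.decode("latin-1")


text = decode_bytes('{"name": "café"}'.encode("utf-8"))
print(text)
```

Once the text is decoded, format detection (JSON versus JSONL) decides which of the two parsing methods handles it.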


is_jsonl_format(self, txt: str, sample_limit: int = 10, threshold: float = 0.8) -> bool

Detects whether the input text is in JSON Lines (JSONL) format.
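
One plausible reading of the sample_limit and threshold parameters is sketched below: sample the first sample_limit non-empty lines and report JSONL when the fraction that parse as standalone JSON reaches threshold. The function name looks_like_jsonl and the early exit for text that parses as one JSON document are assumptions, not a description of the actual method body.

```python
import json


def looks_like_jsonl(txt: str, sample_limit: int = 10, threshold: float = 0.8) -> bool:
    """Heuristic sketch: True when most sampled lines are standalone JSON."""
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if not lines:
        return False

    # Assumption: text that parses as a single JSON document is plain JSON,
    # even if it happens to fit on one line.
    try:
        json.loads(txt)
        return False
    except json.JSONDecodeError:
        pass

    sample = lines[:sample_limit]
    valid = 0
    for ln in sample:
        try:
            json.loads(ln)
            valid += 1
        except json.JSONDecodeError:
            pass  # this line is not valid JSON on its own
    return valid / len(sample) >= threshold
```

With the default threshold of 0.8, up to two of ten sampled lines may be malformed and the input is still treated as JSONL.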


_is_valid_json(self, line: str) -> bool

Helper to check if a single line is valid JSON.


_parse_json(self, content: str) -> list[str]

Parse a JSON string and split into chunks.


_parse_jsonl(self, content: str) -> list[str]

Parse JSON Lines (JSONL) formatted string and split into chunks.
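
As a rough sketch of the line-by-line part of this step, the hypothetical function below loads each non-empty line as its own JSON value. Skipping malformed lines rather than raising is an assumption; the real method may behave differently, and it would additionally pass each record through the splitter.

```python
import json
from typing import Any


def parse_jsonl_lines(content: str) -> list[Any]:
    """Hypothetical sketch: load each non-empty line as one JSON record."""
    records = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # assumption: malformed lines are skipped
    return records


records = parse_jsonl_lines('{"a": 1}\n\nbroken\n{"b": 2}')
print(records)
```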


split_json(self, json_data, convert_lists: bool = False) -> list[dict]

Splits a JSON object into smaller JSON dictionaries (chunks).


split_text(self, json_data: dict[str, Any], convert_lists: bool = False, ensure_ascii: bool = True) -> list[str]

Splits JSON data into JSON formatted strings.
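
The effect of ensure_ascii on the serialized chunks can be shown with the standard library alone. The chunks list below is made-up sample data standing in for the output of split_json; it is not produced by the parser.

```python
import json

# Made-up chunks standing in for the dictionaries returned by split_json.
chunks = [{"name": "café"}, {"city": "東京"}]

# ensure_ascii=True escapes every non-ASCII character as \uXXXX.
escaped = [json.dumps(c, ensure_ascii=True) for c in chunks]

# ensure_ascii=False keeps the original characters in the output string.
readable = [json.dumps(c, ensure_ascii=False) for c in chunks]

print(escaped[0])
print(readable[0])
```

Escaped output is longer than the readable form, which matters if downstream size limits are counted on the serialized string.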


_json_size(data: dict) -> int (staticmethod)

Calculate the serialized size of a JSON object.
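
A minimal sketch of such a size measure, assuming that "size" means the character count of the json.dumps output (the documentation does not state whether characters or bytes are counted):

```python
import json


def json_size(data: dict) -> int:
    """Sketch: size of a JSON object as the length of its serialized string."""
    return len(json.dumps(data))


print(json_size({"a": 1}))  # the serialized form is '{"a": 1}'
```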


_set_nested_dict(d: dict, path: list[str], value: Any) -> None (staticmethod)

Set a value in a nested dictionary at the given path, creating intermediate dictionaries as needed.
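
The behaviour described above can be sketched as follows. This is an illustrative standalone function, not the class's own code, and it assumes path is non-empty.

```python
from typing import Any


def set_nested_dict(d: dict, path: list[str], value: Any) -> None:
    """Sketch: walk `path`, creating dictionaries as needed, then assign."""
    for key in path[:-1]:
        # setdefault returns the existing child or inserts a new empty dict.
        d = d.setdefault(key, {})
    d[path[-1]] = value


doc: dict = {}
set_nested_dict(doc, ["a", "b", "c"], 1)
set_nested_dict(doc, ["a", "x"], 2)
print(doc)
```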


_list_to_dict_preprocessing(self, data: Any) -> Any

Recursively converts lists in JSON data to dictionaries keyed by string indices.
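
A short sketch of this conversion, written as a standalone function for illustration: each list becomes a dictionary whose keys are the element positions as strings, so that lists can be split the same way as nested objects.

```python
from typing import Any


def list_to_dict(data: Any) -> Any:
    """Sketch: recursively replace lists with index-keyed dictionaries."""
    if isinstance(data, dict):
        return {k: list_to_dict(v) for k, v in data.items()}
    if isinstance(data, list):
        return {str(i): list_to_dict(item) for i, item in enumerate(data)}
    return data  # scalars are returned unchanged


converted = list_to_dict({"tags": ["x", {"y": [1, 2]}]})
print(converted)
```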


_json_split(self, data, current_path: list[str] | None, chunks: list[dict] | None) -> list[dict]

Recursively splits a JSON object into chunks of limited size, preserving the JSON structure.
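
The recursion can be sketched in the style of LangChain's JSON splitter, from which this parser is adapted. The version below is a standalone approximation under stated assumptions: sizes are character counts of the json.dumps output, the top-level input is a dictionary, and the helper names are illustrative. The actual method may differ in detail.

```python
import json


def json_size(data: dict) -> int:
    return len(json.dumps(data))


def set_nested(d: dict, path: list, value) -> None:
    for key in path[:-1]:
        d = d.setdefault(key, {})
    d[path[-1]] = value


def json_split(data, max_size: int, min_size: int, path=None, chunks=None) -> list:
    """Sketch: greedily fill the last chunk, opening a new one when full."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict):
        for key, value in data.items():
            new_path = path + [key]
            used = json_size(chunks[-1])
            needed = json_size({key: value})
            if needed < max_size - used:
                # The whole subtree fits: store it under its full path.
                set_nested(chunks[-1], new_path, value)
            else:
                # Start a new chunk only if the current one is large enough.
                if used >= min_size:
                    chunks.append({})
                json_split(value, max_size, min_size, new_path, chunks)
    else:
        # A leaf that did not fit elsewhere goes into the current chunk.
        set_nested(chunks[-1], path, data)
    return chunks


data = {"a": {"x": "1111111111", "y": "2222222222"}, "b": "3333333333"}
result = json_split(data, max_size=30, min_size=10)
print(result)
```

Because every value is stored under its full key path, each chunk is itself a valid JSON object that keeps the ancestors of the values it contains. A single leaf larger than the maximum size is still emitted whole in this sketch.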


Important Implementation Details


Interaction with Other Components


Usage Example

from json_parser import RAGFlowJsonParser

parser = RAGFlowJsonParser(max_chunk_size=1000)

with open("large_data.json", "rb") as f:
    chunks = parser(f.read())

for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx}: {chunk}")

Mermaid Class Diagram

classDiagram
    class RAGFlowJsonParser {
        -int max_chunk_size
        -int min_chunk_size
        +__init__(max_chunk_size: int = 2000, min_chunk_size: int | None = None)
        +__call__(binary) list~str~
        +is_jsonl_format(txt: str, sample_limit: int = 10, threshold: float = 0.8) bool
        +split_json(json_data, convert_lists: bool = False) list~dict~
        +split_text(json_data: dict, convert_lists: bool = False, ensure_ascii: bool = True) list~str~
        -_parse_json(content: str) list~str~
        -_parse_jsonl(content: str) list~str~
        -_json_split(data, current_path: list~str~ | None, chunks: list~dict~ | None) list~dict~
        -_list_to_dict_preprocessing(data: Any) Any
        -_json_size(data: dict) int$
        -_set_nested_dict(d: dict, path: list~str~, value: Any) None$
        -_is_valid_json(line: str) bool
    }
