High-Performance JSON Parsing
Overview
This module provides the core JSON parsing and writing functionality. It is implemented primarily in Rust and tightly integrated with an embedded C library, **yyjson**. The goal is fast, memory-safe, standards-compliant JSON parsing and serialization: converting JSON byte streams into structured representations and back, while handling edge cases such as deep nesting, Unicode escapes, and extreme numeric values.
Purpose and Problem Addressed
JSON parsing is a fundamental operation for many applications, but traditional JSON parsers often suffer from performance bottlenecks and insufficient handling of edge cases. This module addresses the need for:
High speed: Parsing and writing JSON faster than standard libraries.
Memory safety: Preventing errors such as buffer overflows or invalid memory access.
Full JSON compliance: Correctly parsing and writing complex JSON features, including deeply nested structures, Unicode escapes, number edge cases, and non-standard extensions (like comments or trailing commas when enabled).
Extensibility: Allowing integration with Rust and Python layers and supporting custom memory allocators.
Core Components and Workflows
Embedded yyjson C Library
At the heart of parsing and writing, the **yyjson** C library (`include/yyjson/yyjson.c` and `yyjson.h`) provides:
Efficient memory allocation: Implements custom pool and dynamic allocators to minimize allocation overhead.
Finite State Machine (FSM) based parsing: Uses `goto` statements for an optimized state machine to parse JSON tokens swiftly.
Number parsing with IEEE-754 compliance: Implements a custom floating-point number reader and writer with precise rounding and support for special values like Infinity and `NaN`.
Unicode string reading and writing: Handles UTF-8 validation, escaped Unicode sequences, surrogate pairs, and invalid Unicode gracefully.
JSON Pointer, Patch, and Merge-Patch: Supports RFC 6901, RFC 6902, and RFC 7386 utilities for JSON document manipulation.
Support for both immutable and mutable JSON document/value representations: Allows efficient copying and mutation of JSON trees.
The `yyjson_read_opts()` function is the main entry point for parsing JSON data from a byte buffer with configurable allocator and options.
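As an illustration of the JSON Pointer support listed above, the following Rust sketch splits an RFC 6901 pointer into reference tokens and applies the `~1` and `~0` escapes. The function names `pointer_tokens` and `unescape_token` are hypothetical and are not part of yyjson or of this module; the sketch shows the pointer syntax only, not how yyjson resolves pointers internally.

```rust
/// Split an RFC 6901 JSON Pointer into unescaped reference tokens.
/// Returns `None` if a non-empty pointer does not start with '/'
/// or if it contains an invalid `~` escape.
/// (Hypothetical helper for illustration; not a yyjson API.)
fn pointer_tokens(pointer: &str) -> Option<Vec<String>> {
    if pointer.is_empty() {
        // The empty pointer refers to the whole document.
        return Some(Vec::new());
    }
    let rest = pointer.strip_prefix('/')?;
    rest.split('/').map(unescape_token).collect()
}

/// Replace `~0` with '~' and `~1` with '/' in a single reference token.
fn unescape_token(raw: &str) -> Option<String> {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '~' {
            // A '~' must be followed by '0' or '1'.
            match chars.next()? {
                '0' => out.push('~'),
                '1' => out.push('/'),
                _ => return None,
            }
        } else {
            out.push(c);
        }
    }
    Some(out)
}
```

For example, `pointer_tokens("/a~1b/m~0n")` yields the tokens `a/b` and `m~n`, while the empty pointer yields no tokens because it refers to the whole document.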
Rust Deserialization Backend
The Rust backend (`src/deserialize/backend/mod.rs`) conditionally selects between Rust-native JSON parsing and the embedded `yyjson` parser based on features enabled during compilation. When the `yyjson` feature is enabled, it exposes the `deserialize` functionality implemented on top of the embedded `yyjson` C library.
#[cfg(feature = "yyjson")]
mod yyjson;
#[cfg(feature = "yyjson")]
pub(crate) use yyjson::deserialize;
This design allows the Rust layer to leverage the high-performance C parsing engine while seamlessly integrating with Rust's memory safety and ownership models.
Key Functionalities and Workflows
JSON Parsing Process
Input Data Preparation: JSON data is copied into a buffer padded with extra bytes to prevent buffer overruns during parsing.
Parsing Entry Point: The `yyjson_read_opts()` function receives the data buffer, length, allocator, and read options.
Skipping Whitespace and Comments: The parser efficiently skips spaces and optionally C-style comments if enabled.
Finite State Machine Parsing: Using `goto`-based FSM, the parser recognizes JSON tokens, including literals (`true`, `false`, `null`), numbers, strings, arrays, and objects.
Number Parsing: The parser reads numbers with a custom algorithm supporting 64-bit integers and IEEE-754 doubles, handling rounding, large numbers, and special floating-point values with high accuracy and speed.
String Parsing: Strings are parsed with UTF-8 validation, handling escape sequences and surrogate pairs. Invalid Unicode can be optionally accepted.
Memory Management: Parsed JSON values and strings are stored in a contiguous memory pool managed by custom allocators to reduce fragmentation and overhead.
Building JSON Document: The parser constructs an immutable JSON document structure (`yyjson_doc`) with references to parsed values (`yyjson_val`).
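The first steps above (copying the input into a padded buffer, then skipping whitespace and optional comments) can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the yyjson implementation: the `PADDING` value and the names `padded_copy` and `skip_ws_and_comments` are hypothetical. The sketch shows why padding helps: the trailing zero bytes act as a sentinel, so the scanning loops stop on their own and may look one byte ahead without a separate length check.

```rust
/// Number of zero bytes appended after the input.
/// Hypothetical value for this sketch; not yyjson's YYJSON_PADDING_SIZE.
const PADDING: usize = 4;

/// Copy `input` into a new buffer followed by `PADDING` zero bytes.
fn padded_copy(input: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(input.len() + PADDING);
    buf.extend_from_slice(input);
    buf.resize(input.len() + PADDING, 0);
    buf
}

/// Advance past JSON whitespace and, if `allow_comments` is set, past
/// `// line` and `/* block */` comments. `buf` must come from `padded_copy`.
/// Returns the new position, or `None` if a block comment is unterminated
/// (a zero byte inside a block comment is treated as end of input).
fn skip_ws_and_comments(buf: &[u8], mut pos: usize, allow_comments: bool) -> Option<usize> {
    loop {
        // The zero padding is not whitespace, so this loop always stops.
        while matches!(buf[pos], b' ' | b'\t' | b'\n' | b'\r') {
            pos += 1;
        }
        if !allow_comments || buf[pos] != b'/' {
            return Some(pos);
        }
        // Looking one byte ahead is in bounds thanks to the padding.
        match buf[pos + 1] {
            b'/' => {
                pos += 2;
                // A line comment ends at a newline or at the padding.
                while !matches!(buf[pos], b'\n' | 0) {
                    pos += 1;
                }
            }
            b'*' => {
                pos += 2;
                loop {
                    match (buf[pos], buf[pos + 1]) {
                        (0, _) => return None,
                        (b'*', b'/') => {
                            pos += 2;
                            break;
                        }
                        _ => pos += 1,
                    }
                }
            }
            // A lone '/' is not a comment; leave it for the caller to reject.
            _ => return Some(pos),
        }
    }
}
```

A caller would run `padded_copy` once per document and then call `skip_ws_and_comments` wherever the grammar allows whitespace, for instance before inspecting the root token.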
JSON Writing Process
The module also supports serialization of JSON values into UTF-8 encoded JSON strings with options for pretty-printing, Unicode escaping, and handling special floating-point values.
Number Writing: Uses a custom floating-point to decimal conversion algorithm (Schubfach) to produce shortest decimal representations with correct rounding.
String Writing: Supports efficient copying of ASCII and UTF-8 characters with optional escaping of Unicode and special characters.
Memory Allocation: Output buffers are managed with allocators, and memory is dynamically resized as needed.
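The string-writing step can be illustrated with a short Rust sketch that escapes a string into a growable output buffer, with an optional mode that writes non-ASCII characters as `\uXXXX` escapes. The function name `write_json_string` and the `escape_unicode` parameter are assumptions made for this example and do not mirror yyjson's writer flags or its optimized copy loops.

```rust
use std::fmt::Write;

/// Append `s` to `out` as a JSON string literal.
/// When `escape_unicode` is true, non-ASCII characters are written as
/// `\uXXXX` escapes, using a surrogate pair for code points above U+FFFF.
/// (Hypothetical helper for illustration; not a yyjson API.)
fn write_json_string(out: &mut String, s: &str, escape_unicode: bool) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            '\u{08}' => out.push_str("\\b"),
            '\u{0C}' => out.push_str("\\f"),
            // Remaining control characters must be escaped numerically.
            c if (c as u32) < 0x20 => {
                write!(out, "\\u{:04X}", c as u32).unwrap();
            }
            c if escape_unicode && !c.is_ascii() => {
                // encode_utf16 yields one unit, or two for a surrogate pair.
                let mut units = [0u16; 2];
                for unit in c.encode_utf16(&mut units).iter() {
                    write!(out, "\\u{:04X}", unit).unwrap();
                }
            }
            c => out.push(c),
        }
    }
    out.push('"');
}
```

The `String` output grows on demand, which loosely mirrors the dynamically resized output buffer described above.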
Interaction with Other System Parts
The Rust deserialization backend (`src/deserialize/backend/mod.rs`) exports the high-performance JSON parsing functions to other Rust modules.
The Python integration layer calls into these Rust functions via FFI bindings, enabling Python users to parse JSON data using this efficient core.
Benchmarking scripts invoke these parsing functions to measure performance against other JSON libraries.
The custom allocators implemented in `yyjson.c` interact with Rust and Python memory management layers to ensure efficient memory usage.
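To make the idea of a pool allocator concrete, here is a deliberately simplified Rust sketch of a bump allocator over one contiguous block. It is an assumption-laden illustration, not yyjson's pool allocator: the `Pool` type and its methods are hypothetical, offsets are aligned relative to the start of the block rather than to absolute addresses, and individual allocations cannot be freed.

```rust
/// A fixed-capacity bump allocator over one contiguous block.
/// (Hypothetical type for illustration; not yyjson's allocator.)
struct Pool {
    buf: Vec<u8>,
    used: usize,
}

impl Pool {
    fn with_capacity(cap: usize) -> Self {
        Pool { buf: vec![0; cap], used: 0 }
    }

    /// Reserve `size` bytes at an offset that is a multiple of `align`
    /// (a power of two). Returns the offset, or `None` if the pool is full.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round the current offset up to the requested alignment.
        let start = self.used.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.used = end;
        Some(start)
    }

    /// Copy `bytes` into the pool and return the offset of the copy.
    fn store(&mut self, bytes: &[u8]) -> Option<usize> {
        let at = self.alloc(bytes.len(), 1)?;
        self.buf[at..at + bytes.len()].copy_from_slice(bytes);
        Some(at)
    }

    /// Release everything at once, as when a parsed document is dropped.
    fn reset(&mut self) {
        self.used = 0;
    }
}
```

Because every allocation is just an offset bump, allocation cost is constant and all memory for one document is released in a single step, which is the property that makes pool-style allocation attractive for single-document lifetimes.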
Important Concepts and Design Patterns
Conditional Compilation and Feature Flags: The Rust backend switches between native Rust JSON parsing and embedded `yyjson` C parsing based on compile-time features.
Custom Memory Allocators: Pool and dynamic allocators minimize memory overhead and fragmentation, tailored for single JSON document lifetimes.
Finite State Machine (FSM) Parsing: Parsing code uses explicit `goto` statements to implement a state machine, reducing function call overhead and improving branch prediction.
IEEE-754 Floating-Point Handling: Implements precise floating-point parsing and formatting with correct rounding, using advanced algorithms and lookup tables.
UTF-8 Validation and Escaping: Efficient handling of UTF-8 sequences and escape characters ensures compliance and security.
Immutable and Mutable JSON Models: The library supports immutable documents for safe reading and mutable documents for building or modifying JSON structures.
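The handling of escaped Unicode and surrogate pairs mentioned above can be sketched in Rust. The functions `decode_unicode_escape` and `read_hex4` are hypothetical names chosen for this example. Unlike the library's optional lenient mode, this sketch always rejects unpaired surrogates.

```rust
/// Decode one `\uXXXX` escape, or a surrogate pair made of two escapes,
/// at the start of `s`. Returns the decoded character and the number of
/// bytes consumed, or `None` if the escape is malformed or a surrogate
/// is unpaired. (Hypothetical helper for illustration; not a yyjson API.)
fn decode_unicode_escape(s: &[u8]) -> Option<(char, usize)> {
    let hi = read_hex4(s)?;
    match hi {
        // High surrogate: a low surrogate escape must follow immediately.
        0xD800..=0xDBFF => {
            let lo = read_hex4(s.get(6..)?)?;
            if !(0xDC00..=0xDFFF).contains(&lo) {
                return None;
            }
            let cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            Some((char::from_u32(cp)?, 12))
        }
        // Low surrogate without a preceding high surrogate.
        0xDC00..=0xDFFF => None,
        _ => Some((char::from_u32(hi)?, 6)),
    }
}

/// Parse `\uXXXX` at the start of `s` into its 16-bit value.
fn read_hex4(s: &[u8]) -> Option<u32> {
    if s.len() < 6 || s[0] != b'\\' || s[1] != b'u' {
        return None;
    }
    let mut v = 0u32;
    for &b in &s[2..6] {
        v = v * 16 + (b as char).to_digit(16)?;
    }
    Some(v)
}
```

A string parser would call such a routine whenever it meets a backslash followed by `u`, then append the returned character to the output as UTF-8.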
Code Illustrations
Conditional Backend Selection in Rust
#[cfg(not(feature = "yyjson"))]
mod json;
#[cfg(feature = "yyjson")]
mod yyjson;
#[cfg(feature = "yyjson")]
pub(crate) use yyjson::deserialize;
#[cfg(not(feature = "yyjson"))]
pub(crate) use json::deserialize;
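This snippet shows how the project compiles in the embedded `yyjson` parser when the `yyjson` feature is enabled and falls back to the `json` module otherwise. Either way, the rest of the crate sees a single `deserialize` function.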
JSON Parsing Main Entry Function (C)
Excerpt from `yyjson_read_opts()` in `yyjson.c`:
yyjson_doc *yyjson_read_opts(char *dat,
                             usize len,
                             const yyjson_alc *alc_ptr,
                             yyjson_read_err *err) {
    // ...
    hdr = (u8 *)alc.malloc(alc.ctx, len + YYJSON_PADDING_SIZE);
    memcpy(hdr, dat, len);
    memset(end, 0, YYJSON_PADDING_SIZE);
    if (likely(char_is_container(*cur))) {
        if (char_is_space(cur[1]) && char_is_space(cur[2])) {
            doc = read_root_pretty(hdr, cur, end, alc, err);
        } else {
            doc = read_root_minify(hdr, cur, end, alc, err);
        }
    } else {
        doc = read_root_single(hdr, cur, end, alc, err);
    }
    // ...
    return doc;
}
This illustrates how the parser chooses different parsing strategies based on input content and formatting.
Finite State Machine Parsing (Excerpt)
arr_val_begin:
    if (*cur == '{') {
        cur++;
        goto obj_begin;
    }
    if (*cur == '[') {
        cur++;
        goto arr_begin;
    }
    if (char_is_number(*cur)) {
        val_incr();
        ctn_len++;
        if (likely(read_number(&cur, val, &msg))) goto arr_val_end;
        goto fail_number;
    }
    // ... other branches for strings, literals, whitespace, errors
This highlights the FSM approach with explicit `goto` targets for parsing arrays and objects.
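For readers more familiar with Rust, the same state-machine idea can be sketched without `goto`, which Rust does not have: each label becomes an enum variant and each jump becomes an assignment to the current state inside a loop. The example below is a toy under stated assumptions, not a port of yyjson: the `State` enum and `count_numbers` function are hypothetical, and the grammar is limited to whitespace-free nested arrays of digit runs.

```rust
/// Parser states, loosely named after the labels in the C excerpt.
/// (Hypothetical names for illustration.)
#[derive(Clone, Copy)]
enum State {
    ArrBegin,    // just consumed '[': expect a value or ']'
    ArrValBegin, // just consumed ',': expect a value
    ArrValEnd,   // just finished a value: expect ',' or ']'
}

/// Count the digit runs in a whitespace-free nested array such as
/// `[1,[22,3],[]]`. Returns `None` on any syntax error.
fn count_numbers(input: &[u8]) -> Option<usize> {
    if input.first() != Some(&b'[') {
        return None;
    }
    let mut pos = 1;
    let mut depth = 1usize;
    let mut count = 0usize;
    let mut state = State::ArrBegin;
    loop {
        // Running out of input before the root array closes is an error.
        let byte = *input.get(pos)?;
        match (state, byte) {
            (State::ArrBegin | State::ArrValBegin, b'[') => {
                pos += 1;
                depth += 1;
                state = State::ArrBegin;
            }
            (State::ArrBegin | State::ArrValBegin, b'0'..=b'9') => {
                while pos < input.len() && input[pos].is_ascii_digit() {
                    pos += 1;
                }
                count += 1;
                state = State::ArrValEnd;
            }
            (State::ArrBegin | State::ArrValEnd, b']') => {
                pos += 1;
                depth -= 1;
                if depth == 0 {
                    // The root array closed: accept only if nothing follows.
                    return if pos == input.len() { Some(count) } else { None };
                }
                state = State::ArrValEnd;
            }
            (State::ArrValEnd, b',') => {
                pos += 1;
                state = State::ArrValBegin;
            }
            _ => return None,
        }
    }
}
```

The `depth` counter plays the role of the container stack. A full parser would also track whether each enclosing container is an array or an object.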
Mermaid Diagram: JSON Parsing Flow
flowchart TD
Start[Start Parsing]
Allocate[Allocate Padded Buffer]
SkipWS[Skip Whitespace & Comments]
DetectRoot{Root Token}
ParseObj[Parse Object]
ParseArr[Parse Array]
ParseVal[Parse Single Value]
BuildDoc[Build JSON Document]
Finish[Return Document]
Start --> Allocate --> SkipWS --> DetectRoot
DetectRoot -->|"{ Object"| ParseObj
DetectRoot -->|"[ Array"| ParseArr
DetectRoot -->|Other| ParseVal
ParseObj --> BuildDoc
ParseArr --> BuildDoc
ParseVal --> BuildDoc
BuildDoc --> Finish
This diagram visualizes the high-level flow of JSON parsing in the embedded `yyjson` library, demonstrating the initial allocation, token detection, parsing branches, and final document construction.
This documentation page details the core concepts, workflows, and interactions of the High-Performance JSON Parsing module, emphasizing the embedded `yyjson` C library integration and the Rust backend's use of it for blazing-fast, memory-safe JSON parsing and serialization.