data.py

Overview

The `data.py` file is a small configuration and utility module in the benchmarking suite for JSON serialization and deserialization libraries. Its primary role is to:

Define and expose the JSON serialization/deserialization functions of the two main libraries under test: orjson (a Rust-backed JSON library) and Python’s built-in json module.
Provide a consistent interface by mapping library names to their respective dump and load function pairs.
List the JSON fixture filenames used across the benchmark tests to ensure standardized and repeatable performance measurements.

This setup facilitates easy parametrization of benchmarks over multiple libraries and datasets, enabling consistent comparisons between different JSON processing implementations.

Detailed Explanation of Components

Imports

from json import dumps as _json_dumps
from json import loads as json_loads

from orjson import dumps as orjson_dumps
from orjson import loads as orjson_loads

Imports serialization (dumps) and deserialization (loads) functions from both the standard Python json module and the high-performance orjson library.
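The standard library's dumps is imported under the private alias _json_dumps so that the name json_dumps stays free for the bytes-returning wrapper defined below. The other imports carry a library prefix (json_loads, orjson_dumps, orjson_loads) so the two libraries' functions do not collide.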

Function: `json_dumps`

def json_dumps(obj):
    return _json_dumps(obj).encode("utf-8")

Purpose: Wraps the standard library’s json.dumps to return a UTF-8 encoded bytes object instead of a str.
Parameters:
- obj: Any Python object serializable by the standard json module.
Returns: bytes — UTF-8 encoded JSON string.
Usage Example:

data = {"key": "value"}
serialized_bytes = json_dumps(data)
print(type(serialized_bytes))  # <class 'bytes'>

Implementation Detail:
Python’s built-in json.dumps returns a string, but the benchmarking suite expects bytes for uniformity (matching orjson.dumps which returns bytes). This wrapper ensures the Python json variant conforms to this interface.
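To make the interface difference concrete, here is a minimal sketch that uses only the standard library. It re-declares the wrapper locally so it runs without the benchmark package, and it relies on `json.dumps` defaulting to `ensure_ascii=True`, which means the encoded output contains only ASCII bytes.

```python
import json


def json_dumps(obj):
    # Same shape as the wrapper in data.py: serialize to str, then encode.
    return json.dumps(obj).encode("utf-8")


raw = json.dumps({"key": "é"})      # str; non-ASCII is escaped by default
wrapped = json_dumps({"key": "é"})  # bytes, the type orjson.dumps also returns

print(type(raw).__name__, type(wrapped).__name__)

# json.loads accepts bytes directly, so the wrapped output round-trips.
print(json.loads(wrapped) == {"key": "é"})
```

Note that matching return types does not imply matching bytes: orjson writes compact, unescaped UTF-8, while the standard library inserts spaces after separators and escapes non-ASCII characters by default. Both outputs are valid JSON for the same object.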

Variable: `libraries`

libraries = {
    "orjson": (orjson_dumps, orjson_loads),
    "json": (json_dumps, json_loads),
}

Purpose:
A dictionary mapping string identifiers of JSON libraries to tuples of their respective (dump_function, load_function).
Entries:
- "orjson": Uses orjson.dumps and orjson.loads directly.
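- "json": Uses the wrapped json_dumps (returns bytes) and json.loads. Since Python 3.6, json.loads accepts str, bytes, or bytearray input, so it can consume the wrapper's bytes output directly.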
Usage:
This dictionary is used for parametrized benchmarks, allowing iteration over libraries by name and uniform calling of serialization/deserialization functions.
Example Usage in Benchmarks:

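data = {"key": "value"}  # any JSON-serializable object

for lib_name, (dumper, loader) in libraries.items():
    serialized = dumper(data)          # bytes for both libraries
    deserialized = loader(serialized)  # back to a Python object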

Variable: `fixtures`

fixtures = [
    "canada.json",
    "citm_catalog.json",
    "github.json",
    "twitter.json",
]

Purpose:
List of JSON fixture filenames used in benchmarks.
Details:
These fixtures represent diverse and realistically sized JSON datasets, typically stored as compressed files (e.g., .xz) in the benchmark data directory. They cover various data structures and complexities to provide comprehensive performance testing scenarios.
Usage:
Benchmark scripts load these fixtures, deserialize them into Python objects, and perform serialization/deserialization tests using the libraries defined in libraries.
Example:

for fixture in fixtures:
    data = read_fixture_obj(fixture + ".xz")  # Utility function to load fixture
    for lib_name, (dumper, loader) in libraries.items():
        serialized = dumper(data)
        deserialized = loader(serialized)
        # Perform benchmarking or correctness checks
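The loop above depends on a loader that turns a compressed fixture into a Python object. The sketch below shows one way such a helper could work using the standard `lzma` module. It is a hypothetical stand-in, not the actual code in bench/util.py: the function here takes a full path, and the sample fixture is created on the fly so the example runs without the real data files.

```python
import json
import lzma
import os
import tempfile


def read_fixture_obj(path):
    # Hypothetical stand-in for the helper in bench/util.py:
    # decompress an .xz file and parse its contents as JSON.
    with lzma.open(path, "rb") as fileh:
        return json.loads(fileh.read())


# Build a tiny compressed fixture so the sketch runs without the real data.
sample = {"type": "FeatureCollection", "features": []}
tmp_dir = tempfile.mkdtemp()
fixture_path = os.path.join(tmp_dir, "sample.json.xz")
with lzma.open(fixture_path, "wb") as fileh:
    fileh.write(json.dumps(sample).encode("utf-8"))

loaded = read_fixture_obj(fixture_path)
print(loaded == sample)
```

Keeping decompression and parsing in a helper like this is what lets `fixtures` remain a plain list of filenames.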

Implementation Details and Algorithms

The file explicitly ensures interface consistency between JSON libraries by wrapping Python's json.dumps to return bytes, matching orjson.dumps.
The libraries dictionary abstracts library-specific implementations enabling flexible and scalable benchmarking code.
Fixtures are listed as simple strings, assuming external utility functions handle actual file I/O and decompression, keeping data.py lightweight and focused on data definitions.
No complex algorithms or classes exist here; the file acts as a foundational configuration and utility layer.

Interaction with Other System Components

Benchmarking Scripts:
Scripts such as bench/benchmark_dumps.py and bench/benchmark_loads.py import libraries and fixtures to parameterize test runs across multiple JSON libraries and JSON datasets.
Fixture Loading Utilities:
External utility modules (e.g., bench/util.py) provide functions like read_fixture and read_fixture_obj that read and decompress the JSON fixture files named in fixtures.
Benchmark Execution:
During benchmark runs, the selected library's dump and load functions are called as per the mapping in libraries to serialize and deserialize the fixture data, enabling performance and correctness measurements.
Test Parametrization:
The lists and mappings in this file support the use of pytest.mark.parametrize decorators, allowing systematic iteration over all combinations of fixtures and libraries.
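The combination of fixtures and libraries described above can be sketched as a cross product. With pytest, the same pairs would typically come from stacking two `pytest.mark.parametrize` decorators; this sketch uses `itertools.product` and `timeit` instead so that it runs with the standard library alone. The `fixture_objs` mapping and its contents are hypothetical placeholders for the decompressed fixture files, and only the `json` entry of `libraries` is reproduced.

```python
import itertools
import json
import timeit

# Stand-ins for the definitions in data.py; only the stdlib entry is included
# so the sketch runs without orjson installed.
libraries = {"json": (lambda obj: json.dumps(obj).encode("utf-8"), json.loads)}

# Hypothetical in-memory objects in place of the decompressed fixture files.
fixture_objs = {
    "small.json": {"id": 1, "tags": ["a", "b"]},
    "nested.json": {"rows": [{"x": i, "y": [i, i + 1]} for i in range(100)]},
}

results = {}
for (fixture, obj), (lib_name, (dumper, loader)) in itertools.product(
    fixture_objs.items(), libraries.items()
):
    # Correctness: a dump followed by a load must reproduce the object.
    assert loader(dumper(obj)) == obj
    # Performance: total seconds for 100 serialization calls.
    results[(fixture, lib_name)] = timeit.timeit(lambda: dumper(obj), number=100)

for key, seconds in sorted(results.items()):
    print(key, f"{seconds:.6f}s")
```

Adding a library or a fixture then only requires a new dictionary entry; the loop body stays unchanged.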

Mermaid Flowchart

This file contains no classes, only functions and variables, so a **flowchart** of the functional structure and relationships is used instead of a class diagram.

flowchart TD
    A["json_dumps(obj)"] -->|wraps| B["json.dumps(obj) -> str"]
    B --> C["encode('utf-8') -> bytes"]

    subgraph Libraries
        D["orjson_dumps(obj) -> bytes"]
        E["orjson_loads(bytes) -> obj"]
        F["json_dumps(obj) -> bytes"]
        G["json_loads(str or bytes) -> obj"]
    end

    H["libraries dict"]
    H --> D
    H --> E
    H --> F
    H --> G

    I["fixtures list"]
    I --> J["Fixture filenames: canada.json, citm_catalog.json, ..."]

Summary

data.py is a lightweight utility and configuration module for the benchmarking system.
It standardizes JSON serialization/deserialization interfaces for orjson and Python's json.
It defines the key datasets (fixtures) used across benchmark tests.
It enables scalable, parametrized benchmarking by abstracting library-specific details.
It interacts primarily with benchmark scripts and fixture loading utilities to support performance and correctness testing of JSON libraries.

Usage Example in Benchmark Code

import data
from util import read_fixture_obj  # assumed import path; the helper lives in bench/util.py

for fixture_name in data.fixtures:
    # Load and decompress the fixture into a Python object
    obj = read_fixture_obj(fixture_name + ".xz")

    for lib_name, (dumper, loader) in data.libraries.items():
        serialized = dumper(obj)
        deserialized = loader(serialized)
        assert deserialized == obj  # correctness check

This example shows how the `data.py` definitions enable concise and uniform benchmark implementations.
