pyindent

Overview

The `pyindent` script is a standalone Python benchmarking utility designed to measure and compare the serialization performance of JSON encoding libraries, specifically `orjson` and Python's built-in `json` module. It operates on compressed JSON fixture files, decompressing and loading them into Python objects, then timing how long each library takes to serialize the data in both compact and pretty-printed (indented) formats.

This script outputs a formatted comparison table showing the time taken (in milliseconds) per iteration for each library and option, along with relative performance ratios compared to `orjson`.

**Key features:**

- Reads JSON fixture files from a predefined data directory.
- Supports loading compressed `.xz` JSON files.
- Benchmarks serialization with both compact and pretty-print options.
- Validates serialization correctness by deserializing the output and comparing it with the original data.
- Uses CPU affinity to restrict execution to specific cores for consistent benchmarking.
- Outputs benchmark results in a human-readable GitHub-flavored markdown table.

Detailed Explanation of Components

Global Variables

- `dirname`: Absolute path to the data directory containing JSON fixture files (one directory level above the script directory).
- `filename`: The JSON fixture filename passed as the first command-line argument (without extension). Defaults to an empty string if not provided.
- `LIBRARIES`: Tuple of library names to benchmark: `("orjson", "json")`.
- `headers`: Tuple defining the table header columns for output display: `("Library", "compact (ms)", "pretty (ms)", "vs. orjson")`.
- `ITERATIONS`: Number of iterations to run each serialization test, calculated so that `orjson` compact serialization runs for roughly 2 seconds in total, which keeps the timings statistically stable.

Functions

`read_fixture_obj(filename) -> object`

Reads a JSON fixture file (possibly compressed) from the `data` directory and returns the deserialized Python object.

Parameters:
- filename (str): The name of the fixture file (e.g., "canada.json.xz").
Returns:
- Deserialized Python object representing the JSON content.
Details:
- If the file extension is .xz, decompresses using lzma before loading.
- Uses orjson.loads for fast JSON parsing.

Example Usage:

data = read_fixture_obj("canada.json.xz")
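
The loading behavior described above can be sketched as follows. This is a minimal illustration, not the script's actual source: it takes the data directory as an explicit `dirname` parameter (an assumption, since the script uses a module-level global) and swaps in the standard-library `json.loads` for `orjson.loads` so that it runs without third-party packages.

```python
import json
import lzma
from pathlib import Path


def read_fixture_obj(dirname, filename):
    """Load a JSON fixture, decompressing it first if it ends in .xz."""
    path = Path(dirname, filename)
    if path.suffix == ".xz":
        # Fixture is xz-compressed: decompress the raw bytes in memory.
        contents = lzma.decompress(path.read_bytes())
    else:
        contents = path.read_bytes()
    # The script is described as using orjson.loads here; json.loads is a
    # stand-in chosen only to keep this sketch dependency-free.
    return json.loads(contents)
```

Checking only the final suffix means a file such as `canada.json.xz` is decompressed, while a plain `canada.json` is read directly.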

`per_iter_latency(val) -> float | None`

Converts a total elapsed time value into a per-iteration latency in milliseconds.

Parameters:
- val (float or None): Total time in seconds for all iterations.
Returns:
- Per-iteration latency in milliseconds ((val * 1000) / ITERATIONS).
- Returns None if input is None.

Example Usage:

total_time = 1.5  # seconds for all iterations
latency_ms = per_iter_latency(total_time)  # e.g., 1.5 * 1000 / ITERATIONS
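
The conversion itself is a single expression. The sketch below passes the iteration count as an explicit `iterations` argument, which is an assumption made for illustration; the script is described as reading the global `ITERATIONS` instead.

```python
def per_iter_latency(val, iterations):
    """Convert total elapsed seconds into milliseconds per iteration."""
    if val is None:
        # No measurement available: propagate None rather than fail.
        return None
    return (val * 1000) / iterations


# 1.5 seconds spread over 100 iterations
latency_ms = per_iter_latency(1.5, 100)
```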

`test_correctness(serialized: bytes) -> bool`

Checks if a serialized JSON byte string correctly represents the original `data` object by deserializing and comparing.

Parameters:
- serialized (bytes): Serialized JSON bytes.
Returns:
- True if deserialized output equals the original data object, else False.

Example Usage:

serialized = orjson.dumps(data)
is_correct = test_correctness(serialized)
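
A round-trip check of this kind might look like the sketch below. The name `is_round_trip_correct` and the explicit `data` parameter are hypothetical, and `json.loads` stands in for `orjson.loads` so the example runs with the standard library alone.

```python
import json


def is_round_trip_correct(serialized, data):
    """Return True if parsing the serialized bytes reproduces `data`."""
    # Whitespace differences (compact vs. indented) disappear once the
    # bytes are parsed back, so only the content is compared.
    return json.loads(serialized) == data


original = {"name": "fixture", "values": [1, 2.5, None, True]}
pretty = json.dumps(original, indent=2).encode("utf-8")
ok = is_round_trip_correct(pretty, original)
```

Because the comparison happens on parsed objects, compact and indented outputs of the same data both pass.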

Main Execution Flow

- Set CPU affinity: Limits the process to run on CPU cores 0 and 1 for more consistent timing.
- Load JSON fixture: Reads the JSON data object from the specified fixture file (expects `filename.json.xz`).
- Calculate output sizes: Measures the size (in KiB) of `orjson` serialized output in both compact and pretty formats.
- Determine iterations: Runs a quick `timeit` benchmark on `orjson.dumps(data)` to compute how many iterations to perform to achieve roughly 2 seconds total runtime.
- Benchmark loop: For each library in `LIBRARIES`:
  - Measures serialization time for compact and pretty-printed JSON.
  - Validates correctness by deserializing the pretty output.
  - Calculates per-iteration latency.
  - Computes a ratio comparing the library's pretty serialization time to `orjson`'s.
- Output results: Prints a GitHub markdown table with benchmark results including time in milliseconds and relative performance.
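
The calibration and timing steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the script's code: `calibrate_iterations` and `benchmark` are hypothetical helper names, the probe count of 20 is a guess, and only the standard-library `json` module is registered so the example runs anywhere. In the script as described, `orjson` would be the baseline entry.

```python
import json
from timeit import timeit


def calibrate_iterations(serialize, target_seconds=2.0, probe_runs=20):
    """Estimate how many calls fit into roughly target_seconds."""
    per_call = timeit(serialize, number=probe_runs) / probe_runs
    if per_call <= 0:
        # Timer too coarse to register the probe: fall back to the probe count.
        return probe_runs
    return max(1, int(target_seconds / per_call))


def benchmark(serializers, iterations, baseline):
    """Return rows of (name, compact ms, pretty ms, pretty ratio vs. baseline)."""
    timings = {}
    for name, (compact, pretty) in serializers.items():
        timings[name] = (
            timeit(compact, number=iterations) * 1000 / iterations,
            timeit(pretty, number=iterations) * 1000 / iterations,
        )
    base_pretty = timings[baseline][1]
    return [
        (name, compact_ms, pretty_ms, pretty_ms / base_pretty)
        for name, (compact_ms, pretty_ms) in timings.items()
    ]


data = {"items": list(range(1000)), "name": "fixture"}
serializers = {
    # Stand-in entry; encoding to bytes is an assumption made so that all
    # libraries would produce the same output type.
    "json": (
        lambda: json.dumps(data).encode("utf-8"),
        lambda: json.dumps(data, indent=2).encode("utf-8"),
    ),
}
# A short target keeps the demo fast; the document describes about 2 seconds.
iterations = calibrate_iterations(serializers["json"][0], target_seconds=0.05)
rows = benchmark(serializers, iterations, baseline="json")
```

Each row matches the column order of `headers`, and the baseline's own ratio is always 1.0 because its pretty time is divided by itself.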

Usage Example

Assuming the script is named `pyindent` and a JSON fixture file `example.json.xz` exists in the data directory, run:

./pyindent example

Sample output:

150KiB compact, 180KiB pretty, 100 iterations
orjson...
json...

| Library | compact (ms) | pretty (ms) | vs. orjson |
|---------|--------------|-------------|------------|
| orjson  | 15.23        | 18.45       | 1.0        |
| json    | 120.55       | 150.32      | 8.1        |

Important Implementation Details

- CPU Affinity: The script sets CPU affinity (`os.sched_setaffinity`) to cores `{0, 1}` to reduce noise from OS scheduling and improve reproducibility.
- Time Measurement: Uses Python's `timeit` function to call a serialization lambda `ITERATIONS` times, which averages out timing variability.
- Dynamic Iteration Count: The number of iterations is calculated from a short initial test so that the benchmark runs long enough (minimum ~2 seconds for `orjson` compact serialization).
- Correctness Validation: After serialization, re-parsing the output bytes and comparing the result with the original data guards against data corruption or serialization bugs.
- Compression Support: Supports `.xz` compressed JSON fixtures transparently via `lzma`.
- Output Formatting: Uses `tabulate` to produce clean GitHub-style markdown tables for easy reading and integration into reports or markdown documents.
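
For the CPU affinity point, a guarded variant is sketched below. The helper name `pin_to_cores` and its fallback behavior are assumptions added for portability; the script is described as calling `os.sched_setaffinity` directly. That function exists only on some platforms (notably Linux), so the sketch checks for it first.

```python
import os


def pin_to_cores(cores=(0, 1)):
    """Pin the current process to the given cores where the OS supports it.

    Returns the resulting affinity set, or None if pinning is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        # Platforms such as macOS and Windows do not expose this call.
        return None
    allowed = os.sched_getaffinity(0)
    # Use only requested cores the process may actually run on; if none
    # overlap, keep the current set unchanged.
    target = (set(cores) & allowed) or allowed
    try:
        os.sched_setaffinity(0, target)  # pid 0 means the calling process
    except OSError:
        return None
    return os.sched_getaffinity(0)


affinity = pin_to_cores()
```

Intersecting with the currently allowed set avoids an error in containers where core 0 or 1 is not available to the process.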

Interaction with Other System Parts

- Data Directory: Reads input JSON fixture files from the data directory located one level up relative to the script. These fixtures are shared with other benchmarking scripts and utilities in the project.
- `orjson` Library: Uses `orjson` for both parsing fixtures and benchmarking serialization, highlighting its performance advantages over the standard `json` module.
- `tabulate` Library: Used for formatting output tables.
- Benchmarking Ecosystem: The script complements other benchmark scripts in the project (e.g., `bench/benchmark_dumps.py`) by providing a quick, focused serialization benchmark with detailed timing and correctness checks.

Visual Diagram: Flowchart of `pyindent` Script Execution

flowchart TD
    A[Start: Parse command-line filename] --> B[Read compressed JSON fixture]
    B --> C[Decompress if .xz, then parse JSON]
    C --> D[Calculate output sizes using orjson.dumps]
    D --> E[Determine ITERATIONS via timeit on orjson.dumps]
    E --> F{"For each library in (orjson, json)"}
    F --> G[Time compact serialization]
    F --> H[Time pretty serialization]
    G --> I[Test correctness by deserializing pretty output]
    H --> I
    I --> J[Calculate per-iteration latencies]
    J --> K[Compute relative performance vs. orjson]
    K --> L[Collect results in table]
    L --> M[Print formatted markdown table]
    M --> N[End]

Summary

The `pyindent` script is a targeted benchmark tool for measuring JSON serialization speed and correctness between `orjson` and Python's built-in `json` library using compressed JSON fixtures. CPU pinning, dynamic iteration scaling, and round-trip correctness checks help keep its measurements reproducible, making it a useful utility within the project's broader benchmarking suite.
