pysort

Overview

The file **[pysort](/projects/287/67683)** is a Python script that benchmarks JSON serialization (encoding) in two libraries: Python's built-in `json` module and the Rust-backed `orjson` library. It times the serialization of a large JSON dataset with and without key sorting, reports the time per iteration, and expresses each library's sorted-key time as a ratio to that of `orjson`.

This script is primarily used for performance testing and reporting, generating a tabulated summary of serialization latencies in milliseconds, helping developers understand the speed differences and efficiency gains offered by `orjson` compared to the standard library.

Detailed Explanation

Global Constants and Variables

dirname
Type: str
Description: Directory path pointing to the data folder, located one level up from the script's directory. This folder contains JSON fixture files used for benchmarking.
headers
Type: Tuple[str, str, str, str]
Description: Column headers for the summary table output. Columns are:
- Library name
- Time taken to serialize unsorted JSON (in ms)
- Time taken to serialize sorted JSON (in ms)
- Performance ratio vs. orjson sorted serialization time
LIBRARIES
Type: Tuple[str, ...]
Description: Tuple of library names to benchmark. Currently supports "orjson" and "json".
ITERATIONS
Type: int
Description: Number of times each serialization operation is repeated to obtain stable timing results.
data
Type: dict (loaded JSON object)
Description: The JSON data loaded from the compressed fixture file "twitter.json.xz" used as the benchmark input.

Functions

`read_fixture_obj(filename)`

def read_fixture_obj(filename) -> dict:

Purpose:
Load and parse a JSON fixture file into a Python dictionary. Supports transparent decompression if the file is .xz compressed.
Parameters:
- filename (str): Name of the fixture file to read (e.g., "twitter.json.xz").
Returns:
- dict: A Python object parsed from the JSON contents of the file.
Implementation Details:
- Constructs a full path to the fixture file inside the data directory.
- If the file has an .xz suffix, reads it and decompresses the bytes using lzma.decompress.
- Otherwise, reads the raw bytes unchanged.
- Parses JSON content using orjson.loads for high-performance deserialization.
Usage Example:

data = read_fixture_obj("twitter.json.xz")
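The loading logic described above might look roughly like the following sketch. It rests on two assumptions that are not taken from the script: the data directory is passed in as an explicit `dirname` argument rather than read from a module-level variable, and the standard-library `json.loads` stands in for `orjson.loads` so the example runs without third-party packages. The `sample.json` fixtures are throwaway files created only for the demonstration.

```python
import json
import lzma
import tempfile
from pathlib import Path


def read_fixture_obj(filename, dirname):
    """Load a JSON fixture, decompressing it first if it ends in .xz."""
    path = Path(dirname, filename)
    raw = path.read_bytes()
    if path.suffix == ".xz":
        # lzma.decompress auto-detects the .xz container format.
        raw = lzma.decompress(raw)
    # Stand-in: the script described above parses with orjson.loads instead.
    return json.loads(raw)


# Demo with throwaway fixtures so the sketch runs without the real data folder.
payload = {"statuses": [{"id": 1, "text": "hello"}]}
encoded = json.dumps(payload).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample.json.xz").write_bytes(lzma.compress(encoded))
    Path(tmp, "sample.json").write_bytes(encoded)
    from_xz = read_fixture_obj("sample.json.xz", tmp)
    from_plain = read_fixture_obj("sample.json", tmp)
```

Both calls should yield the same object, since the only difference between the two paths is the decompression step.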

`per_iter_latency(val)`

def per_iter_latency(val: float | None) -> float | None:

Purpose:
Converts a total elapsed time for multiple iterations into a per-iteration latency in milliseconds.
Parameters:
- val (float or None): The total elapsed time in seconds for all iterations.
Returns:
- float or None: Time per iteration in milliseconds (ms), or None if input is None.
Implementation Details:
- Returns None unchanged if the input is None.
- Multiplies the total time by 1000 to convert seconds to milliseconds.
- Divides by the global ITERATIONS constant to get the per-iteration time.
Usage Example:

total_time = 0.5  # seconds for 500 iterations
latency_ms = per_iter_latency(total_time)  # returns 1.0 ms per iteration
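A minimal sketch of a function matching this description is shown below. The body is reconstructed from the prose above, not copied from the script, and `ITERATIONS = 500` is the value this document quotes.

```python
ITERATIONS = 500  # value quoted in this document


def per_iter_latency(val):
    """Convert total seconds over ITERATIONS runs to milliseconds per run."""
    if val is None:
        # Pass missing measurements through untouched.
        return None
    return (val * 1000) / ITERATIONS


latency_ms = per_iter_latency(0.5)   # 0.5 s total over 500 iterations
missing = per_iter_latency(None)     # no measurement available
```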

Main Benchmarking Logic

The script performs the following steps:

Set CPU Affinity:
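Restricts the process to CPU cores 0 and 1 using os.sched_setaffinity to reduce timing variability caused by CPU scheduling. This function is not available on every platform (for example, it is missing on macOS and Windows), so the script as described targets Linux.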
Load Benchmark Data:
Reads and decompresses the "twitter.json.xz" JSON fixture into a Python object data.
Benchmark Loop Over Libraries:
Iterates over each library in LIBRARIES and performs two benchmarks per library:
- Unsorted Serialization: Serialize data without sorting keys.
- Sorted Serialization: Serialize data with keys sorted.
Timing Using timeit.timeit:
Measures the total time taken to serialize the JSON data ITERATIONS times.
Latency Calculation:
Converts total times into per-iteration latencies (in milliseconds).
Performance Comparison:
Computes the ratio of each library's sorted serialization time relative to orjson's sorted serialization time.
Tabulated Output:
Uses tabulate to format and print the results as a Markdown-style table.
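The timing, latency, and ratio steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script's code: only the `json` path is timed, the tiny `data` dict is a hypothetical stand-in for the twitter fixture, `ITERATIONS` is reduced from 500 to keep the run short, and `time_json` and `ratio_vs_baseline` are names invented for this example.

```python
import json
from timeit import timeit

ITERATIONS = 50  # the document quotes 500; reduced here to keep the demo quick

# Hypothetical stand-in for the twitter fixture.
data = {"b": [1, 2, 3], "a": {"z": None, "y": "text"}}


def per_iter_latency(total_seconds):
    """Total seconds over ITERATIONS runs -> milliseconds per run."""
    return total_seconds * 1000 / ITERATIONS


def time_json():
    """Time unsorted and sorted serialization with the json module."""
    unsorted_s = timeit(
        lambda: json.dumps(data).encode("utf-8"), number=ITERATIONS
    )
    sorted_s = timeit(
        lambda: json.dumps(data, sort_keys=True).encode("utf-8"),
        number=ITERATIONS,
    )
    return per_iter_latency(unsorted_s), per_iter_latency(sorted_s)


def ratio_vs_baseline(sorted_ms, baseline_sorted_ms):
    """How many times slower a library is than the baseline (sorted keys)."""
    return sorted_ms / baseline_sorted_ms


unsorted_ms, sorted_ms = time_json()
# Using the illustrative figures from the example table in this document:
example_ratio = round(ratio_vs_baseline(17.89, 1.35), 1)
```

`timeit` returns the total elapsed seconds for all `number` calls, which is why the result is divided by `ITERATIONS` afterwards.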

Library-Specific Notes

json library:
Uses json.dumps followed by UTF-8 encoding (encode("utf-8")) to produce bytes, matching orjson.dumps output type.
orjson library:
Uses orjson.dumps. For sorted keys, passes the OPT_SORT_KEYS option.
Error Handling:
Raises NotImplementedError if an unsupported library name is specified.
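One way to express the per-library dispatch described above is a small factory that returns an unsorted and a sorted serializer for a given library name. The helper name `make_dumpers` is hypothetical. `orjson` is imported lazily so the sketch still runs where only the standard library is installed.

```python
import json


def make_dumpers(library):
    """Return (unsorted_dumps, sorted_dumps), both producing bytes."""
    if library == "json":
        return (
            lambda obj: json.dumps(obj).encode("utf-8"),
            lambda obj: json.dumps(obj, sort_keys=True).encode("utf-8"),
        )
    if library == "orjson":
        import orjson  # third-party; imported only when requested

        return (
            orjson.dumps,
            lambda obj: orjson.dumps(obj, option=orjson.OPT_SORT_KEYS),
        )
    raise NotImplementedError(library)


plain, with_sorted_keys = make_dumpers("json")
encoded = with_sorted_keys({"b": 1, "a": 2})

try:
    make_dumpers("unknown-lib")
    unsupported_raised = False
except NotImplementedError:
    unsupported_raised = True
```

Encoding the `json.dumps` result to UTF-8 keeps the comparison fair in one respect: both libraries are timed producing `bytes`.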

Example Output

| Library | unsorted (ms) | sorted (ms) | vs. orjson |
|---------|---------------|-------------|------------|
| orjson  | 1.23          | 1.35        | 1.0        |
| json    | 15.67         | 17.89       | 13.3       |
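For readers without `tabulate` installed, a table of this shape can be approximated with a few lines of standard-library Python. The `format_table` helper below is a hypothetical stand-in written for this example and is not how the script produces its output. The row values are the illustrative figures from the table above, not measurements.

```python
def format_table(headers, rows):
    """Render rows as a pipe-delimited, Markdown-style table."""
    cells = [[str(c) for c in headers]]
    cells += [[str(c) for c in row] for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(r[i]) for r in cells) for i in range(len(headers))]

    def line(row):
        padded = (c.ljust(w) for c, w in zip(row, widths))
        return "| " + " | ".join(padded) + " |"

    separator = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([line(cells[0]), separator] + [line(r) for r in cells[1:]])


headers = ("Library", "unsorted (ms)", "sorted (ms)", "vs. orjson")
rows = [
    ("orjson", 1.23, 1.35, 1.0),
    ("json", 15.67, 17.89, 13.3),
]
table_text = format_table(headers, rows)
print(table_text)
```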

Important Implementation Details

CPU Affinity:
By pinning the process to specific CPUs, the script reduces noise in timing results caused by CPU migration or interference from other system processes.
Compressed Fixture Loading:
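Storing the JSON fixture as .xz keeps it small on disk. The script decompresses it with lzma.decompress and then parses the result with orjson.loads; orjson itself performs no decompression. Loading happens once, before timing starts, so it does not affect the measured serialization latencies.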
Multiple Iterations for Timing:
Running serialization multiple times (ITERATIONS = 500) averages out transient system load fluctuations, so the reported timings are more stable than a single measurement would be.
Latency Normalization:
Reporting per-iteration latency in milliseconds makes the results independent of the iteration count and easier to read. Absolute values still depend on the hardware, which is why the table also reports a ratio relative to orjson.
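Because `os.sched_setaffinity` does not exist on every platform, a portable variant of the CPU pinning step might guard the call as sketched below. The `pin_to_cpus` helper and its fallback behaviour are assumptions made for this example; the script as described calls `os.sched_setaffinity` directly.

```python
import os


def pin_to_cpus(cpus):
    """Pin the current process to the given CPUs where the OS allows it.

    Returns the resulting affinity set, or None if pinning was skipped.
    """
    if not hasattr(os, "sched_setaffinity"):
        # Platforms such as macOS and Windows lack this call.
        return None
    # Only request CPUs the process is actually allowed to use.
    wanted = set(cpus) & os.sched_getaffinity(0)
    if not wanted:
        return None
    try:
        os.sched_setaffinity(0, wanted)  # pid 0 means the current process
    except OSError:
        return None
    return os.sched_getaffinity(0)


affinity = pin_to_cpus({0, 1})
```

Returning `None` instead of raising lets the benchmark continue unpinned, at the cost of noisier timings.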

Interactions with Other Parts of the System

Data Fixtures:
The script depends on the presence of compressed JSON fixtures in the data directory relative to the project root. These fixtures are shared resources used across various benchmarking scripts.
Benchmarking Framework:
Although this script is standalone, it complements other benchmarking scripts in the project (such as bench/benchmark_dumps.py) that perform more detailed and automated benchmarking using pytest and pytest-benchmark.
JSON Libraries:
The script compares orjson (the project's Rust-backed JSON library) against Python's standard json module, providing direct performance comparisons that validate orjson's efficiency claims.
Result Reporting:
Uses the third-party tabulate package to produce human-readable tables, which can be further consumed by other reporting or visualization tools.

Mermaid Diagram: Flowchart of Main Functions and Workflow

flowchart TD
    A["Start Script"] --> B["Set CPU affinity to cores 0 and 1"]
    B --> C["Load JSON fixture twitter.json.xz"]
    C --> D{"For each library in LIBRARIES"}
    D --> E1["If library == json"]
    D --> E2["If library == orjson"]
    E1 --> F1["Time json.dumps (unsorted and sorted)"]
    E2 --> F2["Time orjson.dumps (unsorted and sorted with OPT_SORT_KEYS)"]
    F1 --> G["Calculate per-iteration latency"]
    F2 --> G
    G --> H["Calculate performance ratio vs orjson"]
    H --> I["Append results to table"]
    I -->|next library| D
    D -->|all libraries done| J["Format table with tabulate"]
    J --> K["Print results"]
    K --> L["End Script"]

Summary

The [pysort](/projects/287/67683) script is a concise benchmarking tool that compares the JSON serialization speed of Python's built-in `json` library with that of `orjson`. It aims for consistent measurements through repeated iterations and CPU affinity pinning, and prints a table of results that supports decisions about which JSON serialization library to use in Python applications.