pynonstr

Overview

The `pynonstr` script is a standalone Python benchmarking utility focused on comparing the performance of JSON serialization across two libraries: the standard Python `json` module and the high-performance `orjson` library. It specifically tests serialization of large datasets containing mixed dictionary keys, including both integer and string keys, which is a common challenge for JSON serializers.

The script generates synthetic date-indexed data spanning 100 years (1920–2019), each year represented by a dictionary mapping timestamps (integer keys) and a string key `"other"` to integer values. It benchmarks serialization latency for:

- Dictionaries with string keys
- Dictionaries with integer keys (unordered)
- Dictionaries with integer keys sorted by key

The benchmarking results are then formatted into a human-readable table using the `tabulate` library.

This file is designed to run as a standalone program and does not define reusable classes or functions outside its main benchmarking flow.

Detailed Explanation of Key Components

Global Constants and Variables

data_as_obj
A list of 100 dictionaries, one per year, each holding 365 daily entries plus one string-keyed entry.
Each dictionary maps:
- Integer keys: UNIX timestamps (seconds since epoch) for 365 consecutive days starting on 1 January, each mapped to its day number (1 to 365).
- String key "other" mapped to 0.
The key/value pairs are shuffled before each dictionary is built, so insertion order is random.
data_as_str
A copy of data_as_obj in which every dictionary key is a string, produced by serializing data_as_obj with orjson.dumps and deserializing the result with orjson.loads.
This simulates the common JSON limitation where keys are strings.
headers
Tuple defining column headers for the benchmark output table:
("Library", "str keys (ms)", "int keys (ms)", "int keys sorted (ms)")
LIBRARIES
Tuple of library names to benchmark: ("orjson", "json")
ITERATIONS
Number of iterations to run timing benchmarks per test, set to 500.
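
The key normalization behind data_as_str can be seen on a tiny dictionary. This is a minimal sketch that uses the standard json module as a stand-in for the orjson round trip (an assumption made so the example has no third-party dependency); `sample_obj` and `sample_str` are hypothetical names, not variables from the script.

```python
import json

# Hypothetical miniature of one year's dictionary: integer keys plus "other".
sample_obj = {0: 1, 86400: 2, "other": 0}

# Round trip through JSON text: integer keys come back as strings.
sample_str = json.loads(json.dumps(sample_obj))

print(sample_str)                # {'0': 1, '86400': 2, 'other': 0}
print(sample_obj == sample_str)  # False: 0 and "0" are different keys
```

Because the two structures are no longer equal, the string-keyed copy is needed as the reference when deserialized output is checked.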

Affinity Setting

os.sched_setaffinity(os.getpid(), {0, 1})

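Pins the current process to cores 0 and 1 to reduce variability in timing measurements caused by OS scheduling. `os.sched_setaffinity` is only available on some Unix platforms (notably Linux), so the script will not run unmodified elsewhere.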
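
For environments where cores 0 and 1 may not be available, a guarded variant could look like the sketch below. `pin_to_cores` is a hypothetical helper, not part of the script; it assumes that falling back to the currently allowed cores is acceptable.

```python
import os


def pin_to_cores(preferred=frozenset({0, 1})):
    """Pin this process to the preferred cores when the platform allows it.

    Returns the resulting affinity set, or None where the API is missing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows
    allowed = os.sched_getaffinity(0)  # pid 0 means the calling process
    # Keep only cores we are allowed to use; fall back to the current set.
    target = (set(preferred) & allowed) or allowed
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


affinity = pin_to_cores()
print(affinity)
```

Intersecting with the allowed set avoids an `OSError` in containers whose CPU set excludes core 0 or 1.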

Data Generation Logic

for year in range(1920, 2020):
    start = datetime.date(year, 1, 1)
    array = [
        (int(mktime((start + datetime.timedelta(days=i)).timetuple())), i + 1)
        for i in range(365)
    ]
    array.append(("other", 0))
    random.shuffle(array)
    data_as_obj.append(dict(array))

- For each year from 1920 to 2019 (100 years total), generate 365 timestamp keys mapped to day counts (1 to 365).
- Append a string key "other" with value 0.
- Shuffle the list to randomize key order before converting to a dictionary.
- Store each year's dictionary in data_as_obj.

This simulates large JSON-like data structures with mixed key types and unordered keys.
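
One consequence of the fixed `range(365)` is that leap years stop one day short of 31 December. The sketch below checks this with plain date arithmetic; `last_generated_day` is a hypothetical helper introduced only for illustration.

```python
import calendar
import datetime


def last_generated_day(year, days=365):
    """Return the date of the final offset produced by range(days) from 1 January."""
    start = datetime.date(year, 1, 1)
    return start + datetime.timedelta(days=days - 1)


for year in (1920, 2019):
    print(year, calendar.isleap(year), last_generated_day(year))
# 1920 True 1920-12-30
# 2019 False 2019-12-31
```

This does not affect the benchmark, since every dictionary still has 366 entries, but it means the data is not a complete calendar for leap years.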

Serialization and Timing Utilities

`per_iter_latency(val)`

def per_iter_latency(val):
    if val is None:
        return None
    return (val * 1000) / ITERATIONS

Converts total elapsed time from seconds to average milliseconds per iteration.
Returns None if input is None.
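
As a worked example of the conversion, a total of 2.5 seconds over 500 iterations is 2.5 × 1000 / 500 = 5 ms per call. The function below is copied from the script; the input value is made up for illustration.

```python
ITERATIONS = 500  # same value as in the script


def per_iter_latency(val):
    # None marks a benchmark that was skipped, so it is passed through.
    if val is None:
        return None
    # seconds -> milliseconds, then averaged over the iteration count
    return (val * 1000) / ITERATIONS


print(per_iter_latency(2.5))   # 5.0
print(per_iter_latency(None))  # None
```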

`test_correctness(serialized)`

def test_correctness(serialized):
    return orjson.loads(serialized) == data_as_str

Verifies that the serialized JSON, when deserialized, matches the expected data_as_str.
Ensures correctness of serialization output (particularly for orjson).
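
The comparison works regardless of key order because Python dictionary equality ignores insertion order. A minimal sketch, using the standard json module in place of orjson.loads and hand-written sample payloads (both assumptions made for illustration):

```python
import json

expected = {"0": 1, "86400": 2, "other": 0}

# The same object serialized in two different key orders.
shuffled = b'{"other":0,"86400":2,"0":1}'
in_order = b'{"0":1,"86400":2,"other":0}'

print(json.loads(shuffled) == expected)  # True
print(json.loads(in_order) == expected)  # True
print(list(json.loads(shuffled)))        # ['other', '86400', '0']
```

The check therefore confirms that keys and values survive serialization, not that the sorted output is actually in sorted order.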

Benchmark Execution Loop

The script iterates over each library in `LIBRARIES`:

For json library:
- Serialize data_as_str (string keys) and data_as_obj (integer keys) using json.dumps, then encode to UTF-8 bytes.
- The sorted-keys benchmark for integer-keyed dictionaries is skipped (set to None) because json.dumps with sort_keys=True raises TypeError when a dictionary mixes integer and string keys.
- The correctness flag is set to False because the sorted-keys serialization cannot run.
For orjson library:
- Serialize data_as_str with orjson.dumps and the OPT_NON_STR_KEYS option (this dataset has only string keys, so the option is not required for it to succeed).
- Serialize data_as_obj with orjson.dumps and OPT_NON_STR_KEYS.
- Serialize sorted keys version with OPT_NON_STR_KEYS | OPT_SORT_KEYS.
- Validate correctness by deserializing the sorted keys output.
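
The orjson call used for the sorted case can be sketched on a small dictionary. `dumps_sorted` and `sample` are hypothetical names; the import is guarded because orjson is a third-party package, and the helper returns None when it is not installed.

```python
import json

try:
    import orjson  # third-party package
except ImportError:
    orjson = None


def dumps_sorted(obj):
    """Serialize with non-string keys allowed and keys sorted.

    Returns bytes, or None if orjson is not available.
    """
    if orjson is None:
        return None
    return orjson.dumps(obj, option=orjson.OPT_NON_STR_KEYS | orjson.OPT_SORT_KEYS)


sample = {"other": 0, 86400: 2, 0: 1}
out = dumps_sorted(sample)
print(out)
if out is not None:
    # Integer keys are emitted as JSON strings.
    print(json.loads(out))
```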

Each timing is measured with `timeit` over `ITERATIONS` (500) runs.

The latencies are converted to milliseconds per iteration and stored in a results table.
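
The timing pattern can be reproduced in a few lines. This sketch times the standard json module on a small, made-up payload (`payload` is hypothetical and much smaller than the script's data), so the numbers it prints are not comparable to the script's results.

```python
import json
from timeit import timeit

ITERATIONS = 500  # same value as in the script

# Hypothetical small payload with string keys only.
payload = [{str(i): i for i in range(100)} for _ in range(10)]

# timeit returns the total time in seconds for `number` calls.
total_seconds = timeit(
    lambda: json.dumps(payload).encode("utf-8"),
    number=ITERATIONS,
)
ms_per_iter = total_seconds * 1000 / ITERATIONS
print(f"{ms_per_iter:.4f} ms per iteration")
```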

Output

Prints the size of the serialized output from orjson in KiB.
Prints a formatted Markdown table displaying benchmark results, showing the average latency in milliseconds for each library and serialization mode:

| Library | str keys (ms) | int keys (ms) | int keys sorted (ms) |
|---------|---------------|---------------|----------------------|
| orjson  | ...           | ...           | ...                  |
| json    | ...           | ...           |                      |

Usage Example

Run the script directly from the command line:

./pynonstr

Expected output includes:

- Size of the serialized JSON output in KiB.
- Per-library serialization timing results for string keys, integer keys, and integer keys sorted (where supported).

Correctness validation is performed implicitly for orjson.

Important Implementation Details and Algorithms

Handling Non-String Keys:
The JSON standard mandates string keys. orjson accepts non-string keys when OPT_NON_STR_KEYS is set and writes integer keys out as strings, which is what this benchmark measures. The standard json module also converts integer keys to strings, but it cannot sort a dictionary that mixes integer and string keys.
Sorting Keys with Mixed Types:
Sorting integer and string keys together raises TypeError in Python's standard json module because the two types cannot be compared. The benchmark skips this case for json but includes it for orjson, which handles it when OPT_SORT_KEYS is combined with OPT_NON_STR_KEYS.
Random Key Order:
Shuffling key order simulates real-world unordered dictionaries, stressing the serialization library's handling of key ordering and sorting.
CPU Affinity:
Setting CPU affinity reduces noise in timing benchmarks caused by process scheduling across CPUs.
Use of timeit for Timing:
Measures total elapsed time over repeated serialization calls; dividing by the iteration count gives a per-call average that smooths out noise in individual calls.
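
The behaviour of the standard json module with mixed keys can be checked directly. In this sketch `mixed` is a hypothetical miniature of one year's dictionary.

```python
import json

mixed = {86400: 2, "other": 0, 0: 1}

# Without sorting, integer keys are coerced to strings and order is kept.
print(json.dumps(mixed))  # {"86400": 2, "other": 0, "0": 1}

# With sorting, the keys are compared before coercion, which fails.
sort_failed = False
try:
    json.dumps(mixed, sort_keys=True)
except TypeError:
    sort_failed = True
print(sort_failed)  # True
```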

Interaction with Other Parts of the System

This script is part of a suite of benchmarking tools designed to evaluate serialization/deserialization performance and correctness across various JSON libraries, primarily comparing orjson and Python's built-in json module.
It uses orjson and tabulate modules which are dependencies of the project’s benchmarking framework.
The generated data mimics realistic JSON structures used in other benchmarking scripts in the project, enabling consistent cross-script comparisons.
It complements other benchmark scripts by focusing on serialization performance with mixed key types and sorting options.
The results could be used as input for visualization or reporting components in the broader benchmarking system.

Mermaid Flowchart

The script defines no classes or components, so its control flow is shown as a flowchart:

flowchart TD
    A["Generate Data (data_as_obj)"] --> B[Shuffle Keys and Create Dict]
    B --> C["Convert to JSON Object with String Keys (data_as_str)"]
    C --> D["Set CPU Affinity to cores 0 and 1"]
    D --> E["Measure Serialized Size (orjson)"]
    E --> F[Benchmark Loop Over Libraries]
    F --> G{"Library == 'json'?"}
    G -- Yes --> H[Time json.dumps on str keys and int keys]
    G -- No --> I[Time orjson.dumps with OPT_NON_STR_KEYS and OPT_SORT_KEYS]
    H & I --> J["Test Correctness (orjson only)"]
    J --> K[Calculate Per-Iteration Latency]
    K --> L[Store Results in Table]
    L --> M[Print Tabulated Benchmark Report]

Summary

`pynonstr` is a focused Python benchmark script that measures JSON serialization speed and checks correctness, particularly for dictionaries with mixed key types and sorted keys, comparing the high-performance `orjson` library against the standard Python `json` module. It uses synthetic date-keyed data, CPU affinity settings, and repeated timing runs to reduce measurement noise, and prints a formatted comparison table. The script complements a larger benchmarking project aimed at validating and showcasing orjson’s performance advantages in realistic serialization scenarios.