pynonstr
Overview
The `pynonstr` script is a standalone Python benchmarking utility focused on comparing the performance of JSON serialization across two libraries: the standard Python `json` module and the high-performance `orjson` library. It specifically tests serialization of large datasets containing mixed dictionary keys, including both integer and string keys, which is a common challenge for JSON serializers.
The script generates synthetic date-indexed data spanning 100 years (1920–2019), each year represented by a dictionary mapping timestamps (integer keys) and a string key `"other"` to integer values. It benchmarks serialization latency for:
Dictionaries with string keys
Dictionaries with integer keys (unordered)
Dictionaries with integer keys sorted by key
The benchmarking results are then formatted into a human-readable table using the `tabulate` library.
This file is designed to run as a standalone program and does not define reusable classes or functions outside its main benchmarking flow.
Detailed Explanation of Key Components
Global Constants and Variables
`data_as_obj`
A list of 100 dictionaries, each representing a year's worth of daily date entries.
Each dictionary maps:
Integer keys: UNIX timestamps (seconds since epoch) for each day of the year (365 entries), each mapped to its day number (1 to 365).
String key `"other"`, mapped to `0`.
The key/value pairs are shuffled before each dictionary is built, which randomizes key order.
`data_as_str`
A copy of `data_as_obj` in which all dictionary keys are strings, produced by serializing with `orjson` and deserializing with `orjson.loads`.
This simulates the JSON limitation that object keys are always strings.
`headers`
Tuple defining the column headers for the benchmark output table: `("Library", "str keys (ms)", "int keys (ms)", "int keys sorted (ms)")`
`LIBRARIES`
Tuple of library names to benchmark: `("orjson", "json")`
`ITERATIONS`
Number of iterations to run timing benchmarks per test, set to 500.
Affinity Setting
os.sched_setaffinity(os.getpid(), {0, 1})
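Sets CPU affinity for the current process to cores 0 and 1 to reduce variability in timing measurements due to OS scheduling. `os.sched_setaffinity` is only available on some Unix platforms such as Linux, so the script is not portable to macOS or Windows as written.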
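The affinity call can fail on hosts where cores 0 and 1 are not available to the process. The following is a minimal sketch of a more defensive variant, not what the script does: the helper name `pin_to_cpus` is hypothetical, and falling back to the existing affinity mask is an assumption about what a portable version might want.

```python
import os


def pin_to_cpus(wanted):
    """Pin the current process to the requested CPUs where possible.

    Returns the set of CPUs the process may run on afterwards, or None
    when the platform does not expose os.sched_setaffinity.
    (Hypothetical helper; the script calls os.sched_setaffinity directly.)
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows: leave scheduling untouched
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    # Only request CPUs the process is allowed to use; keep the current
    # mask if none of the wanted CPUs is available.
    target = (set(wanted) & allowed) or allowed
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


pinned = pin_to_cpus({0, 1})
print(pinned)
```

Pinning to a small, fixed set of cores keeps the benchmark from migrating between CPUs mid-run, which is the effect the script relies on.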
Data Generation Logic
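
    import datetime
    import random
    from time import mktime

    data_as_obj = []
    for year in range(1920, 2020):
        start = datetime.date(year, 1, 1)
        # 365 (timestamp, day number) pairs starting on January 1
        array = [
            (int(mktime((start + datetime.timedelta(days=i)).timetuple())), i + 1)
            for i in range(365)
        ]
        array.append(("other", 0))
        random.shuffle(array)  # randomize key order before building the dict
        data_as_obj.append(dict(array))
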
For each year from 1920 to 2019 (100 years total), generate 365 timestamp keys mapped to day numbers (1 to 365).
Append the string key `"other"` with value 0.
Shuffle the list to randomize key order before converting it to a dictionary.
Store each year's dictionary in `data_as_obj`.
This simulates large JSON-like data structures with mixed key types and unordered keys.
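To make the shape of one generated dictionary concrete, here is a small sketch for a single year. It is an illustration under stated assumptions rather than the script's code: the helper `build_year` is hypothetical, a seeded `random.Random` is used so the shuffle is repeatable, and `calendar.timegm` replaces `mktime` so the timestamps do not depend on the local time zone.

```python
import calendar
import datetime
import random


def build_year(year, rng):
    """Return one shuffled dict: 365 integer timestamp keys plus "other"."""
    start = datetime.date(year, 1, 1)
    pairs = [
        # timegm treats the date as UTC (the script itself uses mktime)
        (calendar.timegm((start + datetime.timedelta(days=i)).timetuple()), i + 1)
        for i in range(365)
    ]
    pairs.append(("other", 0))
    # dicts preserve insertion order, so shuffling the pairs first
    # randomizes the key order of the resulting dict
    rng.shuffle(pairs)
    return dict(pairs)


sample = build_year(1970, random.Random(0))
int_keys = [k for k in sample if isinstance(k, int)]
print(len(sample), len(int_keys), min(int_keys))  # 366 365 0
```

Each dictionary therefore holds 366 entries: 365 integer keys and one string key.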
Serialization and Timing Utilities
per_iter_latency(val)

    def per_iter_latency(val):
        # val is the total time in seconds for ITERATIONS runs, or None if skipped
        if val is None:
            return None
        return (val * 1000) / ITERATIONS

Converts total elapsed time from seconds to average milliseconds per iteration.
Returns `None` if the input is `None`.
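As a worked example of the conversion, a total of 2.5 seconds over 500 iterations is 2500 ms / 500 = 5.0 ms per iteration. The snippet below repeats the function so that it runs on its own; the 2.5 second input is an arbitrary illustrative value, not a measurement.

```python
ITERATIONS = 500  # same value the script uses


def per_iter_latency(val):
    if val is None:
        return None
    return (val * 1000) / ITERATIONS


print(per_iter_latency(2.5))   # 5.0
print(per_iter_latency(None))  # None (a skipped benchmark stays None)
```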
test_correctness(serialized)

    def test_correctness(serialized):
        # orjson.loads yields string keys, so compare against data_as_str
        return orjson.loads(serialized) == data_as_str

Verifies that the serialized JSON, when deserialized, matches the expected `data_as_str`.
Ensures correctness of serialization output (particularly for `orjson`).
Benchmark Execution Loop
The script iterates over each library in `LIBRARIES`:
For the `json` library:
Serialize `data_as_str` (string keys) and `data_as_obj` (integer keys) using `json.dumps`, then encode to UTF-8 bytes.
The sorted-keys benchmark for integer-keyed dictionaries is skipped (set to `None`) because `json.dumps` with `sort_keys=True` raises `TypeError` when it compares mixed key types.
The correctness flag is set to `False` since sorted-keys serialization cannot run.
For the `orjson` library:
Serialize `data_as_str` with `orjson.dumps` and the `OPT_NON_STR_KEYS` option to allow integer keys.
Serialize `data_as_obj` with `orjson.dumps` and `OPT_NON_STR_KEYS`.
Serialize the sorted-keys version with `OPT_NON_STR_KEYS | OPT_SORT_KEYS`.
Validate correctness by deserializing the sorted-keys output.
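The two `orjson` option combinations can be sketched on a tiny mixed-key dictionary. This is an illustration, not an excerpt from the script: the helper `dump_variants` and the sample data are hypothetical, and the import guard is only there so the snippet does nothing when `orjson` is not installed.

```python
try:
    import orjson  # third-party; not part of the standard library
except ImportError:
    orjson = None

sample = {"other": 0, 10: 1, 9: 2}  # made-up mixed-key data


def dump_variants(obj):
    """Serialize obj with orjson in insertion order and in sorted order."""
    unsorted_bytes = orjson.dumps(obj, option=orjson.OPT_NON_STR_KEYS)
    sorted_bytes = orjson.dumps(
        obj, option=orjson.OPT_NON_STR_KEYS | orjson.OPT_SORT_KEYS
    )
    return unsorted_bytes, sorted_bytes


if orjson is not None:
    plain, ordered = dump_variants(sample)
    print(plain)
    print(ordered)
    # Both outputs decode to the same string-keyed mapping
    print(orjson.loads(plain) == orjson.loads(ordered))
```

Options are combined with the bitwise OR operator, as in the sorted-keys case described above.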
Each timing is measured with `timeit` over `ITERATIONS` (500) runs.
The latencies are converted to milliseconds per iteration and stored in a results table.
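The timing pattern itself can be sketched with the standard library alone. The snippet assumes a small made-up payload and a reduced iteration count so that it finishes quickly; the script uses 500 iterations and its own generated data.

```python
import json
from timeit import timeit

ITERATIONS = 50  # the script uses 500; kept small so the sketch runs quickly
payload = [{str(i): i for i in range(1000)}]  # made-up stand-in data

# timeit returns the total time in seconds for `number` calls
total_seconds = timeit(
    lambda: json.dumps(payload).encode("utf-8"), number=ITERATIONS
)
ms_per_iteration = (total_seconds * 1000) / ITERATIONS
print(f"{ms_per_iteration:.3f} ms per iteration")
```

Timing the whole batch and dividing afterwards avoids the overhead of reading the clock around every individual call.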
Output
Prints the size of the serialized output from orjson in KiB.
Prints a formatted Markdown table displaying benchmark results, showing the average latency in milliseconds for each library and serialization mode:
| Library | str keys (ms) | int keys (ms) | int keys sorted (ms) |
|---|---|---|---|
| orjson | ... | ... | ... |
| json | ... | ... | |
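The empty last cell in the `json` row comes from the `None` stored for the skipped benchmark. The sketch below shows how such rows could be rendered as a pipe table using only the standard library. It is a stand-in for `tabulate`, not its implementation: the helper `to_markdown` is hypothetical and the numbers are placeholders, not measurements.

```python
headers = ("Library", "str keys (ms)", "int keys (ms)", "int keys sorted (ms)")
# Placeholder values for illustration only; these are not benchmark results
rows = [("orjson", 1.25, 1.5, 2.0), ("json", 10.0, 12.5, None)]


def to_markdown(headers, rows):
    """Render rows as a pipe table, leaving None cells empty."""

    def cell(value):
        if value is None:
            return ""
        return f"{value:.2f}" if isinstance(value, float) else str(value)

    lines = ["| " + " | ".join(headers) + " |", "|" + "---|" * len(headers)]
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)


table = to_markdown(headers, rows)
print(table)
```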
Usage Example
Run the script directly from the command line:
./pynonstr
Expected output includes:
Size of the serialized JSON output in KiB.
Per-library serialization timing results for string keys, integer keys, and integer keys sorted (where supported).
Correctness validation is implicitly performed for `orjson`.
Important Implementation Details and Algorithms
Handling Non-String Keys:
The JSON standard mandates string keys. `orjson` supports non-string keys via `OPT_NON_STR_KEYS`, allowing integer keys to be serialized directly, which is benchmarked here. The standard `json` module converts integer keys to strings but cannot sort mixed key types.
Sorting Keys with Mixed Types:
Sorting integer and string keys together raises `TypeError` in Python's standard `json` module because the two types are not comparable. The benchmark skips this case for `json` but includes it for `orjson`, which handles sorting with `OPT_SORT_KEYS`.
Random Key Order:
Shuffling key order simulates real-world unordered dictionaries, stressing the serialization library's handling of key ordering and sorting.
CPU Affinity:
Setting CPU affinity reduces noise in timing benchmarks caused by process scheduling across CPUs.
Use of `timeit` for Timing:
Measures total elapsed time for repeated serialization calls, so that per-call noise averages out over the 500 iterations.
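The mixed-key sorting failure is easy to reproduce with the standard library. The sketch below uses a made-up three-entry dictionary; the behavior shown is that of `json.dumps` in CPython.

```python
import json

mixed = {86400: 2, "other": 0, 0: 1}  # made-up mixed-key data

# Without sorting, integer keys are silently converted to strings
print(json.dumps(mixed))  # {"86400": 2, "other": 0, "0": 1}

# With sort_keys=True, Python has to compare int and str keys and fails
try:
    json.dumps(mixed, sort_keys=True)
    sort_failed = False
except TypeError as exc:
    sort_failed = True
    print(f"TypeError: {exc}")
```

This is why the sorted-keys column has no value for `json`.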
Interaction with Other Parts of the System
This script is part of a suite of benchmarking tools designed to evaluate serialization/deserialization performance and correctness across various JSON libraries, primarily comparing `orjson` and Python's built-in `json` module.
It uses the `orjson` and `tabulate` modules, which are dependencies of the project's benchmarking framework.
The generated data mimics realistic JSON structures used in other benchmarking scripts in the project, enabling consistent cross-script comparisons.
It complements other benchmark scripts by focusing on serialization performance with mixed key types and sorting options.
The results could be used as input for visualization or reporting components in the broader benchmarking system.
Mermaid Class/Component Diagram
Since this file is a standalone utility script without defined classes or components, a **flowchart** representing the main functions and their relationships is most appropriate:

    flowchart TD
        A["Generate Data (data_as_obj)"] --> B[Shuffle Keys and Create Dict]
        B --> C["Convert to JSON Object with String Keys (data_as_str)"]
        C --> D["Set CPU Affinity to cores 0 and 1"]
        D --> E["Measure Serialized Size (orjson)"]
        E --> F[Benchmark Loop Over Libraries]
        F --> G{"Library == 'json'?"}
        G -- Yes --> H[Time json.dumps on str keys and int keys]
        G -- No --> I[Time orjson.dumps with OPT_NON_STR_KEYS and OPT_SORT_KEYS]
        H & I --> J["Test Correctness (orjson only)"]
        J --> K[Calculate Per-Iteration Latency]
        K --> L[Store Results in Table]
        L --> M[Print Tabulated Benchmark Report]

Summary
`pynonstr` is a focused Python benchmark script designed to evaluate JSON serialization speed and correctness, especially handling mixed key types and sorting, comparing the high-performance `orjson` library against the standard Python `json` module. It uses synthetic date-keyed data, CPU affinity settings, and repeated timing measures to provide reliable latency metrics, outputting a formatted comparison table. This script complements a larger benchmarking project aimed at validating and showcasing orjson’s performance advantages in realistic serialization scenarios.