pynumpy
Overview
The `pynumpy` script is a performance benchmarking utility focused on serializing large NumPy arrays to JSON formats using different libraries. It creates a large random NumPy array of a specified data type, serializes it using both `orjson` (a fast JSON library with native NumPy support) and Python's built-in `json` module (with a custom encoder for NumPy arrays), then measures and compares:
Serialization latency (time taken)
Memory usage difference during serialization
Serialization correctness by verifying round-trip equality
The script outputs a performance comparison table highlighting latency, memory impact, and relative performance against the `orjson` baseline.
Detailed Explanation
Key Concepts
Data Type (`kind`) Selection: The script accepts a command-line argument to determine the NumPy data type for the array to be benchmarked (e.g., `float16`, `int32`, `bool`).
Array Generation: Based on the selected data type, a large NumPy array is created with random values.
Serialization Functions:
`orjson_dumps()`: Uses `orjson` with NumPy support to serialize the array.
`json_dumps()`: Uses the standard `json` module with a custom encoder for NumPy arrays.
Benchmarking:
Measures time for multiple iterations.
Tracks memory usage during serialization.
Validates correctness by deserializing and comparing with the original array.
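The array-generation step can be sketched as follows. This is a minimal illustration, not the script's actual code: the `DTYPES` mapping, the `make_array` helper, the default shape, and the value ranges are all assumptions made for the example, and only the kinds named in this document are covered.

```python
import numpy

# Hypothetical mapping from the command-line `kind` to a NumPy dtype;
# only the kinds mentioned in this document are listed.
DTYPES = {
    "float16": numpy.float16,
    "float32": numpy.float32,
    "int32": numpy.int32,
    "bool": numpy.bool_,
}


def make_array(kind, shape=(1000, 100), seed=0):
    """Return a random array of the requested kind (the shape is an assumption)."""
    dtype = numpy.dtype(DTYPES[kind])
    rng = numpy.random.default_rng(seed)
    if dtype.kind == "f":
        # Uniform floats in [0, 1), cast to the target precision.
        return rng.random(shape).astype(dtype)
    if dtype.kind == "i":
        info = numpy.iinfo(dtype)
        # The upper bound is exclusive, so info.max itself is never drawn.
        return rng.integers(info.min, info.max, size=shape, dtype=dtype)
    # Remaining case: booleans, drawn as 0/1 and cast.
    return rng.integers(0, 2, size=shape).astype(dtype)


array = make_array("int32", shape=(4, 3))
print(array.dtype, array.shape)
```

A fixed seed keeps the generated data identical between runs, which makes latency numbers easier to compare.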
Script Flow
CPU Affinity Set: Limits the script to CPU cores 0 and 1 for consistent benchmarking.
Argument Parsing: Reads the data type from command-line arguments.
Array Initialization: Creates a random NumPy array with the specified type and shape.
Serialization Function Definitions: Defines two serializing functions for `orjson` and `json`.
Memory and Latency Measurement: Uses `timeit` for latency and `memory_profiler` for memory impact.
Correctness Check: Ensures serialized and deserialized data matches the original.
Results Tabulation: Outputs a GitHub-flavored Markdown table summarizing results.
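The first two steps of the flow might look roughly like the sketch below. The helper names `pin_to_cpus` and `parse_kind`, and the `KNOWN_KINDS` tuple, are hypothetical. The guard for platforms without `os.sched_setaffinity` and the intersection with the currently allowed cores are added safeguards, not behavior this document attributes to the script.

```python
import os

# Only the kinds named in this document; the real script may accept more.
KNOWN_KINDS = ("float16", "float32", "int32", "bool")


def pin_to_cpus(cpus=(0, 1)):
    """Pin this process to the given cores where the platform allows it."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # the call is not available on every platform
    # Restrict the request to cores this process is allowed to use.
    wanted = set(cpus) & os.sched_getaffinity(0)
    if not wanted:
        return False
    os.sched_setaffinity(0, wanted)  # pid 0 means the calling process
    return True


def parse_kind(argv):
    """Return the data type named by the first command-line argument."""
    if len(argv) < 2 or argv[1] not in KNOWN_KINDS:
        choices = "|".join(KNOWN_KINDS)
        raise SystemExit(f"usage: pynumpy {{{choices}}}")
    return argv[1]


pinned = pin_to_cpus()
kind = parse_kind(["pynumpy", "float32"])  # stands in for sys.argv
print(pinned, kind)
```

Pinning before any measurement means every timed call runs under the same scheduling constraint.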
Functions
default(__obj)
Converts NumPy arrays to lists for compatibility with Python’s built-in `json` module serialization.
Parameters:
`__obj` (any): Object to serialize.
Returns:
`list` if `__obj` is a NumPy array.
Raises:
`TypeError` if the object is not a NumPy array.
Usage:
Used as the `default` parameter in `json.dumps` to handle NumPy arrays.
json.dumps(array, default=default)
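A function matching this description could be written as in the sketch below. The error message and the small sample array are assumptions for illustration; the script's own body may differ.

```python
import json

import numpy


def default(__obj):
    """Fallback encoder: turn NumPy arrays into nested Python lists."""
    if isinstance(__obj, numpy.ndarray):
        return __obj.tolist()
    # Anything else is left unsupported, as the description above states.
    raise TypeError(f"{type(__obj).__name__} is not JSON serializable")


# Small stand-in for the benchmark's large array.
array = numpy.array([[1, 2], [3, 4]], dtype=numpy.int32)
encoded = json.dumps(array, default=default)
print(encoded)
```

`json.dumps` only calls `default` for objects it cannot encode natively, so plain lists, numbers, and strings never reach this function.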
orjson_dumps()
Serializes the global `array` using `orjson` with the option to directly handle NumPy arrays.
Parameters: None.
Returns:
Serialized bytes.
Usage:
data_bytes = orjson_dumps()
json_dumps()
Serializes the global `array` using Python’s built-in `json.dumps` with a custom encoder for NumPy arrays.
Parameters: None.
Returns:
Serialized bytes (UTF-8 encoded).
Usage:
data_bytes = json_dumps()
per_iter_latency(val)
Calculates average latency per iteration in milliseconds.
Parameters:
`val` (float or None): Total time for all iterations in seconds.
Returns:
Average time per iteration in milliseconds, or `None` if the input is `None`.
Example:
avg_latency_ms = per_iter_latency(0.5) # For 10 iterations, returns 50 ms
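The conversion is simple arithmetic: seconds to milliseconds, then divided by the iteration count. A sketch consistent with the description, assuming a module-level `ITERATIONS` constant set to the default of 10:

```python
ITERATIONS = 10  # default iteration count stated in this document


def per_iter_latency(val):
    """Convert a total time in seconds to milliseconds per iteration."""
    if val is None:
        return None
    return (val * 1000) / ITERATIONS


print(per_iter_latency(0.5))
print(per_iter_latency(None))
```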
test_correctness(func)
Validates serialization correctness by serializing and then deserializing the global `array`, comparing the result with the original array.
Parameters:
`func` (callable): Serialization function returning serialized bytes.
Returns:
`True` if the deserialized data matches the original array; `False` otherwise.
Usage:
is_correct = test_correctness(orjson_dumps)
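One way to implement such a round-trip check is sketched below. The helper is named `round_trips` here to make clear it is an illustration rather than the script's own `test_correctness`. Decoding with `json.loads` and the small integer array are assumptions.

```python
import json

import numpy

# Small stand-in for the benchmark's large global array.
array = numpy.arange(6, dtype=numpy.int32).reshape(2, 3)


def json_dumps():
    """Serialize the global array to UTF-8 encoded JSON bytes."""
    return json.dumps(array.tolist()).encode("utf-8")


def round_trips(func):
    """Serialize with func, decode, and compare against the original array."""
    decoded = numpy.array(json.loads(func()), dtype=array.dtype)
    # array_equal is False when shapes differ, so truncated output is caught.
    return bool(numpy.array_equal(decoded, array))


print(round_trips(json_dumps))
print(round_trips(lambda: b"[[0, 1, 2]]"))  # deliberately incomplete payload
```

Rebuilding the decoded data with the original dtype keeps the comparison about values and shape rather than about how JSON represents numbers.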
Important Implementation Details
CPU Affinity: The script restricts execution to CPUs 0 and 1 using `os.sched_setaffinity` to reduce variability in timing due to CPU scheduling.
Memory Profiling: Uses `memory_profiler.memory_usage()` to track peak memory usage during serialization calls.
Timing: Uses `timeit.timeit()` to measure total serialization time over multiple iterations (default 10).
Output Formatting: Uses the `tabulate` library to produce a neat Markdown table summarizing results.
Comparison Metric: Latency of each library is compared against `orjson` as a baseline (ratio shown).
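The timing and comparison steps can be illustrated with the standard library alone. In this sketch two `json.dumps` configurations stand in for `orjson` and `json` so that it runs without third-party packages. The names `measure`, `relative_to`, `compact_dumps`, and `indented_dumps`, and the sample payload, are hypothetical.

```python
import json
from timeit import timeit

ITERATIONS = 10  # default iteration count stated in this document

# Stand-in data; the real script serializes a large NumPy array.
payload = [[float(i + j) for j in range(100)] for i in range(100)]


def compact_dumps():
    """Baseline stand-in: compact separators."""
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


def indented_dumps():
    """Comparison stand-in: indented output."""
    return json.dumps(payload, indent=2).encode("utf-8")


def measure(func):
    """Return the average latency of func in milliseconds per call."""
    total_seconds = timeit(func, number=ITERATIONS)
    return total_seconds * 1000 / ITERATIONS


def relative_to(baseline_ms, latency_ms):
    """Express a latency as a multiple of the baseline latency."""
    return latency_ms / baseline_ms


latencies = {
    "compact": measure(compact_dumps),
    "indented": measure(indented_dumps),
}
baseline = latencies["compact"]
for name, ms in latencies.items():
    print(f"{name}: {ms:.3f} ms, {relative_to(baseline, ms):.1f}x baseline")
```

With this ratio the baseline always reports 1.0, and a value of 5.0 means five times the baseline latency.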
Interaction with Other System Components
Dependencies:
`numpy`: For array creation and manipulation.
`orjson`: High-performance JSON serialization, especially optimized for NumPy arrays.
`json`: Python standard library for JSON serialization.
`psutil`: Process and system utilities, used here for memory info.
`memory_profiler`: For measuring memory usage during function execution.
`tabulate`: For formatting output tables.
Input: Command-line argument specifying the data type to benchmark.
Output: Console output with serialization size and performance table.
No external file or database interactions are made; this is a standalone benchmarking utility.
Usage Example
python pynumpy float32
This command benchmarks serialization of a large float32 NumPy array.
Expected output snippet:
xx.xMiB float32 output (orjson)
orjson...
json...
| Library | Latency (ms) | RSS diff (MiB) | vs. orjson |
|---------|--------------|----------------|------------|
| orjson | 100 | 10 | 1.0 |
| json | 500 | 20 | 5.0 |
Mermaid Flowchart
This file is a utility script with key functions and no classes, so the diagram below is a **flowchart** illustrating the main functions and their relationships within the benchmarking workflow.
flowchart TD
A[Start: Set CPU Affinity] --> B[Parse Command-Line Arg]
B --> C[Generate NumPy Array]
C --> D[Define Serialization Functions]
D --> E["Measure orjson_dumps()"]
D --> F["Measure json_dumps()"]
E --> G["Test Correctness (orjson)"]
F --> H["Test Correctness (json)"]
G --> I[Measure Memory Usage]
H --> I
I --> J[Calculate Latency & Memory Differences]
J --> K[Build Result Table]
K --> L[Print Table & Exit]
Summary
The `pynumpy` script is a specialized benchmarking tool designed for comparing JSON serialization performance of large NumPy arrays between `orjson` and Python's `json` module. It provides detailed latency and memory usage data, including correctness verification, helping developers choose optimal serialization strategies for numerical data in JSON format.