pynumpy

Overview

The `pynumpy` script is a benchmarking utility that measures how quickly large NumPy arrays serialize to JSON with different libraries. It creates a large random NumPy array of a specified data type, serializes it with both `orjson` (a fast JSON library with native NumPy support) and Python's built-in `json` module (with a custom encoder for NumPy arrays), then measures and compares:

- Serialization latency (time taken)
- Memory usage difference during serialization
- Serialization correctness, verified by round-trip equality

The script outputs a performance comparison table highlighting latency, memory impact, and relative performance against the `orjson` baseline.

Detailed Explanation

Key Concepts

Data Type (kind) Selection: The script accepts a command-line argument to determine the NumPy data type for the array to be benchmarked (e.g., float16, int32, bool).
Array Generation: Based on the selected data type, a large NumPy array is created with random values.
Serialization Functions:
- orjson_dumps(): Uses orjson with NumPy support to serialize the array.
- json_dumps(): Uses the standard json module with a custom encoder for NumPy arrays.
Benchmarking:
- Measures time for multiple iterations.
- Tracks memory usage during serialization.
- Validates correctness by deserializing and comparing with the original array.
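
The data type selection and array generation described above might look roughly like the sketch below. The `DTYPES` mapping, the `make_array` helper, the default shape, and the seed are all assumptions made for illustration; the actual script may use different shapes and a different way of choosing values per type.

```python
import numpy

# Hypothetical mapping from the command-line "kind" to a NumPy dtype.
DTYPES = {
    "float16": numpy.float16,
    "float32": numpy.float32,
    "int32": numpy.int32,
    "bool": numpy.bool_,
}


def make_array(kind, shape=(1000, 100), seed=0):
    """Return a random array of the requested kind (shape and seed are assumptions)."""
    dtype = DTYPES[kind]  # raises KeyError for an unsupported kind
    rng = numpy.random.default_rng(seed)
    if kind == "bool":
        # Random 0/1 values converted to booleans.
        return rng.integers(0, 2, size=shape).astype(dtype)
    if kind.startswith("int"):
        # Random integers drawn from the dtype's range (upper bound exclusive).
        info = numpy.iinfo(dtype)
        return rng.integers(info.min, info.max, size=shape, dtype=dtype)
    # Floats in [0, 1), narrowed to the requested precision.
    return rng.random(size=shape).astype(dtype)


array = make_array("float32")
print(array.dtype, array.shape)
```

Keeping the dtype lookup in one table makes an unsupported `kind` fail early with a `KeyError` rather than partway through the benchmark.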

Script Flow

CPU Affinity Set: Limits the script to CPU cores 0 and 1 for consistent benchmarking.
Argument Parsing: Reads the data type from command-line arguments.
Array Initialization: Creates a random NumPy array with the specified type and shape.
Serialization Function Definitions: Defines one serialization function for each library, `orjson_dumps()` and `json_dumps()`.
Memory and Latency Measurement: Uses timeit for latency and memory_profiler for memory impact.
Correctness Check: Ensures serialized and deserialized data matches the original.
Results Tabulation: Outputs a GitHub-flavored Markdown table summarizing results.
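
As an illustration of the first step in this flow, the sketch below pins the current process to a set of CPU cores. `os.sched_setaffinity` is only available on some platforms (notably Linux), so the hypothetical `pin_to_cpus` helper checks for it first. The fallback to the currently allowed cores is an addition for robustness and is not something the script is known to do.

```python
import os


def pin_to_cpus(cpus):
    """Pin the current process to `cpus`; return the resulting set, or None if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # the call is not exposed on every platform
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    # Assumption: use only the requested cores that are actually available,
    # and keep the current affinity if none of them are.
    target = (set(cpus) & allowed) or allowed
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


pinned = pin_to_cpus({0, 1})
print(pinned)
```

Pinning to a small, fixed set of cores reduces the chance that the scheduler migrates the process mid-measurement, which is one source of timing noise.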

Functions

`default(__obj)`

Converts NumPy arrays to lists for compatibility with Python’s built-in `json` module serialization.

Parameters:
__obj (any) — Object to serialize.
Returns:
list if __obj is a NumPy array.
Raises:
TypeError if the object is not a NumPy array.
Usage:
Used as the default parameter in json.dumps to handle NumPy arrays.

json.dumps(array, default=default)
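
A minimal sketch of how such a `default` hook behaves, assuming it is implemented with `ndarray.tolist()` as the description suggests. The `small` array and the set used to trigger the error are illustrative only.

```python
import json

import numpy


def default(__obj):
    # json.dumps calls this only for objects it cannot serialize on its own.
    if isinstance(__obj, numpy.ndarray):
        return __obj.tolist()  # nested lists of plain Python scalars
    raise TypeError


small = numpy.array([[1, 2], [3, 4]], dtype=numpy.int32)
encoded = json.dumps(small, default=default)
print(encoded)  # [[1, 2], [3, 4]]

# Anything that is neither JSON-native nor a NumPy array is still rejected.
try:
    json.dumps({1, 2}, default=default)
except TypeError:
    print("set is not serializable")
```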

`orjson_dumps()`

Serializes the global `array` using `orjson` with the option to directly handle NumPy arrays.

Parameters: None.
Returns:
Serialized bytes.
Usage:

data_bytes = orjson_dumps()

`json_dumps()`

Serializes the global `array` using Python’s built-in `json.dumps` with a custom encoder for NumPy arrays.

Parameters: None.
Returns:
Serialized bytes (UTF-8 encoded).
Usage:

data_bytes = json_dumps()

`per_iter_latency(val)`

Calculates average latency per iteration in milliseconds.

Parameters:
val (float or None) — Total time for all iterations in seconds.
Returns:
Average time per iteration in milliseconds or None if input is None.
Example:

avg_latency_ms = per_iter_latency(0.5)  # For 10 iterations, returns 50 ms
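
The arithmetic behind that example can be sketched as follows, assuming a module-level `ITERATIONS` constant equal to the default of 10 iterations. The constant's name is an assumption.

```python
ITERATIONS = 10  # assumed to match the script's default iteration count


def per_iter_latency(val):
    """Convert a total time in seconds into milliseconds per iteration."""
    if val is None:
        return None
    return (val * 1000) / ITERATIONS


print(per_iter_latency(0.5))   # 50.0
print(per_iter_latency(None))  # None
```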

`test_correctness(func)`

Validates serialization correctness by serializing and then deserializing the global `array`, comparing the result with the original array.

Parameters:
func (callable) — Serialization function returning serialized bytes.
Returns:
True if deserialized data matches original array; False otherwise.
Usage:

is_correct = test_correctness(orjson_dumps)
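
The round-trip idea can be sketched as below. This is not the script's code: the helper is named `round_trip_ok` here as a hypothetical stand-in for `test_correctness`, it decodes with the standard `json` module rather than whatever decoder the script uses, and the small `array` and the deliberately wrong `broken_dumps` are illustrative.

```python
import json

import numpy

# Small stand-in for the large global array the script benchmarks.
array = numpy.arange(12, dtype=numpy.int32).reshape(3, 4)


def json_dumps():
    # Simplified serializer: convert to nested lists, then encode as UTF-8 bytes.
    return json.dumps(array.tolist()).encode("utf-8")


def broken_dumps():
    # Deliberately wrong output, to show a failing check.
    return b"[[0, 1], [2, 3]]"


def round_trip_ok(func):
    """Serialize with `func`, decode, rebuild the array, and compare to the original."""
    decoded = json.loads(func())
    restored = numpy.array(decoded, dtype=array.dtype)
    return bool(numpy.array_equal(array, restored))


print(round_trip_ok(json_dumps))    # True
print(round_trip_ok(broken_dumps))  # False
```

Rebuilding the array with the original dtype matters: without it, decoded values would default to NumPy's widest integer or float type, though `numpy.array_equal` compares values rather than dtypes.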

Important Implementation Details

CPU Affinity: The script restricts execution to CPUs 0 and 1 using os.sched_setaffinity to reduce variability in timing due to CPU scheduling.
Memory Profiling: Uses memory_profiler.memory_usage() to track peak memory usage during serialization calls.
Timing: Uses timeit.timeit() to measure total serialization time over multiple iterations (default 10).
Output Formatting: Uses tabulate library to produce a neat Markdown table summarizing results.
Comparison Metric: Latency of each library is compared against orjson as a baseline (ratio shown).
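
The timing and baseline comparison described above can be sketched with the standard library alone. The `fast` and `slow` functions are placeholders for the two serializers, and the `compare` helper is hypothetical; only the use of `timeit.timeit` over a fixed number of iterations and the ratio against a baseline come from the description.

```python
from timeit import timeit

ITERATIONS = 10  # assumed default iteration count


def fast():
    return sum(range(1_000))


def slow():
    return sum(range(100_000))


def compare(funcs, baseline, iterations=ITERATIONS):
    """Return {name: (milliseconds per call, ratio against the baseline)}."""
    latency = {
        # timeit returns the total seconds for `iterations` calls.
        name: timeit(func, number=iterations) * 1000 / iterations
        for name, func in funcs.items()
    }
    return {name: (ms, ms / latency[baseline]) for name, ms in latency.items()}


results = compare({"fast": fast, "slow": slow}, baseline="fast")
for name, (ms, ratio) in results.items():
    print(f"{name}: {ms:.3f} ms per call, {ratio:.1f}x the baseline")
```

The baseline always has a ratio of exactly 1.0, which is why the `orjson` row in the result table reads `1.0` in the "vs. orjson" column.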

Interaction with Other System Components

Dependencies:
- numpy: For array creation and manipulation.
- orjson: High-performance JSON serialization with native support for NumPy arrays.
- json: Python standard library for JSON serialization.
- psutil: Process and system utilities, used here for memory info.
- memory_profiler: For measuring memory usage during function execution.
- tabulate: For formatting output tables.
Input: Command-line argument specifying the data type to benchmark.
Output: Console output showing the size of the `orjson`-serialized data, followed by the performance table.
No external file or database interactions are made; this is a standalone benchmarking utility.

Usage Example

python pynumpy float32

This command benchmarks serialization of a large float32 NumPy array.

Expected output snippet:

xx.xMiB float32 output (orjson)
orjson...
json...
| Library | Latency (ms) | RSS diff (MiB) | vs. orjson |
|---------|--------------|----------------|------------|
| orjson  | 100          | 10             | 1.0        |
| json    | 500          | 20             | 5.0        |

Mermaid Flowchart

This file is a utility script made up of functions, with no classes, so the diagram below is a **flowchart** of the benchmarking workflow and how the main functions fit into it.

flowchart TD
    A[Start: Set CPU Affinity] --> B[Parse Command-Line Arg]
    B --> C[Generate NumPy Array]
    C --> D[Define Serialization Functions]
    D --> E["Measure Latency: orjson_dumps()"]
    D --> F["Measure Latency: json_dumps()"]
    E --> G[Measure Memory Usage]
    F --> G
    G --> H[Test Correctness]
    H --> J[Calculate Latency & Memory Differences]
    J --> K[Build Result Table]
    K --> L[Print Table & Exit]

Summary

The `pynumpy` script is a specialized benchmarking tool designed for comparing JSON serialization performance of large NumPy arrays between `orjson` and Python's `json` module. It provides detailed latency and memory usage data, including correctness verification, helping developers choose optimal serialization strategies for numerical data in JSON format.