pynumpy


Overview

The `pynumpy` script is a benchmarking utility that measures how quickly large NumPy arrays can be serialized to JSON with different libraries. It creates a large random NumPy array of a specified data type and serializes it with both `orjson` (a fast JSON library with native NumPy support) and Python's built-in `json` module (with a custom encoder for NumPy arrays). It then measures and compares the serialization latency and memory impact of each library and checks that each output deserializes back to the original array.

The script outputs a performance comparison table highlighting latency, memory impact, and relative performance against the `orjson` baseline.


Detailed Explanation

Key Concepts


Script Flow

  1. CPU Affinity Set: Limits the script to CPU cores 0 and 1 for consistent benchmarking.

  2. Argument Parsing: Reads the data type from command-line arguments.

  3. Array Initialization: Creates a random NumPy array with the specified type and shape.

  4. Serialization Function Definitions: Defines one serialization function for `orjson` and one for `json`.

  5. Memory and Latency Measurement: Uses `timeit` for latency and `memory_profiler` for memory impact.

  6. Correctness Check: Ensures serialized and deserialized data matches the original.

  7. Results Tabulation: Outputs a GitHub-flavored Markdown table summarizing results.
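
The CPU pinning in step 1 can be sketched with the standard library alone. This is a minimal illustration, not the script's actual code: the helper name `pin_to_cpus` is hypothetical, and the guards for unsupported platforms and unavailable cores are assumptions added so the sketch runs anywhere.

```python
import os


def pin_to_cpus(wanted):
    """Try to restrict this process to the CPUs in `wanted`.

    Returns the affinity set in effect afterwards, or an empty set on
    platforms without os.sched_setaffinity (for example macOS or Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    allowed = os.sched_getaffinity(0)  # 0 means "the current process"
    target = set(wanted) & allowed     # only request cores we may use
    if target:
        try:
            os.sched_setaffinity(0, target)
        except OSError:
            pass  # sandboxed environments may refuse; keep current affinity
    return set(os.sched_getaffinity(0))


# Cores 0 and 1, as described in step 1 of the script flow.
active_cpus = pin_to_cpus({0, 1})
print(f"running on CPUs: {sorted(active_cpus)}")
```

Pinning to a fixed pair of cores reduces run-to-run variation caused by the scheduler migrating the process between cores.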


Functions

default(__obj)

Converts NumPy arrays to lists for compatibility with Python’s built-in `json` module serialization.

json.dumps(array, default=default)
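
A fallback of this kind might look like the sketch below. It assumes the function only handles `numpy.ndarray` and raises `TypeError` for anything else, which is the contract `json.dumps` expects from a `default` callable; the parameter name `obj` and the sample array are illustrative.

```python
import json

import numpy as np


def default(obj):
    """Fallback for json.dumps: convert NumPy arrays to nested lists."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    # Anything else is unsupported; json.dumps expects a TypeError here.
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


array = np.array([[1, 2], [3, 4]])
encoded = json.dumps(array, default=default)
print(encoded)
```

Because `tolist()` builds a full Python list of Python numbers before encoding, this path allocates far more intermediate objects than a serializer that reads the array buffer directly.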

orjson_dumps()

Serializes the global `array` using `orjson` with the option to directly handle NumPy arrays.

data_bytes = orjson_dumps()

json_dumps()

Serializes the global `array` using Python’s built-in `json.dumps` with a custom encoder for NumPy arrays.

data_bytes = json_dumps()

per_iter_latency(val)

Converts a total elapsed time in seconds, as returned by `timeit`, into the average latency per iteration in milliseconds.

avg_latency_ms = per_iter_latency(0.5)  # For 10 iterations, returns 50 ms
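
The conversion can be sketched as follows, assuming an iteration count of 10 (taken from the example above) and a total time in seconds as produced by `timeit`. The `iterations` parameter and the `workload` function are hypothetical stand-ins for illustration.

```python
from timeit import timeit

ITERATIONS = 10  # assumed; matches the "10 iterations" in the example above


def per_iter_latency(total_seconds, iterations=ITERATIONS):
    """Convert a total run time in seconds to milliseconds per iteration."""
    return (total_seconds * 1000) / iterations


def workload():
    # Stand-in for a serialization call such as orjson_dumps().
    return sum(range(10_000))


total = timeit(workload, number=ITERATIONS)  # total seconds for all runs
avg_latency_ms = per_iter_latency(total)
print(f"{avg_latency_ms:.3f} ms per iteration")
```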

test_correctness(func)

Validates serialization correctness by serializing and then deserializing the global `array`, comparing the result with the original array.

is_correct = test_correctness(orjson_dumps)
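
A round-trip check in this spirit could be sketched as below. The helper name `check_round_trip`, its explicit `array` parameter, and the use of `json.loads` for deserialization are assumptions for illustration; the script's own function works on the global `array`.

```python
import json

import numpy as np


def check_round_trip(serialize, array):
    """Serialize, deserialize, and compare the result with the original."""
    restored = np.array(json.loads(serialize(array)), dtype=array.dtype)
    return np.array_equal(array, restored)


def json_dumps(array):
    # Stand-in serializer: convert to lists, then encode as UTF-8 JSON.
    return json.dumps(array.tolist()).encode("utf-8")


# Seeded generator so the sample data is reproducible.
array = np.random.default_rng(0).random((4, 3)).astype(np.float32)
is_correct = check_round_trip(json_dumps, array)
print(is_correct)
```

Rebuilding the array with the original `dtype` matters: without it, values parsed from JSON would come back as 64-bit floats and the comparison would hide any precision differences introduced by the cast.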

Important Implementation Details


Interaction with Other System Components


Usage Example

python pynumpy float32

This command benchmarks serialization of a large float32 NumPy array.

Expected output (all values shown are illustrative placeholders, not measured results):

xx.xMiB float32 output (orjson)
orjson...
json...
| Library | Latency (ms) | RSS diff (MiB) | vs. orjson |
|---------|--------------|----------------|------------|
| orjson  | 100          | 10             | 1.0        |
| json    | 500          | 20             | 5.0        |
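
The "vs. orjson" column is each library's latency divided by the `orjson` latency. The sketch below builds a table of this shape by hand; the function `build_table` is hypothetical, the input numbers are the placeholder values from the sample above, and the script itself may use a table-formatting library instead.

```python
def build_table(latencies_ms, rss_diff_mib, baseline="orjson"):
    """Render a GitHub-flavored Markdown table of benchmark results."""
    headers = ("Library", "Latency (ms)", "RSS diff (MiB)", "vs. orjson")
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for lib, latency in latencies_ms.items():
        ratio = latency / latencies_ms[baseline]  # relative to the baseline
        cells = (lib, f"{latency:,.0f}", f"{rss_diff_mib[lib]:,.0f}", f"{ratio:.1f}")
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = build_table(
    {"orjson": 100.0, "json": 500.0},  # placeholder latencies in ms
    {"orjson": 10.0, "json": 20.0},    # placeholder RSS differences in MiB
)
print(table)
```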

Workflow Diagram

The script defines functions but no classes, so the Mermaid **flowchart** below shows the main steps of the benchmarking workflow and how they connect.

flowchart TD
    A["Start: Set CPU Affinity"] --> B["Parse Command-Line Argument"]
    B --> C["Generate NumPy Array"]
    C --> D["Define Serialization Functions"]
    D --> E["Measure orjson_dumps()"]
    D --> F["Measure json_dumps()"]
    E --> G["Test Correctness (orjson)"]
    F --> H["Test Correctness (json)"]
    G --> I["Measure Memory Usage"]
    H --> I
    I --> J["Calculate Latency & Memory Differences"]
    J --> K["Build Result Table"]
    K --> L["Print Table & Exit"]

Summary

The `pynumpy` script is a specialized benchmarking tool designed for comparing JSON serialization performance of large NumPy arrays between `orjson` and Python's `json` module. It provides detailed latency and memory usage data, including correctness verification, helping developers choose optimal serialization strategies for numerical data in JSON format.