pydataclass

Overview

The `pydataclass` script is a Python benchmarking utility designed to measure and compare the serialization performance of Python dataclasses versus dictionaries using two JSON libraries: the standard `json` module and the high-performance Rust-backed `orjson` library. It benchmarks the time taken to serialize a large collection of complex nested objects, represented both as Python dataclasses and as dictionaries, repeatedly over multiple iterations.

The key goals of this script are:

- To demonstrate the serialization overhead of dataclass instances relative to equivalent dictionary representations.
- To compare serialization speed across the two JSON libraries.
- To highlight `orjson`'s native serialization of dataclasses, requested here with the `OPT_SERIALIZE_DATACLASS` option.

The script outputs the serialized data size in KiB and presents a table summarizing the serialization latency (in milliseconds) for each library and data format, along with relative performance comparisons against `orjson`.

Detailed Explanation

Imports and Setup

- dataclasses: Provides the @dataclass decorator and related utilities for defining lightweight data containers.
- io: Used to create an in-memory string buffer for tabulated output.
- json: Python’s built-in JSON serialization/deserialization module.
- os: Used here to set CPU affinity to cores {0, 1} for consistent benchmarking.
- timeit: Used to measure execution time of serialization functions.
- tabulate: Formats output results into a readable table.
- orjson: A fast JSON library implemented in Rust, exposing optimized serialization APIs.

os.sched_setaffinity(os.getpid(), {0, 1})

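*This restricts the process to CPU cores 0 and 1, which reduces timing variability caused by the scheduler migrating it between cores. `os.sched_setaffinity` is only available on some Unix platforms (notably Linux), so the script fails at this line on macOS and Windows.*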
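A minimal sketch of a more portable variant, assuming it is acceptable to skip pinning where the platform does not support it. The helper name `pin_to_cores` is hypothetical and not part of the script.

```python
import os


def pin_to_cores(cores):
    """Try to pin the current process to `cores`.

    Returns the resulting affinity set, or None when pinning is not possible.
    """
    # sched_setaffinity only exists on some Unix platforms (e.g. Linux).
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Only request cores this process is actually allowed to run on.
    wanted = set(cores) & os.sched_getaffinity(0)
    if not wanted:
        return None
    try:
        os.sched_setaffinity(0, wanted)  # pid 0 means "the calling process"
    except OSError:
        return None
    return os.sched_getaffinity(0)


pinned = pin_to_cores({0, 1})
print("pinned to:", pinned)
```

Returning None instead of raising lets a benchmark continue unpinned, at the cost of noisier timings on those platforms.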

Classes

`Member`

@dataclasses.dataclass
class Member:
    id: int
    active: bool

Purpose: Represents a member entity with an integer ID and an active status flag.
Fields:
- id (int): Unique identifier for the member.
- active (bool): Status indicating if the member is active.

`Object`

@dataclasses.dataclass
class Object:
    id: int
    name: str
    members: list[Member]

Purpose: Represents a complex object containing an ID, a name string, and a list of Member instances.
Fields:
- id (int): Unique identifier for the object.
- name (str): The decimal ID repeated three times (for example, "100000100000100000" for ID 100000).
- members (list of Member): List of associated members.

Data Preparation

objects_as_dataclass: A list of 2,000 Object instances, each containing 10 Member instances with active=True.

objects_as_dataclass = [
    Object(i, str(i) * 3, [Member(j, True) for j in range(10)])
    for i in range(100000, 102000)
]

objects_as_dict: The equivalent list of dictionaries obtained by converting each dataclass instance to a dictionary using dataclasses.asdict().

objects_as_dict = [dataclasses.asdict(each) for each in objects_as_dataclass]

This setup creates two structurally equivalent datasets encoded differently: one as dataclasses and one as dictionaries.
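To make the equivalence concrete, the following sketch builds one object with two members (instead of the script's ten, to keep the output short) and shows that `dataclasses.asdict()` converts the nested `Member` instances as well.

```python
import dataclasses


@dataclasses.dataclass
class Member:
    id: int
    active: bool


@dataclasses.dataclass
class Object:
    id: int
    name: str
    members: list[Member]


# Two members rather than ten, purely for readability.
obj = Object(100000, str(100000) * 3, [Member(j, True) for j in range(2)])

# asdict() recurses into the list and converts each nested dataclass too.
as_dict = dataclasses.asdict(obj)
print(as_dict)
# {'id': 100000, 'name': '100000100000100000',
#  'members': [{'id': 0, 'active': True}, {'id': 1, 'active': True}]}
```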

Serialization Size Output

The script first outputs the size of the serialized data (in KiB) when serialized using `orjson` on the dictionary representation:

output_in_kib = len(orjson.dumps(objects_as_dict)) / 1024
print(f"{output_in_kib:,.0f}KiB output (orjson)")

This gives a rough sense of the serialized JSON payload size for the benchmarked data.
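The payload size can be checked without `orjson`. The sketch below uses the standard `json` module with compact separators as a stand-in, on the assumption that this yields the same bytes as `orjson.dumps` for this ASCII-only data (no whitespace, same key order).

```python
import dataclasses
import json


@dataclasses.dataclass
class Member:
    id: int
    active: bool


@dataclasses.dataclass
class Object:
    id: int
    name: str
    members: list[Member]


objects_as_dict = [
    dataclasses.asdict(Object(i, str(i) * 3, [Member(j, True) for j in range(10)]))
    for i in range(100000, 102000)
]

# Compact separators remove the spaces json.dumps inserts by default.
payload = json.dumps(objects_as_dict, separators=(",", ":")).encode("utf-8")
size_bytes = len(payload)
label = f"{size_bytes / 1024:,.0f}KiB"
print(size_bytes, label)
```

Each object serializes to 283 bytes, so 2,000 objects plus 1,999 commas and two brackets give 568,001 bytes, which the script's format string rounds to 555KiB.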

Custom Default Function for `json.dumps`

def default(__obj):
    if dataclasses.is_dataclass(__obj):
        return dataclasses.asdict(__obj)

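Passed as the default argument of json.dumps, which calls it for any object the standard json module cannot serialize on its own, including dataclass instances.
It returns the dataclasses.asdict() representation when the object is a dataclass. For any other object it falls through and implicitly returns None, which json.dumps then writes as null instead of raising TypeError; this does not matter for the benchmark data, which contains only dataclasses and built-in types.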
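Outside a benchmark, a fallback hook usually should reject objects it does not understand. The following is a sketch of such a variant; `strict_default` is a hypothetical name and is not what the script uses.

```python
import dataclasses
import json


@dataclasses.dataclass
class Member:
    id: int
    active: bool


def strict_default(obj):
    """Hypothetical stricter variant of the script's `default` hook."""
    # is_dataclass() is also true for dataclass *types*, so exclude classes.
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return dataclasses.asdict(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


encoded = json.dumps(Member(1, True), default=strict_default)
print(encoded)  # {"id": 1, "active": true}

try:
    json.dumps(object(), default=strict_default)
    rejected = False
except TypeError:
    rejected = True
print(rejected)  # True
```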

Benchmark Setup

Headers for output table:

headers = ("Library", "dict (ms)", "dataclass (ms)", "vs. orjson")

Libraries to benchmark:

LIBRARIES = ("orjson", "json")

Number of iterations per benchmark:

ITERATIONS = 100

Utility Function: `per_iter_latency`

def per_iter_latency(val):
    if val is None:
        return None
    return (val * 1000) / ITERATIONS

Converts the total time reported by timeit (in seconds, for all ITERATIONS runs) into the average latency per iteration in milliseconds. A None input is passed through unchanged.
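As a worked example, a total of 1.5 seconds over 100 iterations corresponds to 15 ms per iteration. The input value is made up for illustration.

```python
ITERATIONS = 100


def per_iter_latency(val):
    if val is None:
        return None
    # seconds -> milliseconds, then divide by the number of runs
    return (val * 1000) / ITERATIONS


print(per_iter_latency(1.5))   # 15.0
print(per_iter_latency(None))  # None
```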

Benchmarking Loop

For each library in `LIBRARIES`, the script measures serialization latency for:

- Dictionary data: serializing objects_as_dict.
- Dataclass data: serializing objects_as_dataclass.

For `json`:

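- Uses json.dumps, passing the default function above as the fallback when serializing dataclasses.
- Encodes the resulting str to UTF-8 bytes, because orjson.dumps returns bytes and both libraries should be timed producing the same output type.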

For `orjson`:

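- Uses orjson.dumps directly for dictionaries.
- Uses orjson.dumps with the OPT_SERIALIZE_DATACLASS option for dataclasses. orjson version 2 required this option to serialize dataclasses natively; in version 3 dataclasses are serialized natively by default and the option is deprecated and has no effect.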

Each measurement is taken with `timeit.timeit(..., number=ITERATIONS)`, which returns the total time in seconds for all `ITERATIONS` calls.
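The `json` half of the loop can be sketched with the standard library alone. This is an approximation of what the description above implies, not the script's exact code: the data set and iteration count are reduced so it runs quickly, and the `orjson` branch is omitted.

```python
import dataclasses
import json
from timeit import timeit

ITERATIONS = 10  # the script uses 100; reduced here for speed


@dataclasses.dataclass
class Member:
    id: int
    active: bool


@dataclasses.dataclass
class Object:
    id: int
    name: str
    members: list[Member]


def default(obj):
    # Same fallback idea as the script: convert dataclass instances to dicts.
    if dataclasses.is_dataclass(obj):
        return dataclasses.asdict(obj)


# 200 objects instead of 2,000 (an assumption made for brevity).
objects_as_dataclass = [
    Object(i, str(i) * 3, [Member(j, True) for j in range(10)])
    for i in range(100000, 100200)
]
objects_as_dict = [dataclasses.asdict(each) for each in objects_as_dataclass]

as_dict = timeit(
    lambda: json.dumps(objects_as_dict).encode("utf-8"),
    number=ITERATIONS,
)
as_dataclass = timeit(
    lambda: json.dumps(objects_as_dataclass, default=default).encode("utf-8"),
    number=ITERATIONS,
)

# timeit returns total seconds; convert to milliseconds per iteration.
print(f"dict:      {as_dict * 1000 / ITERATIONS:.2f} ms")
print(f"dataclass: {as_dataclass * 1000 / ITERATIONS:.2f} ms")
```

Both calls produce identical JSON; the dataclass timing is higher mainly because `default` runs `dataclasses.asdict()` on every object inside the timed region.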

Relative Performance Calculation

- compared_to_orjson is set to 1 for orjson (the baseline).
- For json, it is the json dataclass latency divided by the orjson dataclass latency, truncated to an integer with int().
- The result indicates roughly how many times slower json is than orjson on dataclasses; because of the truncation, a ratio of 22.9 is reported as 22.
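A short worked example of the ratio, using illustrative latencies rather than measured ones:

```python
# Illustrative per-iteration latencies in milliseconds (not measurements).
orjson_as_dataclass = 15.67
json_as_dataclass = 345.67

# int() truncates toward zero, so 22.06 becomes 22.
compared_to_orjson = int(json_as_dataclass / orjson_as_dataclass)
print(compared_to_orjson)  # 22
```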

Results Table Formatting and Output

Collected metrics are formatted and appended to a table.
The table is printed using tabulate in GitHub-flavored Markdown format.

Example output table:

| Library | dict (ms) | dataclass (ms) | vs. orjson |
|---------|-----------|----------------|------------|
| orjson  | 12.34     | 15.67          | 1          |
| json    | 234.56    | 345.67         | 22         |

Usage Example

Run the script directly:

./pydataclass

Expected output (the timings are illustrative and vary with hardware; the payload size follows from the data defined above):

555KiB output (orjson)
| Library | dict (ms) | dataclass (ms) | vs. orjson |
|---------|-----------|----------------|------------|
| orjson  | 12.34     | 15.67          | 1          |
| json    | 234.56    | 345.67         | 22         |

Important Implementation Details

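- Setting CPU affinity reduces run-to-run variance by limiting execution to cores 0 and 1; it does not eliminate noise from other processes on those cores.
- The dataclass path passes orjson's OPT_SERIALIZE_DATACLASS flag so that orjson serializes the instances natively, without the caller first converting them with dataclasses.asdict().
- The fallback default function for json.dumps is necessary because json does not natively support dataclasses. It calls dataclasses.asdict() on every Object, so the json dataclass timing includes that conversion cost.
- The script benchmarks serialization only, not deserialization.
- timeit.timeit runs each serialization ITERATIONS times and, by default, disables garbage collection while timing, which reduces noise; the script reports the mean over those runs rather than the minimum.
- The data volume (2,000 objects with 10 members each, 20,000 Member instances in total) is large enough to produce measurable serialization latencies.
- The output size printed at the start is measured with orjson on the dictionary representation and serves only as a reference for the payload size.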

Interaction with Other System Components

This script is part of the benchmarking and performance testing suite that compares JSON serialization libraries and approaches.
It complements other benchmark scripts that measure deserialization or different serialization options.
It uses the tabulate library to present results clearly but does not depend on the benchmark framework (pytest-benchmark) used in other parts.
The script’s results can inform optimization decisions for JSON serialization in the larger project, especially regarding dataclass usage.
It indirectly validates the efficiency of the Rust-backed orjson library as part of the JSON serialization ecosystem.

Mermaid Diagram: Class Structure and Serialization Flow

classDiagram
    class Member {
        +int id
        +bool active
    }
    class Object {
        +int id
        +str name
        +list~Member~ members
    }
    Object *-- Member : contains
    class pydataclass_script {
        +objects_as_dataclass: list~Object~
        +objects_as_dict: list~dict~
        +default(obj): dict
        +per_iter_latency(val): float
        +main()
    }

Summary

The `pydataclass` script benchmarks JSON serialization of dataclasses and dictionaries with two JSON libraries (`orjson` and `json`). It highlights the performance benefit of `orjson`'s native dataclass serialization and provides a tabulated comparison of serialization latencies. The script is a practical tool for understanding the serialization overhead of Python dataclasses and for demonstrating the efficiency of a Rust-backed JSON library on a synthetic nested-data workload.