pydataclass
Overview
The `pydataclass` script is a Python benchmarking utility designed to measure and compare the serialization performance of Python dataclasses versus dictionaries using two JSON libraries: the standard `json` module and the high-performance Rust-backed `orjson` library. It benchmarks the time taken to serialize a large collection of complex nested objects, represented both as Python dataclasses and as dictionaries, repeatedly over multiple iterations.
The key goals of this script are:
To demonstrate the serialization overhead differences between dataclass instances and equivalent dictionary representations.
To compare the speed of serialization across two distinct JSON libraries.
To highlight the efficiency of `orjson`'s optimized serialization of dataclasses with the `OPT_SERIALIZE_DATACLASS` option.
The script outputs the serialized data size in KiB and presents a table summarizing the serialization latency (in milliseconds) for each library and data format, along with relative performance comparisons against `orjson`.
Detailed Explanation
Imports and Setup
`dataclasses`: Provides the `@dataclass` decorator and related utilities for defining lightweight data containers.
`io`: Used to create an in-memory string buffer for tabulated output.
`json`: Python’s built-in JSON serialization/deserialization module.
`os`: Used here to set CPU affinity to cores `{0, 1}` for consistent benchmarking.
`timeit`: Used to measure execution time of serialization functions.
`tabulate`: Formats output results into a readable table.
`orjson`: A fast JSON library implemented in Rust, exposing optimized serialization APIs.
os.sched_setaffinity(os.getpid(), {0, 1})
*This restricts the process to CPU cores 0 and 1 to reduce variability in timing results.*
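`os.sched_setaffinity` exists only on some Unix platforms (notably Linux), so the call above raises `AttributeError` on macOS or Windows. The following is a minimal sketch of a guarded variant; the helper name `pin_to_cores` is hypothetical and not part of the script.

```python
import os


def pin_to_cores(cores):
    """Pin the current process to `cores` where the platform supports it.

    Returns the set of cores actually applied, or None if pinning was
    skipped. Hypothetical helper, not part of the original script.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows
    # Only request cores the process is currently allowed to run on.
    wanted = set(cores) & os.sched_getaffinity(0)
    if not wanted:
        return None
    try:
        os.sched_setaffinity(0, wanted)  # pid 0 means the calling process
    except OSError:
        return None
    return wanted


applied = pin_to_cores({0, 1})
print(f"pinned to: {applied}")
```

Returning `None` instead of raising lets a benchmark still run on unsupported platforms, at the cost of noisier timings.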
Classes
Member
@dataclasses.dataclass
class Member:
    id: int
    active: bool
Purpose: Represents a member entity with an integer ID and an active status flag.
Fields:
`id` (int): Unique identifier for the member.
`active` (bool): Status indicating if the member is active.
Object
@dataclasses.dataclass
class Object:
    id: int
    name: str
    members: list[Member]
Purpose: Represents a complex object containing an ID, a name string, and a list of `Member` instances.
Fields:
`id` (int): Unique identifier for the object.
`name` (str): A string name, built by repeating the decimal ID three times (for example `100000100000100000`).
`members` (list of `Member`): List of associated members.
Data Preparation
`objects_as_dataclass`: A list of 2,000 `Object` instances, each containing 10 `Member` instances with `active=True`.
objects_as_dataclass = [
    Object(i, str(i) * 3, [Member(j, True) for j in range(10)])
    for i in range(100000, 102000)
]
`objects_as_dict`: The equivalent list of dictionaries obtained by converting each dataclass instance to a dictionary using `dataclasses.asdict()`.
objects_as_dict = [dataclasses.asdict(each) for each in objects_as_dataclass]
This setup creates two structurally equivalent datasets encoded differently: one as dataclasses and one as dictionaries.
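To make the equivalence concrete, the sketch below builds a much smaller dataset (two objects with two members each, sizes chosen only for illustration) and shows that `dataclasses.asdict()` converts nested dataclasses recursively. The names `small_as_dataclass` and `small_as_dict` are illustrative.

```python
import dataclasses


@dataclasses.dataclass
class Member:
    id: int
    active: bool


@dataclasses.dataclass
class Object:
    id: int
    name: str
    members: list[Member]


# Same construction as the script, scaled down for readability.
small_as_dataclass = [
    Object(i, str(i) * 3, [Member(j, True) for j in range(2)])
    for i in range(100000, 100002)
]

# asdict() recurses into the nested Member instances as well.
small_as_dict = [dataclasses.asdict(each) for each in small_as_dataclass]

print(small_as_dict[0])
```

Each `Member` inside `members` becomes a plain dictionary too, so the result contains only built-in types that any JSON library can serialize.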
Serialization Size Output
The script first outputs the size of the serialized data (in KiB) when serialized using `orjson` on the dictionary representation:
output_in_kib = len(orjson.dumps(objects_as_dict)) / 1024
print(f"{output_in_kib:,.0f}KiB output (orjson)")
This gives a rough sense of the serialized JSON payload size for the benchmarked data.
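The same size calculation can be sketched with the standard library alone. This is an approximation under stated assumptions: `json.dumps` with compact separators stands in for `orjson.dumps`, which emits no whitespace, and the payload is a small illustrative list rather than the benchmark data.

```python
import json

# Illustrative payload, not the benchmark dataset.
payload = [{"id": 1, "active": True}] * 100

# Compact separators drop the spaces json.dumps inserts by default.
encoded = json.dumps(payload, separators=(",", ":")).encode("utf-8")
output_in_kib = len(encoded) / 1024

# ",.0f" rounds to a whole number and adds thousands separators.
label = f"{output_in_kib:,.0f}KiB output (json, compact)"
print(label)
```

Because the value is rounded to a whole number of KiB, the figure is best read as an order of magnitude rather than an exact byte count.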
Custom Default Function for json.dumps
def default(__obj):
    if dataclasses.is_dataclass(__obj):
        return dataclasses.asdict(__obj)
Used as a fallback serializer for `json.dumps` to convert dataclass instances to dictionaries, since the standard `json` module cannot serialize dataclasses directly.
Returns the dictionary representation if the object is a dataclass. For any other type the function implicitly returns `None`, which `json.dumps` then encodes as `null`.
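The sketch below shows the fallback in use with `json.dumps`, along with a stricter variant for comparison. `strict_default` is a hypothetical alternative, not part of the script; it raises `TypeError` for unsupported types rather than returning `None`.

```python
import dataclasses
import json


@dataclasses.dataclass
class Member:
    id: int
    active: bool


def default(obj):
    # Same logic as the script's fallback: implicit None for other types.
    if dataclasses.is_dataclass(obj):
        return dataclasses.asdict(obj)


def strict_default(obj):
    # Hypothetical stricter variant: accept dataclass instances only and
    # reject everything else instead of emitting null.
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return dataclasses.asdict(obj)
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


# A dataclass instance is converted through the fallback.
print(json.dumps([Member(1, True)], default=default))

# A set is not a dataclass, so the script-style fallback yields null.
print(json.dumps({"tags": {1, 2}}, default=default))

try:
    json.dumps({"tags": {1, 2}}, default=strict_default)
except TypeError as exc:
    print(f"rejected: {exc}")
```

For the benchmark data the two behave identically, since every non-built-in value is a dataclass instance; the difference only matters for inputs that contain other unsupported types.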
Benchmark Setup
Headers for output table:
headers = ("Library", "dict (ms)", "dataclass (ms)", "vs. orjson")
Libraries to benchmark:
LIBRARIES = ("orjson", "json")
Number of iterations per benchmark:
ITERATIONS = 100
Utility Function: per_iter_latency
def per_iter_latency(val):
    if val is None:
        return None
    return (val * 1000) / ITERATIONS
Converts total timing results (seconds) from `timeit` into average latency per iteration in milliseconds.
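As a worked example of the conversion: `timeit` reports the total time in seconds for all runs, so a total of 1.5 s over 100 iterations is (1.5 × 1000) / 100 = 15 ms per iteration. The timed workload below (`sum(range(1000))`) is only a stand-in for a serialization call.

```python
from timeit import timeit

ITERATIONS = 100


def per_iter_latency(val):
    if val is None:
        return None
    return (val * 1000) / ITERATIONS


# 1.5 s in total across 100 runs is 15 ms per run.
print(per_iter_latency(1.5))

# A missing measurement stays missing.
print(per_iter_latency(None))

# timeit returns the total time in seconds for `number` calls.
total_seconds = timeit(lambda: sum(range(1000)), number=ITERATIONS)
print(f"{per_iter_latency(total_seconds):.4f} ms per iteration")
```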
Benchmarking Loop
For each library in `LIBRARIES`, the script measures serialization latency for:
Dictionary data: serializing `objects_as_dict`.
Dataclass data: serializing `objects_as_dataclass`.
For `json`:
Uses `json.dumps` with fallback `default` for dataclasses.
Encodes output to UTF-8 bytes so the result is comparable with `orjson.dumps`, which returns `bytes`.
For `orjson`:
Uses `orjson.dumps` directly for dictionaries.
Uses `orjson.dumps` with the `OPT_SERIALIZE_DATACLASS` option for dataclasses to enable native dataclass serialization.
Benchmark times are measured with `timeit.timeit` over `ITERATIONS`.
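The two timing paths can be sketched as follows. This is an illustration under assumptions, not the script's exact code: the dataset is reduced to 50 objects, `ITERATIONS` is lowered to 5, the helper names `bench_json` and `bench_orjson` are hypothetical, and the `orjson` path is skipped when the library is not installed.

```python
import dataclasses
import json
from timeit import timeit

ITERATIONS = 5  # the script uses 100; reduced here to keep the sketch fast


@dataclasses.dataclass
class Member:
    id: int
    active: bool


@dataclasses.dataclass
class Object:
    id: int
    name: str
    members: list[Member]


as_dataclass = [
    Object(i, str(i) * 3, [Member(j, True) for j in range(10)])
    for i in range(100000, 100050)
]
as_dict = [dataclasses.asdict(each) for each in as_dataclass]


def default(obj):
    if dataclasses.is_dataclass(obj):
        return dataclasses.asdict(obj)


def bench_json():
    """Return (dict_seconds, dataclass_seconds) for the stdlib json module."""
    dict_s = timeit(
        lambda: json.dumps(as_dict).encode("utf-8"), number=ITERATIONS
    )
    dataclass_s = timeit(
        lambda: json.dumps(as_dataclass, default=default).encode("utf-8"),
        number=ITERATIONS,
    )
    return dict_s, dataclass_s


def bench_orjson():
    """Same measurement for orjson, or None if it is not installed."""
    try:
        import orjson  # third-party dependency
    except ImportError:
        return None
    dict_s = timeit(lambda: orjson.dumps(as_dict), number=ITERATIONS)
    dataclass_s = timeit(
        lambda: orjson.dumps(as_dataclass, option=orjson.OPT_SERIALIZE_DATACLASS),
        number=ITERATIONS,
    )
    return dict_s, dataclass_s


print("json:", bench_json())
print("orjson:", bench_orjson())
```

The flag is kept here to mirror the script. The `orjson` documentation for version 3 describes `OPT_SERIALIZE_DATACLASS` as deprecated and without effect, since dataclasses are serialized natively by default in that version.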
Relative Performance Calculation
`compared_to_orjson` is set to `1` for `orjson` (baseline).
For `json`, it calculates the integer ratio of `json` dataclass serialization time over `orjson` dataclass serialization time.
This indicates how many times slower `json` is compared to `orjson`.
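A minimal sketch of this calculation, using the example latencies from the output table in this document. The function form is an assumption made for illustration; the description above only fixes the baseline value and the integer ratio.

```python
def compared_to_orjson(lib_name, dataclass_ms, orjson_dataclass_ms):
    """Hypothetical helper mirroring the ratio described above."""
    if lib_name == "orjson":
        return 1  # baseline
    # int() truncates toward zero, so 22.06 becomes 22 and 1.85 becomes 1.
    return int(dataclass_ms / orjson_dataclass_ms)


print(compared_to_orjson("orjson", 15.67, 15.67))
print(compared_to_orjson("json", 345.67, 15.67))
print(compared_to_orjson("json", 29.0, 15.67))
```

Because the ratio is truncated rather than rounded, a library that is 1.85 times slower is reported as `1`, so small differences between libraries are not visible in this column.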
Results Table Formatting and Output
Collected metrics are formatted and appended to a table.
The table is printed using `tabulate` in GitHub-flavored Markdown format.
Example output table:
| Library | dict (ms) | dataclass (ms) | vs. orjson |
|---------|-----------|----------------|------------|
| orjson | 12.34 | 15.67 | 1 |
| json | 234.56 | 345.67 | 22 |
Usage Example
Run the script directly:
./pydataclass
Expected output (example):
134KiB output (orjson)
| Library | dict (ms) | dataclass (ms) | vs. orjson |
|---------|-----------|----------------|------------|
| orjson | 12.34 | 15.67 | 1 |
| json | 234.56 | 345.67 | 22 |
Important Implementation Details
Setting CPU affinity limits execution to cores 0 and 1, which reduces timing variability between runs.
The dataclass serialization uses `orjson`'s `OPT_SERIALIZE_DATACLASS` flag, which serializes dataclass instances natively instead of converting them to dictionaries first.
The fallback `default` function for `json.dumps` is necessary because `json` does not natively support dataclasses.
The script benchmarks serialization only, not deserialization.
The use of `timeit.timeit` averages the timing over multiple iterations, which smooths out per-call noise.
The data volume (2,000 objects, each with 10 members) is large enough to produce measurable serialization latencies.
Output size measurement using `orjson` reports the serialized payload size for reference.
Interaction with Other System Components
This script is part of the benchmarking and performance testing suite that compares JSON serialization libraries and approaches.
It complements other benchmark scripts that measure deserialization or different serialization options.
It uses the `tabulate` library to present results clearly but does not depend on the benchmark framework (`pytest-benchmark`) used in other parts.
The script’s results can inform optimization decisions for JSON serialization in the larger project, especially regarding dataclass usage.
It indirectly validates the efficiency of the Rust-backed `orjson` library as part of the JSON serialization ecosystem.
Mermaid Diagram: Class Structure and Serialization Flow
classDiagram
class Member {
+int id
+bool active
}
class Object {
+int id
+str name
+list~Member~ members
}
Object *-- Member : contains
class pydataclass_script {
+objects_as_dataclass: list~Object~
+objects_as_dict: list~dict~
+default(obj): dict
+per_iter_latency(val): float
+main()
}
Summary
The `pydataclass` script benchmarks JSON serialization performance between dataclasses and dictionaries across two popular JSON libraries (`orjson` and `json`). It highlights the performance benefits of using `orjson`'s native dataclass serialization support and provides a clear, tabulated comparison of serialization latencies. The script is a practical tool to understand serialization overheads when using Python dataclasses and to demonstrate the efficiency of Rust-backed JSON libraries in real-world, nested data scenarios.