run_mem
Overview
The `run_mem` script measures the **memory consumption** of repeated JSON deserialization with different JSON libraries. It loads a compressed JSON fixture from an `.xz` file, deserializes it 100 times, and records the process's Resident Set Size (RSS) before and after the loop. It also verifies the **correctness** of a deserialize-then-serialize round trip to confirm that the data survives intact.
This script supports benchmarking memory usage for two JSON libraries:
- Python's built-in `json` module
- The `orjson` library (a high-performance, Rust-backed JSON library)
By quantifying memory growth after repeated loads, the script helps assess the memory efficiency and potential leaks or overheads associated with each library's deserialization process.
Detailed Explanation
Script Flow
1. Input arguments:
    - `filename`: Path to a compressed `.xz` JSON fixture file.
    - `lib_name`: The JSON library to benchmark (`"json"` or `"orjson"`).
2. Load fixture data: The JSON fixture is decompressed and read entirely into memory using the `lzma` module.
3. Select JSON library: Imports `dumps` and `loads` from the specified library.
4. Garbage collection: Calls `gc.collect()` to minimize noise from leftover garbage before memory measurement.
5. Memory measurement:
    - Uses `psutil.Process()` to get the current process memory info (RSS).
    - Records RSS memory before starting deserialization.
6. Repeated deserialization: Runs the `loads` function on the fixture 100 times.
7. Measure memory again: Records RSS memory after the deserialization loop.
8. Calculate difference: Computes the delta in memory usage (`mem_diff`).
9. Correctness check: Validates that deserializing and reserializing the fixture results in equivalent JSON data:
    - Loads the fixture with the standard `json` library.
    - Loads and dumps it with the tested library.
    - Compares whether the reconstructed JSON matches the original.
10. Output results: Prints three comma-separated values:
    - Memory before deserialization (in bytes)
    - Memory difference after the deserialization loop (in bytes)
    - Correctness flag (`1` if correct, `0` if incorrect)
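The fixture-loading step can be tried on a small file before pointing the script at a real fixture. The following is a minimal sketch using only the standard library; the helper names `write_fixture` and `read_fixture`, the sample data, and the temporary file name are assumptions made for illustration and are not part of `run_mem`.

```python
import json
import lzma
import os
import tempfile

# Hypothetical sample data; real benchmark fixtures are much larger.
sample = {"name": "example", "values": [1, 2, 3]}


def write_fixture(path, obj):
    """Serialize obj to JSON and write it as an .xz-compressed file."""
    with lzma.open(path, "wb") as fileh:
        fileh.write(json.dumps(obj).encode("utf-8"))


def read_fixture(path):
    """Decompress the whole fixture into memory and return it as bytes."""
    with lzma.open(path, "r") as fileh:
        return fileh.read()


with tempfile.TemporaryDirectory() as tmpdir:
    fixture_path = os.path.join(tmpdir, "sample.json.xz")
    write_fixture(fixture_path, sample)
    fixture = read_fixture(fixture_path)

# The fixture is raw bytes; json.loads accepts bytes directly.
print(type(fixture).__name__, json.loads(fixture) == sample)
```

Reading in mode `"r"` yields bytes rather than text, which is why the script can pass the fixture straight to `loads` for both libraries.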
Code Breakdown
#!/usr/bin/env python3
# SPDX-License-Identifier: (Apache-2.0 OR MIT)
import sys
import lzma
import gc
import psutil

filename = sys.argv[1]

# Load compressed fixture file
with lzma.open(filename, "r") as fileh:
    fixture = fileh.read()  # fixture is bytes

proc = psutil.Process()

lib_name = sys.argv[2]
if lib_name == "json":
    from json import dumps, loads
elif lib_name == "orjson":
    from orjson import dumps, loads
else:
    raise NotImplementedError

gc.collect()

# Measure memory before deserialization
mem_before = proc.memory_info().rss

# Deserialize fixture 100 times
for _ in range(100):
    val = loads(fixture)

# Measure memory after deserialization
mem_after = proc.memory_info().rss
mem_diff = mem_after - mem_before

# Correctness check using standard json for canonicalization
from json import loads as json_loads
correct = 1 if (json_loads(fixture) == json_loads(dumps(loads(fixture)))) else 0

# Print results
print(f"{mem_before},{mem_diff},{correct}")
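The measurement portion of the script can also be wrapped in a function so that it is reusable and testable in isolation. This is a sketch under stated assumptions, not how `run_mem` is organized: `measure_growth` and its `read_rss` parameter are hypothetical names, and the demonstration uses a fake RSS reader so that it runs without `psutil` installed.

```python
import gc
import json


def measure_growth(loads, data, read_rss, iterations=100):
    """Return (rss_before, rss_delta) around repeated calls to loads(data).

    read_rss is any zero-argument callable returning the current RSS in
    bytes. With psutil installed, one option would be:
        proc = psutil.Process()
        read_rss = lambda: proc.memory_info().rss
    """
    gc.collect()  # reduce noise from uncollected garbage
    before = read_rss()
    for _ in range(iterations):
        val = loads(data)  # rebinding val releases the previous result
    after = read_rss()
    return before, after - before


# Fake reader returning two canned values, purely for demonstration.
readings = iter([1_000_000, 1_004_096])
before, delta = measure_growth(json.loads, b'{"a": [1, 2, 3]}', lambda: next(readings))
print(before, delta)
```

Passing the reader in as a parameter keeps the loop logic independent of how memory is sampled, which makes it straightforward to substitute a real `psutil`-based reader later.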
Parameters
| Parameter | Description | Type | Example |
|---|---|---|---|
| `filename` | Path to the `.xz` compressed JSON fixture file | `str` | `data/github.json.xz` |
| `lib_name` | The JSON library to use for deserialization | `str` | `"json"` or `"orjson"` |
Output
Printed to standard output as a single line:
<mem_before>,<mem_diff>,<correct>
- `mem_before`: Resident Set Size (RSS) memory in bytes before deserialization.
- `mem_diff`: Difference in RSS memory after 100 deserializations.
- `correct`: `1` if deserialization and serialization produce equivalent JSON, else `0`.
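Because the output is a single comma-separated line, it is easy to consume from another program. The following is a minimal parsing sketch; `MemResult` and `parse_result` are hypothetical names introduced here for illustration.

```python
from typing import NamedTuple


class MemResult(NamedTuple):
    mem_before: int  # RSS in bytes before the loop
    mem_diff: int    # RSS change in bytes; may be negative
    correct: bool    # True when the round-trip check passed


def parse_result(line: str) -> MemResult:
    """Parse one 'mem_before,mem_diff,correct' line printed by the script."""
    before, diff, correct = line.strip().split(",")
    return MemResult(int(before), int(diff), correct == "1")


result = parse_result("12345678,102400,1\n")
print(result)
```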
Usage Example
Assuming the script is executable and named `run_mem`:
./run_mem data/github.json.xz json
Output example:
12345678,102400,1
This means:
Memory before deserialization: ~12.3 MB
Memory increased by ~100 KB after 100 deserializations
Correctness check passed
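To compare both libraries, the script can be invoked once per library from a small driver, so that each measurement starts in a fresh process. This is a hedged sketch: `run_benchmark` and `compare_libraries` are hypothetical helpers, and the demonstration substitutes a stand-in command for `run_mem` so that it runs without the script, `psutil`, or a fixture present.

```python
import subprocess
import sys


def run_benchmark(command):
    """Run a command that prints 'mem_before,mem_diff,correct' and parse it."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    before, diff, correct = completed.stdout.strip().split(",")
    return {"mem_before": int(before), "mem_diff": int(diff), "correct": correct == "1"}


def compare_libraries(script, fixture, libraries=("json", "orjson")):
    """Run the script once per library, each in its own process."""
    return {lib: run_benchmark([script, fixture, lib]) for lib in libraries}


# Stand-in command that mimics the script's output format.
fake_command = [sys.executable, "-c", "print('12345678,102400,1')"]
demo = run_benchmark(fake_command)
print(demo)
```

With the real script available, a call such as `compare_libraries("./run_mem", "data/github.json.xz")` would return one parsed result per library.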
Important Implementation Details
- Memory Measurement: Uses `psutil.Process().memory_info().rss` to read the Resident Set Size, the portion of the process's memory currently held in physical RAM.
- Garbage Collection: Invoking `gc.collect()` before the first measurement reduces memory noise from uncollected objects.
- Repeated Deserialization: Running 100 iterations amplifies memory changes, making small leaks or per-call overheads easier to detect. Because `val` is rebound on every iteration, each previous result is released once the next one is assigned, so `mem_diff` reflects what the process still holds after the loop (including the final result) rather than 100 live copies of the data.
- Dynamic Import: The script imports `dumps` and `loads` from the library named by the second argument, enabling easy comparison between different JSON parsing libraries.
- Correctness Verification: Uses the standard `json` module to parse both the original fixture and the round-tripped output into Python objects before comparing them, so the check is independent of library-specific formatting differences.
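The correctness check can be illustrated with the standard library alone. In this sketch, `roundtrip_ok`, `compact_dumps`, and `lossy_dumps` are hypothetical names, and the tiny inline fixture is made up for the example; the comparison expression itself mirrors the one in the script.

```python
import json

# Made-up fixture with irregular whitespace and key order.
fixture = b'{ "b": [1, 2.5, null],   "a": "text" }'


def roundtrip_ok(loads, dumps, data):
    """Compare parsed objects, so formatting differences do not matter."""
    return json.loads(data) == json.loads(dumps(loads(data)))


def compact_dumps(obj):
    """A serializer whose output formatting differs from the fixture."""
    return json.dumps(obj, separators=(",", ":"))


def lossy_dumps(obj):
    """A deliberately broken serializer that drops the key 'a'."""
    return json.dumps({k: v for k, v in obj.items() if k != "a"})


same_bytes = compact_dumps(json.loads(fixture)).encode("utf-8") == fixture
ok = roundtrip_ok(json.loads, compact_dumps, fixture)
bad = roundtrip_ok(json.loads, lossy_dumps, fixture)
print(same_bytes, ok, bad)
```

The compact output is not byte-identical to the fixture, yet the check passes because both sides are parsed before comparison; only an actual loss of data makes it fail.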
Interaction with Other Parts of the System
- JSON Fixtures: Reads JSON data compressed in `.xz` format, consistent with other benchmarking utilities in the system that use compressed fixtures for realistic data.
- JSON Libraries: Compares the memory behavior of `orjson` versus Python's built-in `json` library, consistent with the benchmarking framework's goal of evaluating performance and correctness across JSON implementations.
- Benchmarking Framework: Complements timing benchmarks (e.g., `benchmark_loads.py`) by focusing on memory profiling.
- Profiling Tools: Uses `psutil` for memory profiling, consistent with other scripts that use system-level process introspection.
- Correctness Tests: Uses standard JSON loads/dumps as a baseline for correctness, harmonizing validation across benchmarking scripts.
Visual Diagram: Flowchart of Memory Benchmarking Workflow
flowchart TD
A["Start: receive filename, lib_name"] --> B["Load compressed JSON fixture (.xz)"]
B --> C[Import dumps, loads from chosen library]
C --> D[Run garbage collection]
D --> E["Measure initial RSS memory (mem_before)"]
E --> F[Loop 100 times: Deserialize JSON fixture]
F --> G["Measure final RSS memory (mem_after)"]
G --> H["Calculate memory difference (mem_diff = mem_after - mem_before)"]
H --> I[Check correctness by round-trip serialization]
I --> J{Is JSON data equivalent?}
J -- Yes --> K[Set correct = 1]
J -- No --> L[Set correct = 0]
K --> M[Print mem_before, mem_diff, correct]
L --> M
Summary
The `run_mem` script is a focused benchmarking tool: it measures how the process's RSS changes across 100 repeated JSON deserializations with a chosen library, and it checks that a round trip through that library preserves the data. Together with the timing benchmarks in the same framework, it supports comparing JSON libraries on speed, memory, and correctness.