run_mem

Overview

The `run_mem` script measures the **memory consumption** of repeated JSON deserialization with different JSON libraries. It loads a JSON fixture from an `.xz`-compressed file, deserializes it 100 times, and records the process's Resident Set Size (RSS) before and after the loop. It also checks **correctness** by verifying that a deserialize-then-serialize round trip preserves the data.

This script supports benchmarking memory usage for two JSON libraries:

- Python’s built-in `json` module
- The `orjson` library (a high-performance Rust-backed JSON library)

By quantifying memory growth after repeated loads, the script helps assess the memory efficiency and potential leaks or overheads associated with each library's deserialization process.

Detailed Explanation

Script Flow

Input arguments:
- filename: Path to a compressed .xz JSON fixture file.
- lib_name: The JSON library to benchmark ("json" or "orjson").
Load fixture data:
- The JSON fixture is decompressed and read entirely into memory using the lzma module.
Select JSON library:
- Dynamically imports dumps and loads functions from the specified library.
Garbage collection:
- Calls gc.collect() to minimize noise from leftover garbage before memory measurement.
Memory measurement:
- Uses psutil.Process() to get the current process memory info (RSS).
- Records RSS memory before starting deserialization.
Repeated deserialization:
- Runs the loads function on the fixture 100 times.
Measure memory again:
- Records RSS memory after deserialization loop.
Calculate difference:
- Computes the delta in memory usage (mem_diff).
Correctness check:
- Validates that deserializing and reserializing the fixture results in equivalent JSON data:
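  - Parses the fixture with the standard json library to obtain a reference value.
  - Round-trips the fixture through the tested library (loads, then dumps).
  - Parses the round-tripped output with the standard json library and compares it with the reference value.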
Output results:
- Prints three comma-separated values:
  - Memory before deserialization (in bytes)
  - Memory difference after deserialization loop (in bytes)
  - Correctness flag (1 if correct, 0 if incorrect)
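
The measurement steps above can be sketched as a small helper. This is a minimal illustration rather than the script's own code: `measure_rss_growth`, `read_rss`, and `psutil_rss` are hypothetical names, and the RSS reader is passed in as a parameter so the helper can be exercised without psutil installed.

```python
import gc
from typing import Callable, Tuple


def psutil_rss() -> int:
    """Read the current process RSS in bytes (assumes psutil is installed)."""
    import psutil  # imported lazily so the helper below works without it

    return psutil.Process().memory_info().rss


def measure_rss_growth(
    work: Callable[[], object],
    read_rss: Callable[[], int],
    iterations: int = 100,
) -> Tuple[int, int]:
    """Return (rss_before, rss_after - rss_before) around repeated calls to work."""
    gc.collect()  # clear leftover garbage before taking the baseline
    before = read_rss()
    for _ in range(iterations):
        # Rebinding the name drops the previous result, as in the script's loop.
        result = work()
    after = read_rss()
    return before, after - before
```

With psutil available, a call such as `measure_rss_growth(lambda: loads(fixture), psutil_rss)` would mirror the script's loop.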

Code Breakdown

#!/usr/bin/env python3
# SPDX-License-Identifier: (Apache-2.0 OR MIT)

import sys
import lzma
import gc
import psutil

filename = sys.argv[1]

# Load compressed fixture file
with lzma.open(filename, "r") as fileh:
    fixture = fileh.read()  # fixture is bytes

proc = psutil.Process()

lib_name = sys.argv[2]
if lib_name == "json":
    from json import dumps, loads
elif lib_name == "orjson":
    from orjson import dumps, loads
else:
    raise NotImplementedError

gc.collect()

# Measure memory before deserialization
mem_before = proc.memory_info().rss

# Deserialize fixture 100 times
for _ in range(100):
    val = loads(fixture)

# Measure memory after deserialization
mem_after = proc.memory_info().rss

mem_diff = mem_after - mem_before

# Correctness check using standard json for canonicalization
from json import loads as json_loads

correct = 1 if (json_loads(fixture) == json_loads(dumps(loads(fixture)))) else 0

# Print results
print(f"{mem_before},{mem_diff},{correct}")

Parameters

| Parameter | Description | Type | Example |
|---|---|---|---|
| `filename` | Path to the `.xz` compressed JSON fixture file | `str` | `data/github.json.xz` |
| `lib_name` | The JSON library to use for deserialization | `str` | `"json"` or `"orjson"` |
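
The script reads both parameters positionally from `sys.argv`. As a hedged alternative, the same interface could be validated with `argparse`. The parser below is a hypothetical sketch and not part of `run_mem`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring run_mem's two positional arguments."""
    parser = argparse.ArgumentParser(prog="run_mem")
    parser.add_argument("filename", help="path to the .xz compressed JSON fixture")
    parser.add_argument(
        "lib_name",
        choices=["json", "orjson"],  # the two libraries the script supports
        help="JSON library to benchmark",
    )
    return parser


args = build_parser().parse_args(["data/github.json.xz", "orjson"])
print(args.filename, args.lib_name)
```

With this approach, an unsupported library name fails with a usage message instead of the bare `NotImplementedError` that the script raises.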

Output

Printed to standard output as a single line:

<mem_before>,<mem_diff>,<correct>

mem_before: Resident Set Size (RSS) in bytes, measured before the deserialization loop.
mem_diff: RSS after the 100 deserializations minus mem_before, in bytes.
correct: 1 if the deserialize-then-serialize round trip produces equivalent JSON, else 0.
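
A caller that collects results from several runs might parse this line into named fields. The following is a minimal sketch; `MemResult` and `parse_result` are illustrative names that do not appear in the script.

```python
from typing import NamedTuple


class MemResult(NamedTuple):
    mem_before: int  # RSS in bytes before the loop
    mem_diff: int    # RSS change in bytes across the loop
    correct: bool    # True when the round trip matched


def parse_result(line: str) -> MemResult:
    """Parse one '<mem_before>,<mem_diff>,<correct>' line."""
    mem_before, mem_diff, correct = line.strip().split(",")
    return MemResult(int(mem_before), int(mem_diff), correct == "1")


result = parse_result("12345678,102400,1\n")
print(result)
```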

Usage Example

Assuming the script is executable and named `run_mem`:

./run_mem data/github.json.xz json

Output example:

12345678,102400,1

This means:

RSS before deserialization: ~12.3 MB (12,345,678 bytes)
RSS grew by 100 KiB (102,400 bytes) across the 100 deserializations
Correctness check passed
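
The figures above follow from simple unit conversion. The snippet below works through the arithmetic for the sample line; the helper names are illustrative only.

```python
def bytes_to_mb(n: int) -> float:
    """Convert bytes to megabytes (1 MB = 1,000,000 bytes)."""
    return n / 1_000_000


def bytes_to_kib(n: int) -> float:
    """Convert bytes to kibibytes (1 KiB = 1,024 bytes)."""
    return n / 1024


mem_before, mem_diff = 12345678, 102400
print(f"{bytes_to_mb(mem_before):.1f} MB")   # 12.3 MB
print(f"{bytes_to_kib(mem_diff):.0f} KiB")   # 100 KiB
```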

Important Implementation Details

Memory Measurement: Uses psutil.Process().memory_info().rss to read the Resident Set Size, the portion of the process's memory currently held in physical RAM (not swapped out).
Garbage Collection: Invoking gc.collect() before measurement reduces memory noise from uncollected objects.
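Repeated Deserialization: Running 100 iterations amplifies per-call memory growth, making small leaks or overheads easier to detect. Because val is rebound on every iteration, only the most recent result stays referenced, so growth well beyond the size of one deserialized fixture suggests memory that was leaked or not returned to the operating system.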
Dynamic Import: The script dynamically imports dumps and loads based on the library argument, enabling easy comparison between different JSON parsing libraries.
Correctness Verification: Uses the standard json module to canonicalize JSON data before and after round-trip serialization, ensuring output equivalence regardless of library-specific formatting differences.
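
The round-trip comparison in the last point can be illustrated with the standard library alone. This sketch assumes a hypothetical helper, `round_trip_ok`, that takes the tested library's `loads` and `dumps` as arguments. It shows that parsing both sides with `json.loads` makes the comparison insensitive to formatting differences while still catching lost data.

```python
import json
from typing import Callable


def round_trip_ok(fixture: bytes, loads: Callable, dumps: Callable) -> bool:
    """True when loads -> dumps preserves the data, judged by json.loads."""
    reference = json.loads(fixture)
    round_tripped = json.loads(dumps(loads(fixture)))
    return reference == round_tripped


fixture = b'{"b": [1, 2.5, null], "a": "x"}'


def compact_dumps(obj):
    """Serialize without spaces and with sorted keys: different text, same data."""
    return json.dumps(obj, separators=(",", ":"), sort_keys=True)


def lossy_dumps(obj):
    """Deliberately broken serializer that drops the "a" key."""
    return json.dumps({k: v for k, v in obj.items() if k != "a"})


print(round_trip_ok(fixture, json.loads, compact_dumps))  # True
print(round_trip_ok(fixture, json.loads, lossy_dumps))    # False
```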

Interaction with Other Parts of the System

JSON Fixtures: Reads JSON data compressed in .xz format, consistent with other benchmarking utilities in the system that use compressed fixtures for realistic data.
JSON Libraries: Compares the memory behavior of orjson versus Python's built-in json library, consistent with the benchmarking framework's goal of evaluating performance and correctness across JSON implementations.
Benchmarking Framework: Complements timing benchmarks (e.g., benchmark_loads.py) by focusing on memory profiling.
Profiling Tools: Uses psutil for memory profiling, consistent with other scripts that use system-level process introspection.
Correctness Tests: Uses standard JSON loads/dumps as a baseline for correctness, harmonizing validation across benchmarking scripts.

Visual Diagram: Flowchart of Memory Benchmarking Workflow

flowchart TD
    A["Start: Receive filename, lib_name"] --> B["Load compressed JSON fixture (.xz)"]
    B --> C["Import dumps, loads from chosen library"]
    C --> D["Run garbage collection"]
    D --> E["Measure initial RSS memory (mem_before)"]
    E --> F["Loop 100 times: Deserialize JSON fixture"]
    F --> G["Measure final RSS memory (mem_after)"]
    G --> H["Calculate memory difference (mem_diff = mem_after - mem_before)"]
    H --> I[Check correctness by round-trip serialization]
    I --> J{Is JSON data equivalent?}
    J -- Yes --> K[Set correct = 1]
    J -- No --> L[Set correct = 0]
    K --> M[Print mem_before, mem_diff, correct]
    L --> M

Summary

The `run_mem` script is a focused tool for measuring how repeated JSON deserialization affects memory consumption across libraries. It reports the baseline RSS, the RSS growth after 100 loads, and a round-trip correctness flag, so that json and orjson can be compared on the same fixture. It complements the timing benchmarks in the larger benchmarking suite, which together cover speed, memory, and correctness.
