pyunicode_new.rs

Overview

The [pyunicode_new.rs](/projects/287/67765) file provides low-level Rust functions to create new Python Unicode objects with different character widths using the Python C API (via FFI). It is part of a Rust binding or extension module interfacing directly with Python's internal Unicode representation structures. The primary purpose of this file is to allocate and initialize Python Unicode objects efficiently, depending on the kind of characters contained (ASCII, one-byte, two-byte, or four-byte Unicode).

The functions in this file build Python Unicode objects by calling the CPython C API function `PyUnicode_New` (a public function, though not part of the limited API) and then copying the character data directly into the object's character buffer. This gives Rust code precise control over the storage width and memory layout of the resulting string, which matters in performance-sensitive or low-level Python interop scenarios.
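Choosing among the four constructors requires knowing the largest code point in the input. This file does not show that dispatch, so the following is only a minimal sketch of how a caller might pick a width; the `Width` enum and `narrowest_width` function are hypothetical names used for illustration.

```rust
/// Storage widths a compact Python string can use (hypothetical enum).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Width {
    Ascii,
    OneByte,
    TwoByte,
    FourByte,
}

/// Pick the narrowest width able to hold every code point in `s`.
/// An empty string is treated as ASCII.
fn narrowest_width(s: &str) -> Width {
    // Largest code point in the string, or 0 if the string is empty.
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    match max {
        0..=0x7F => Width::Ascii,
        0x80..=0xFF => Width::OneByte,
        0x100..=0xFFFF => Width::TwoByte,
        _ => Width::FourByte,
    }
}
```

A caller could match on the returned `Width` to decide which of the four functions documented below to invoke.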

Detailed Explanation of Functions

Macro: `validate_str!`

macro_rules! validate_str {
    ($ptr:expr) => {
        #[cfg(CPython)]
        debug_assert!(ffi!(_PyUnicode_CheckConsistency($ptr.cast::<PyObject>(), 1)) == 1);
    };
}

Purpose: A debugging macro to validate the internal consistency of the newly created Unicode objects.
Parameters: $ptr - pointer to the Unicode object.
Behavior: When the CPython cfg flag is set, it calls the internal _PyUnicode_CheckConsistency function and asserts that it returns 1. Because it uses debug_assert!, the check runs only in builds with debug assertions enabled.
Return: None (the macro expands to a debug assertion statement).
Usage: Called after the Unicode object is created and initialized to catch issues early during development.
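The same pattern (a macro that wraps a debug-only consistency check) can be illustrated without linking against Python. In the sketch below, `is_consistent`, `validate_buf!`, and `build_terminated` are hypothetical stand-ins; the check is a simplified pure-Rust rule, not what `_PyUnicode_CheckConsistency` actually verifies.

```rust
/// Hypothetical consistency rule: an ASCII payload followed by one NUL byte.
fn is_consistent(data: &[u8]) -> bool {
    match data.split_last() {
        Some((&0, body)) => body.is_ascii(),
        _ => false,
    }
}

/// Debug-only validation; the check is compiled out when debug
/// assertions are disabled (for example in release builds).
macro_rules! validate_buf {
    ($buf:expr) => {
        debug_assert!(is_consistent($buf));
    };
}

/// Build a NUL-terminated copy of `text` and validate it before returning.
fn build_terminated(text: &str) -> Vec<u8> {
    let mut out = text.as_bytes().to_vec();
    out.push(0);
    validate_buf!(&out);
    out
}
```

Placing the validation at the end of the constructor keeps the cost out of optimized builds while still catching layout mistakes during development.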

Function: `pyunicode_ascii`

pub(crate) fn pyunicode_ascii(buf: *const u8, num_chars: usize) -> *mut pyo3_ffi::PyObject;

Purpose: Creates a new Python Unicode object containing ASCII characters.
Parameters:
- buf: Pointer to a buffer of ASCII bytes (u8).
- num_chars: Number of ASCII characters in the buffer.
Returns: Pointer to a new Python Unicode object (*mut PyObject).
Details:
- Calls PyUnicode_New with max character ordinal 127 (ASCII range).
- Copies the raw bytes directly into the Unicode object's internal ASCII data buffer.
- Null-terminates the string.
- Validates the resulting object with validate_str!.
Usage Example:

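// The Python interpreter must be initialized and the GIL held,
// because the function calls PyUnicode_New.
let ascii_bytes = b"hello";
let py_unicode_ptr = pyunicode_ascii(ascii_bytes.as_ptr(), ascii_bytes.len());
// py_unicode_ptr can now be used as a Python Unicode object in FFI calls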
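The copy-and-terminate step can be sketched in isolation. The example below writes into a plain `Vec<u8>` instead of a Python object's data area, so it runs without an interpreter; `copy_ascii` is a hypothetical helper, not a function from this file.

```rust
use core::ptr;

/// Copy ASCII bytes into a fresh buffer and append a NUL terminator.
/// The `Vec` stands in for the character storage of a Python string.
fn copy_ascii(src: &[u8]) -> Vec<u8> {
    debug_assert!(src.is_ascii());
    // One extra slot for the terminator.
    let mut out = vec![0u8; src.len() + 1];
    // SAFETY: `out` holds `src.len() + 1` bytes and is a fresh
    // allocation, so the two regions cannot overlap.
    unsafe { ptr::copy_nonoverlapping(src.as_ptr(), out.as_mut_ptr(), src.len()) };
    // Explicit terminator write, mirroring the null-termination step.
    out[src.len()] = 0;
    out
}
```

For ASCII input the byte count and the character count are equal, which is why a single bulk copy is sufficient here.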

Function: `pyunicode_onebyte`

pub(crate) fn pyunicode_onebyte(buf: &str, num_chars: usize) -> *mut pyo3_ffi::PyObject;

Purpose: Creates a Python Unicode object using one-byte (Latin-1) storage, for code points 0–255.
Parameters:
- buf: Rust string slice containing characters representable in one byte.
- num_chars: Number of characters in the string.
Returns: Pointer to a new Python Unicode object (*mut PyObject).
Details:
- Allocates a new Unicode object with max ordinal 255.
- Iterates over each character, casting it to u8, and writes it sequentially into the Unicode buffer.
- Null-terminates the string.
- Validates the object.
Usage Example:

let s = "héllo"; // 'é' is U+00E9, which fits in one byte (Latin-1)
let py_obj = pyunicode_onebyte(s, s.chars().count());
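Note that `num_chars` is the number of characters, not the UTF-8 byte length: "héllo" occupies 6 bytes in UTF-8 but has 5 characters. The sketch below narrows each character to one byte in a plain `Vec<u8>`. Unlike the cast described above, it checks the range and returns `None` on overflow; `to_latin1` is a hypothetical helper for illustration.

```rust
/// Narrow each `char` of `s` to a single byte and append a NUL terminator.
/// Returns `None` if any code point is above U+00FF.
fn to_latin1(s: &str) -> Option<Vec<u8>> {
    // The UTF-8 length is an upper bound on the character count.
    let mut out = Vec::with_capacity(s.len() + 1);
    for ch in s.chars() {
        // Checked conversion; an unchecked `as u8` would silently truncate.
        out.push(u8::try_from(u32::from(ch)).ok()?);
    }
    out.push(0);
    Some(out)
}
```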

Function: `pyunicode_twobyte`

pub(crate) fn pyunicode_twobyte(buf: &str, num_chars: usize) -> *mut pyo3_ffi::PyObject;

Purpose: Creates a Python Unicode object using a two-byte encoding (0–65535).
Parameters:
- buf: Rust string slice with characters representable in 16 bits.
- num_chars: Number of characters.
Returns: Pointer to a new Python Unicode object.
Details:
- Calls PyUnicode_New with max ordinal 65535.
- Writes each character as a u16 into the Unicode buffer.
- Null-terminates the string.
- Validates the object for consistency.
Usage Example:

let s = "こんにちは";  // Japanese characters within u16 range
let py_obj = pyunicode_twobyte(s, s.chars().count());
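The two-byte case follows the same shape with `u16` elements. As a rough sketch under the same assumptions as before (a plain `Vec` instead of a Python object, and a hypothetical helper name `to_ucs2`), each character becomes exactly one 16-bit unit:

```rust
/// Store each `char` of `s` as one `u16` and append a zero terminator.
/// Returns `None` if any code point lies outside the BMP (above U+FFFF).
fn to_ucs2(s: &str) -> Option<Vec<u16>> {
    let mut out = Vec::with_capacity(s.chars().count() + 1);
    for ch in s.chars() {
        // One code point per unit: no surrogate pairs are produced.
        out.push(u16::try_from(u32::from(ch)).ok()?);
    }
    out.push(0);
    Some(out)
}
```

The five hiragana in the usage example take 15 bytes in UTF-8 but only five `u16` units (plus the terminator) in this layout.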

Function: `pyunicode_fourbyte`

pub(crate) fn pyunicode_fourbyte(buf: &str, num_chars: usize) -> *mut pyo3_ffi::PyObject;

Purpose: Creates a Python Unicode object using a four-byte encoding (up to Unicode max code point 1,114,111).
Parameters:
- buf: Rust string slice with characters possibly needing 32-bit representation.
- num_chars: Number of characters.
Returns: Pointer to new Python Unicode object.
Details:
- Calls PyUnicode_New with max ordinal 1,114,111 (max Unicode scalar value).
- Writes each character as a u32.
- Null-terminates the string.
- Validates the object for internal consistency.
Usage Example:

let s = "𐍈"; // Gothic letter hwair (U+10348), outside the BMP
let py_obj = pyunicode_fourbyte(s, s.chars().count());
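Four-byte storage differs from UTF-16 in that a character outside the BMP occupies a single 32-bit unit rather than a surrogate pair. The sketch below makes that concrete with a plain `Vec<u32>`; `to_ucs4` is a hypothetical helper, not part of this file.

```rust
/// Store each `char` of `s` as one `u32` and append a zero terminator.
/// Every Unicode scalar value fits, so this conversion cannot fail.
fn to_ucs4(s: &str) -> Vec<u32> {
    let mut out: Vec<u32> = s.chars().map(u32::from).collect();
    out.push(0);
    out
}
```

For the Gothic letter in the usage example, UTF-8 needs 4 bytes and UTF-16 needs 2 units, while this layout needs exactly one `u32` (0x10348) plus the terminator.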

Important Implementation Details

Unsafe Code: All functions use unsafe Rust code to interface directly with Python's C API and to manipulate raw pointers for performance and control.
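Memory Layout: The object pointer is cast to the header type (PyASCIIObject for ASCII strings, PyCompactUnicodeObject otherwise) and offset by 1. Because pointer arithmetic is in units of the pointee type, this advances by the size of one header struct and lands on the start of the character data, where the characters are written directly.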
Null Termination: Explicitly writes a null terminator after the last character, aligning with Python's internal string representation.
Character Width Handling: Python internally optimizes Unicode storage based on the maximum ordinal in the string (ASCII, Latin-1, UCS-2, UCS-4). These functions correspond to these internal representations.
Validation: Uses Python's internal consistency check in debug builds to ensure correctness.
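The header-skipping pointer arithmetic can be demonstrated with mock types. In the sketch below, `MockHeader` and `MockObject` are hypothetical stand-ins whose fields do not match the real `PyASCIIObject`; the point is only that advancing a header-typed pointer by one element reaches the data that follows the header.

```rust
/// Hypothetical stand-in for a string header (not the real CPython layout).
#[repr(C)]
struct MockHeader {
    refcnt: isize,
    length: isize,
    hash: isize,
}

/// A header immediately followed by character storage.
#[repr(C)]
struct MockObject {
    header: MockHeader,
    data: [u8; 8],
}

/// Return a pointer to the first byte after the header.
fn data_ptr(obj: &mut MockObject) -> *mut u8 {
    let header = (obj as *mut MockObject).cast::<MockHeader>();
    // SAFETY: one header-sized step stays inside the `MockObject`
    // allocation; with `repr(C)` it lands exactly on `data`.
    unsafe { header.add(1).cast::<u8>() }
}

/// Write `text` plus a NUL terminator into the storage after the header.
fn write_ascii(obj: &mut MockObject, text: &[u8]) {
    assert!(text.len() < obj.data.len());
    let dst = data_ptr(obj);
    // SAFETY: the assert guarantees room for `text` and the terminator.
    unsafe {
        core::ptr::copy_nonoverlapping(text.as_ptr(), dst, text.len());
        dst.add(text.len()).write(0);
    }
}
```

The `repr(C)` attribute matters here: without it, Rust is free to reorder fields, and the data would not be guaranteed to sit directly after the header.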

Interaction with Other Parts of the System

FFI Layer: Relies on pyo3_ffi and its types (PyObject, PyASCIIObject, etc.) and the ffi! macro to call Python C API functions.
Utility Functions: Uses usize_to_isize from a local util crate/module to convert sizes safely for Python API calls.
Python Unicode Internals: Directly manipulates CPython's internal Unicode structures, so the file is tightly coupled to CPython's Unicode implementation and its memory layout.
Higher-level Python Bindings: These low-level functions likely serve as building blocks for higher-level Rust functions or PyO3 bindings that expose string creation to safe Rust or Python code.
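The size conversion is needed because Python's C API takes lengths as the signed type `Py_ssize_t`, while Rust lengths are `usize`. The body of `usize_to_isize` is not shown in this documentation, so the following is only a sketch of what a checked conversion could look like; `checked_usize_to_isize` is a hypothetical name, and the real helper may behave differently (for example, by casting unconditionally).

```rust
/// Convert a Rust length to a signed size, returning `None` if the
/// value does not fit in `isize`.
fn checked_usize_to_isize(n: usize) -> Option<isize> {
    isize::try_from(n).ok()
}
```

In practice a Rust slice or `&str` can never be longer than `isize::MAX` bytes, so for lengths taken from real buffers the conversion always succeeds.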

Visual Diagram

flowchart TD
    A["Input: Raw Buffer (ASCII / &str)"] --> B["pyunicode_ascii / pyunicode_onebyte / pyunicode_twobyte / pyunicode_fourbyte"]
    B --> C["Call PyUnicode_New(maxchar)"]
    C --> D["Get pointer to Unicode buffer after header"]
    D --> E["Copy characters into buffer (u8, u16, or u32)"]
    E --> F["Write null terminator"]
    F --> G["validate_str! to check consistency"]
    G --> H["Return pointer to PyObject"]

Summary

[pyunicode_new.rs](/projects/287/67765) implements specialized constructors for Python Unicode objects at the FFI layer in Rust, one for each storage width (ASCII, one-byte, two-byte, four-byte). Each constructor allocates the object with `PyUnicode_New`, writes the characters directly into the memory that follows the object header, null-terminates the data, and validates the result with a debug-time consistency check. The file relies on unsafe pointer manipulation and on CPython's internal string layout, and it serves as a foundational utility for the parts of the system that convert Rust string data into Python strings.