pyunicode_new.rs

Overview

The [pyunicode_new.rs](/projects/287/67765) file provides low-level Rust functions to create new Python Unicode objects with different character widths using the Python C API (via FFI). It is part of a Rust binding or extension module interfacing directly with Python's internal Unicode representation structures. The primary purpose of this file is to allocate and initialize Python Unicode objects efficiently, depending on the kind of characters contained (ASCII, one-byte, two-byte, or four-byte Unicode).

The functions in this file build Python Unicode objects by calling the CPython C API function `PyUnicode_New` and then copying the character data directly into the object's data area, using the element width (u8, u16, or u32) that matches Python's compact Unicode representation. This gives Rust code precise control over the storage width and memory layout of the resulting string, which matters in performance-sensitive or low-level Python interop scenarios.
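To make the choice between the four widths concrete, here is a minimal sketch of how a caller might pick one from the largest code point in a string. The `Width` enum and the `classify` and `maxchar` helpers are hypothetical names used only for illustration and are not part of this file. The `maxchar` values follow the sequence that the CPython documentation for `PyUnicode_New` lists (127, 255, 65535, 1114111).

```rust
/// Hypothetical enum naming the four storage widths described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Width {
    Ascii,
    OneByte,
    TwoByte,
    FourByte,
}

/// Pick the narrowest width that can hold every code point in `s`.
/// An empty string is treated as ASCII.
pub fn classify(s: &str) -> Width {
    let max = s.chars().map(u32::from).max().unwrap_or(0);
    match max {
        0..=0x7F => Width::Ascii,
        0x80..=0xFF => Width::OneByte,
        0x100..=0xFFFF => Width::TwoByte,
        _ => Width::FourByte,
    }
}

/// The `maxchar` hint that would accompany each width in a `PyUnicode_New` call.
pub fn maxchar(width: Width) -> u32 {
    match width {
        Width::Ascii => 127,
        Width::OneByte => 255,
        Width::TwoByte => 65_535,
        Width::FourByte => 1_114_111,
    }
}
```

Under these assumptions, `classify("hello")` yields `Width::Ascii` and a string containing `é` yields `Width::OneByte`; a caller would then dispatch to the matching constructor.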


Detailed Explanation of Functions

Macro: validate_str!

macro_rules! validate_str {
    ($ptr:expr) => {
        #[cfg(CPython)]
        debug_assert!(ffi!(_PyUnicode_CheckConsistency($ptr.cast::<PyObject>(), 1)) == 1);
    };
}

Function: pyunicode_ascii

// Signature
pub(crate) fn pyunicode_ascii(buf: *const u8, num_chars: usize) -> *mut pyo3_ffi::PyObject;
// Example usage
let ascii_bytes = b"hello";
let py_unicode_ptr = pyunicode_ascii(ascii_bytes.as_ptr(), ascii_bytes.len());
// py_unicode_ptr can now be used as a Python Unicode object in FFI calls
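The copy-then-terminate pattern for the ASCII case can be sketched without a Python runtime. In the example below a `Vec<u8>` stands in for the data area that `PyUnicode_New` would provide; `copy_ascii_with_nul` is a hypothetical helper for illustration, not a function from this file.

```rust
/// Copy ASCII bytes into a freshly allocated buffer and append a 0 terminator.
/// The `Vec` is a stand-in for the storage a real Python object would own.
pub fn copy_ascii_with_nul(src: &[u8]) -> Vec<u8> {
    // Debug-only check, in the same spirit as a debug-time consistency assertion.
    debug_assert!(src.is_ascii());
    let num_chars = src.len();
    let mut dst: Vec<u8> = Vec::with_capacity(num_chars + 1);
    let data_ptr = dst.as_mut_ptr();
    unsafe {
        // SAFETY: `dst` has capacity for `num_chars + 1` bytes, `src` is valid
        // for `num_chars` reads, and the two allocations cannot overlap.
        core::ptr::copy_nonoverlapping(src.as_ptr(), data_ptr, num_chars);
        core::ptr::write(data_ptr.add(num_chars), 0);
        // SAFETY: all `num_chars + 1` bytes were initialized just above.
        dst.set_len(num_chars + 1);
    }
    dst
}
```

Because ASCII uses one byte per character, the byte length of the input doubles as the character count, which is why the example above can pass `ascii_bytes.len()` as `num_chars`.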

Function: pyunicode_onebyte

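// Signature
pub(crate) fn pyunicode_onebyte(buf: &str, num_chars: usize) -> *mut pyo3_ffi::PyObject;
// Example usage
let s = "héllo"; // 'é' (U+00E9) is in the Latin-1 range, so every character fits in one byte
// num_chars is the number of characters, not the UTF-8 byte length ('é' takes two bytes in UTF-8)
let py_obj = pyunicode_onebyte(s, s.chars().count());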
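As a rough illustration of what one-byte storage implies, the sketch below narrows each character of a `&str` to a single byte. The helper `to_latin1_with_nul` is hypothetical and only models the data that would end up in the buffer; it does not create a Python object.

```rust
/// Narrow each `char` of `s` to one byte (U+0000..=U+00FF) and append a 0 terminator.
/// Returns `None` if any code point does not fit in a single byte.
pub fn to_latin1_with_nul(s: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(s.chars().count() + 1);
    for c in s.chars() {
        // `u8::try_from` fails for any code point above 0xFF.
        out.push(u8::try_from(u32::from(c)).ok()?);
    }
    out.push(0);
    Some(out)
}
```

Note that the input is UTF-8, where `é` occupies two bytes, while the output holds it in one; this is why the character count and the UTF-8 byte length differ for such strings.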

Function: pyunicode_twobyte

// Signature
pub(crate) fn pyunicode_twobyte(buf: &str, num_chars: usize) -> *mut pyo3_ffi::PyObject;
// Example usage
let s = "こんにちは";  // Japanese characters, all within the u16 range (Basic Multilingual Plane)
let py_obj = pyunicode_twobyte(s, s.chars().count());
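For text that stays within the Basic Multilingual Plane, each character maps to exactly one 16-bit unit. The sketch below uses the standard library's `str::encode_utf16` to produce those units and rejects input that would need a surrogate pair. `bmp_units_with_nul` is a hypothetical helper, shown only to illustrate the two-byte layout.

```rust
/// Encode `s` as 16-bit units plus a 0 terminator, assuming every character
/// lies in the Basic Multilingual Plane (U+0000..=U+FFFF).
/// Returns `None` if a surrogate unit appears, which means some character
/// needed two units and the string is not suitable for two-byte storage.
pub fn bmp_units_with_nul(s: &str) -> Option<Vec<u16>> {
    let mut out = Vec::with_capacity(s.len() + 1);
    for unit in s.encode_utf16() {
        if (0xD800..=0xDFFF).contains(&unit) {
            return None;
        }
        out.push(unit);
    }
    out.push(0);
    Some(out)
}
```

For the greeting in the example above, the five characters occupy 15 bytes in UTF-8 but only five 16-bit units here.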

Function: pyunicode_fourbyte

// Signature
pub(crate) fn pyunicode_fourbyte(buf: &str, num_chars: usize) -> *mut pyo3_ffi::PyObject;
// Example usage
let s = "𐍈"; // Gothic letter hwair (U+10348), outside the BMP
let py_obj = pyunicode_fourbyte(s, s.chars().count());
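Characters outside the BMP are where the four-byte width becomes necessary. A minimal sketch, using a hypothetical helper `to_ucs4_with_nul`, shows that each Rust `char` converts to exactly one `u32`, in contrast to UTF-16, where such a character takes two units.

```rust
/// Store each `char` of `s` as one `u32` code point and append a 0 terminator.
/// Every Rust `char` is a Unicode scalar value, so this conversion cannot fail.
pub fn to_ucs4_with_nul(s: &str) -> Vec<u32> {
    s.chars()
        .map(u32::from)
        .chain(core::iter::once(0))
        .collect()
}
```

For U+10348 the string is four bytes in UTF-8 and two units in UTF-16, but a single element in this representation, so `s.chars().count()` is the right value to pass as `num_chars`.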

Important Implementation Details


Interaction with Other Parts of the System


Visual Diagram

flowchart TD
    A["Input: Raw Buffer (ASCII / &str)"] --> B[pyunicode_ascii / pyunicode_onebyte / pyunicode_twobyte / pyunicode_fourbyte]
    B --> C["Call PyUnicode_New(size, maxchar)"]
    C --> D[Get pointer to Unicode buffer after header]
    D --> E["Copy characters into buffer (u8, u16, or u32)"]
    E --> F[Write null terminator]
    F --> G[validate_str! to check consistency]
    G --> H[Return pointer to PyObject]

Summary

[pyunicode_new.rs](/projects/287/67765) implements specialized constructors for Python Unicode objects at the FFI layer in Rust, one for each supported character width. It relies on `PyUnicode_New` and on Python’s internal Unicode memory layout to create new Python strings from Rust data buffers efficiently, which is what low-level Python interop and extension modules need. The file pairs unsafe pointer manipulation with a debug-time consistency check (`validate_str!`), and it serves as a foundational utility for the parts of the larger system that bridge Rust and Python string handling.