utf8.rs

Overview

The `utf8.rs` file provides utilities for reading and validating UTF-8 encoded data held by Python objects, for use inside a Rust extension module. Its main job is to extract a byte slice (`&[u8]`) from a Python object that represents textual or binary data, and to confirm that the slice is valid UTF-8 before it is passed on for deserialization.

Specifically, it supports input types including:

bytes
str
bytearray
memoryview

The file includes architecture-specific optimizations for UTF-8 validation leveraging SIMD instructions on supported CPUs (x86_64 with AVX2, aarch64 with NEON). It ensures that inputs are contiguous and valid UTF-8, returning descriptive errors otherwise.

This file supplies the deserialization subsystem with one primitive: converting a Python object into a validated UTF-8 byte slice for further processing.

Detailed Documentation

Functions

`is_valid_utf8(buf: &[u8]) -> bool`

**Purpose:** Performs UTF-8 validation on a byte slice, choosing the most efficient method available depending on the target CPU architecture and features.

**Parameters:**

buf: &[u8] — a byte slice to validate.

**Returns:**

bool — true if the buffer is valid UTF-8, otherwise false.

**Description:**

On x86_64 compiled without the AVX2 target feature, checks for AVX2 at runtime and uses the SIMD validator if the CPU supports it; otherwise falls back to encoding_rs.
On x86_64 compiled with the AVX2 target feature, uses the simdutf8 crate's AVX2-accelerated validation directly.
On aarch64, uses NEON SIMD acceleration via simdutf8.
On other architectures, uses the standard library's std::str::from_utf8.
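
The dispatch described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the file's actual code: `validate_simd_stand_in` and `validate_fallback` are hypothetical placeholders that both call `std::str::from_utf8`, standing in for the `simdutf8` and `encoding_rs` calls, which are not reproduced here. Only the x86_64 runtime-detection branch is shown.

```rust
/// Hypothetical stand-in for a SIMD-accelerated validator (assumption:
/// the real file calls into the `simdutf8` crate at this point).
#[allow(dead_code)]
fn validate_simd_stand_in(buf: &[u8]) -> bool {
    std::str::from_utf8(buf).is_ok()
}

/// Hypothetical stand-in for the portable fallback validator.
fn validate_fallback(buf: &[u8]) -> bool {
    std::str::from_utf8(buf).is_ok()
}

/// Sketch of combining compile-time and runtime dispatch.
fn is_valid_utf8_sketch(buf: &[u8]) -> bool {
    // This block is compiled only on x86_64; every other target skips it
    // and goes straight to the fallback below.
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime CPU feature check provided by the standard library.
        if std::arch::is_x86_feature_detected!("avx2") {
            return validate_simd_stand_in(buf);
        }
    }
    validate_fallback(buf)
}
```

Because both placeholders perform the same check, the sketch returns the same answer on every CPU; only the selection logic is the point here.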

**Usage Example:**

let buf = b"hello";
assert!(is_valid_utf8(buf));

let invalid_buf = b"\xFF\xFF";
assert!(!is_valid_utf8(invalid_buf));

**Implementation Details:** This function is conditionally compiled with `#[cfg]` attributes for different architectures and CPU features. It leverages the `simdutf8` crate for SIMD-based validation when possible, providing high-performance UTF-8 checks.

`read_input_to_buf(ptr: *mut pyo3_ffi::PyObject) -> Result<&'static [u8], DeserializeError<'static>>`

**Purpose:** Extracts a validated UTF-8 byte slice from the Python object behind the given pointer. Supports several Python types that can hold UTF-8 data.

**Parameters:**

ptr: *mut pyo3_ffi::PyObject — raw pointer to a Python object.

**Returns:**

Ok(&'static [u8]) — a validated UTF-8 byte slice referencing the Python object's internal buffer.
Err(DeserializeError) — if the input is invalid, not UTF-8, or unsupported.

**Supported Python Types:**

| Python Type | Rust Type Constant | Validation Details |
|---|---|---|
| `bytes` | `BYTES_TYPE` | Checks UTF-8 validity of the bytes content buffer. |
| `str` | `STR_TYPE` | Converts Python `str` to a Rust UTF-8 slice directly. |
| `memoryview` | `MEMORYVIEW_TYPE` | Checks C-contiguity, then validates UTF-8. |
| `bytearray` | `BYTEARRAY_TYPE` | Checks UTF-8 validity of the bytearray buffer. |

**Error Cases:**

Input is not one of the supported types.
UTF-8 validation fails.
Memoryview buffer is not C-contiguous.
Input is empty (zero-length).
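
The type handling and error cases listed above can be modeled without any FFI. The sketch below is an assumption-laden illustration: `PyInput`, `read_input_sketch`, `validated`, and the error strings are all hypothetical, and the real function inspects a raw `PyObject` pointer and returns `DeserializeError` rather than matching on a Rust enum.

```rust
/// Hypothetical model of the supported Python inputs.
enum PyInput<'a> {
    Bytes(&'a [u8]),
    Str(&'a str),
    MemoryView { data: &'a [u8], c_contiguous: bool },
    ByteArray(&'a [u8]),
    Other,
}

/// Returns the buffer unchanged if it is valid UTF-8.
fn validated(buf: &[u8]) -> Result<&[u8], &'static str> {
    if std::str::from_utf8(buf).is_ok() {
        Ok(buf)
    } else {
        Err("invalid UTF-8")
    }
}

/// Sketch of the per-type handling followed by the empty-input check.
fn read_input_sketch<'a>(input: PyInput<'a>) -> Result<&'a [u8], &'static str> {
    let buf: &'a [u8] = match input {
        // A `str` is already UTF-8, so no validation pass is needed.
        PyInput::Str(s) => s.as_bytes(),
        PyInput::Bytes(b) | PyInput::ByteArray(b) => validated(b)?,
        PyInput::MemoryView { data, c_contiguous } => {
            if !c_contiguous {
                return Err("memoryview is not C-contiguous");
            }
            validated(data)?
        }
        PyInput::Other => return Err("unsupported input type"),
    };
    // Empty input is rejected last, whatever the source type was.
    if buf.is_empty() {
        Err("input is empty")
    } else {
        Ok(buf)
    }
}
```

Modeling the input as an enum keeps the example testable in plain Rust; the order of checks follows the flowchart later in this document.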

**Usage Example:**

// Assuming `py_obj` is a pointer to some Python object known at runtime
match read_input_to_buf(py_obj) {
    Ok(buf) => {
        // buf is a validated UTF-8 byte slice, safe to use for deserialization
        println!("Input UTF-8 bytes length: {}", buf.len());
    }
    Err(e) => {
        eprintln!("Failed to read input: {:?}", e);
    }
}

**Implementation Details:**

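Uses unsafe Rust code to cast raw Python buffers to byte slices. The `'static` lifetime on the returned slice is nominal: the slice points into the Python object's buffer, so it is only usable while that object is alive and unmodified.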
Uses custom macros/functions like is_type!, ob_type!, and ffi! to interact with Python C-API types and functions.
Returns descriptive DeserializeError with borrowed static error messages on failure.
Checks for empty input explicitly to avoid processing empty documents.
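
One way to carry borrowed static messages in an error type is `std::borrow::Cow`. The sketch below assumes a hypothetical `SketchError` type and a hypothetical `EMPTY_INPUT` message; the real `DeserializeError` definition and message text are not shown in this document and may differ.

```rust
use std::borrow::Cow;

/// Hypothetical error type holding a borrowed or owned message.
#[derive(Debug, PartialEq)]
struct SketchError<'a> {
    message: Cow<'a, str>,
}

impl<'a> SketchError<'a> {
    /// Hypothetical constructor for an "invalid input" error.
    fn invalid(message: Cow<'a, str>) -> Self {
        SketchError { message }
    }
}

/// Hypothetical static message; borrowing it avoids any allocation.
const EMPTY_INPUT: &str = "input is empty";

/// Rejects zero-length buffers explicitly, as the last item above describes.
fn reject_empty(buf: &[u8]) -> Result<&[u8], SketchError<'static>> {
    if buf.is_empty() {
        Err(SketchError::invalid(Cow::Borrowed(EMPTY_INPUT)))
    } else {
        Ok(buf)
    }
}
```

A `Cow::Borrowed` of a `&'static str` costs nothing on the error path, while the same type can still hold an owned `String` where a message must be built at runtime.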

Important Implementation Details

Architecture-Specific UTF-8 Validation:
The file uses conditional compilation and runtime CPU feature detection to choose between SIMD accelerated UTF-8 validation and fallback methods. This design optimizes performance on modern CPUs while maintaining compatibility.
Unsafe Buffer Access:
Access to Python internal buffers is done via raw pointers and unsafe code blocks, necessary for performance and direct memory access. Proper validation and error handling guard against invalid inputs.
Error Handling:
`read_input_to_buf` reports each failure as a `DeserializeError` carrying a static message, so callers upstream can tell an unsupported type, invalid UTF-8, a non-contiguous memoryview, and empty input apart.
Type Checking:
The file uses internal macros to check Python type pointers against known type constants (BYTES_TYPE, STR_TYPE, etc.) to determine the correct extraction method.
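
The unsafe buffer access described above comes down to building a slice from a pointer and a length. The following is a minimal sketch under stated assumptions: `ssize_to_usize` and `slice_from_raw` are hypothetical helpers, and the real file obtains the pointer and length from Python C API calls rather than from a Rust `Vec`.

```rust
/// Hypothetical helper: converts a signed length (Python's `Py_ssize_t`
/// corresponds to `isize`) into `usize`, rejecting negative values.
fn ssize_to_usize(len: isize) -> Option<usize> {
    if len < 0 {
        None
    } else {
        Some(len as usize)
    }
}

/// Builds a byte slice from a raw pointer and a signed length.
///
/// # Safety
/// `ptr` must point to `len` readable, initialized bytes that stay alive
/// and unmodified for the whole lifetime `'a`.
unsafe fn slice_from_raw<'a>(ptr: *const u8, len: isize) -> Option<&'a [u8]> {
    let len = ssize_to_usize(len)?;
    // `from_raw_parts` requires a non-null pointer even for empty slices.
    if ptr.is_null() {
        return None;
    }
    // SAFETY: the caller guarantees the pointer and length are valid.
    Some(unsafe { std::slice::from_raw_parts(ptr, len) })
}
```

The compiler cannot check the lifetime `'a` here; the caller has to keep the owning object alive for as long as the slice is used.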

Interactions With Other Parts of the System

Deserialization Module:
This file supports the deserialization pipeline by providing the initial step of reading UTF-8 input from Python objects. The returned byte slices are consumed by parsers or other deserialization components.
Python FFI Layer (pyo3_ffi):
Interacts heavily with Python C API via FFI, using pointers and Python internal structures like PyBytes, PyStr, PyMemoryView, and PyByteArray.
Utility Modules:
Uses utility functions like isize_to_usize for safe size conversions and constants such as INVALID_STR for error messages.
String Handling (PyStr):
Converts Python str objects to Rust UTF-8 slices via the PyStr abstraction.

Visual Diagram

The following flowchart depicts the main function `read_input_to_buf` and its decision logic based on the input Python object type, including UTF-8 validation steps:

flowchart TD
    A[Start: Python object pointer] --> B{Determine Python Type}
    B -->|bytes| C[Extract bytes buffer]
    B -->|str| D[Extract str buffer]
    B -->|memoryview| E[Check C-contiguity]
    B -->|bytearray| F[Extract bytearray buffer]
    B -->|other| G[Return Error: Unsupported type]

    C --> H{Validate UTF-8}
    D --> I[Get UTF-8 slice from PyStr]
    E --> J{Validate UTF-8}
    F --> K{Validate UTF-8}
    
    H --> L{Valid?}
    J --> M{Valid?}
    K --> N{Valid?}

    L -->|Yes| O{Buffer empty?}
    L -->|No| P[Return Error: Invalid UTF-8]

    M -->|Yes| O
    M -->|No| P

    N -->|Yes| O
    N -->|No| P

    I --> O

    O -->|Empty| Q[Return Error: Empty input]
    O -->|Non-empty| R["Return Ok(&[u8])"]

    P --> S["Return Err(DeserializeError)"]
    Q --> S

Summary

The `utf8.rs` file converts Python objects into validated UTF-8 byte slices, using architecture-specific validation and direct access to Python C API buffers. It rejects unsupported types, invalid UTF-8, non-contiguous memoryviews, and empty input, and serves as the first step of the deserialization workflow.