utf8.rs
Overview
The `utf8.rs` file provides utilities for validating and reading UTF-8 encoded data from various Python object types within a Rust extension module context. Its primary functionality is to safely extract a UTF-8 byte slice (`&[u8]`) from Python objects that represent textual or binary data, ensuring the data is valid UTF-8 and suitable for deserialization.
Specifically, it supports input types including:
`bytes`
`str`
`bytearray`
`memoryview`
The file includes architecture-specific optimizations for UTF-8 validation that use SIMD instructions on supported CPUs (x86_64 with AVX2, aarch64 with NEON). It checks that `memoryview` inputs are C-contiguous and that the extracted bytes are valid UTF-8, and returns a descriptive error otherwise.
This file plays a critical role in the deserialization subsystem by providing a fundamental primitive: converting Python objects into validated UTF-8 byte slices for further processing.
Detailed Documentation
Functions
is_valid_utf8(buf: &[u8]) -> bool
**Purpose:** Performs UTF-8 validation on a byte slice, choosing the most efficient method available depending on the target CPU architecture and features.
**Parameters:**
`buf: &[u8]`: a byte slice to validate.
**Returns:**
`bool`: `true` if `buf` is valid UTF-8, `false` otherwise.
**Description:**
On `x86_64` builds compiled without the AVX2 target feature, it checks for AVX2 at runtime and uses AVX2 SIMD validation if the CPU supports it; otherwise it falls back to `encoding_rs`.
On `x86_64` builds compiled with AVX2 enabled, it uses the `simdutf8` crate's AVX2-accelerated UTF-8 validation.
On `aarch64`, it uses NEON SIMD acceleration via `simdutf8`.
On other architectures, it uses the standard library's `std::str::from_utf8`.
**Usage Example:**
```rust
let buf = b"hello";
assert!(is_valid_utf8(buf));

let invalid_buf = b"\xFF\xFF";
assert!(!is_valid_utf8(invalid_buf));
```
**Implementation Details:** This function is conditionally compiled with `#[cfg]` attributes for different architectures and CPU features. It leverages the `simdutf8` crate for SIMD-based validation when possible, providing high-performance UTF-8 checks.
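The dispatch pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the file's actual code: `simd_validate` and `fallback_validate` are hypothetical stand-ins for the `simdutf8` and `encoding_rs` calls, and both simply delegate to `std::str::from_utf8` so the example builds without external crates. The name `is_valid_utf8_sketch` is likewise an assumption.

```rust
/// Hypothetical stand-in for a SIMD validator (the real code would call `simdutf8`).
#[allow(dead_code)]
fn simd_validate(buf: &[u8]) -> bool {
    std::str::from_utf8(buf).is_ok()
}

/// Hypothetical stand-in for the non-SIMD fallback validator.
#[allow(dead_code)]
fn fallback_validate(buf: &[u8]) -> bool {
    std::str::from_utf8(buf).is_ok()
}

/// x86_64 built without the AVX2 target feature: decide at runtime.
#[cfg(all(target_arch = "x86_64", not(target_feature = "avx2")))]
pub fn is_valid_utf8_sketch(buf: &[u8]) -> bool {
    if std::arch::is_x86_feature_detected!("avx2") {
        simd_validate(buf)
    } else {
        fallback_validate(buf)
    }
}

/// x86_64 built with AVX2, or aarch64: the SIMD path is fixed at compile time.
#[cfg(any(
    all(target_arch = "x86_64", target_feature = "avx2"),
    target_arch = "aarch64"
))]
pub fn is_valid_utf8_sketch(buf: &[u8]) -> bool {
    simd_validate(buf)
}

/// Every other architecture: plain standard-library validation.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub fn is_valid_utf8_sketch(buf: &[u8]) -> bool {
    fallback_validate(buf)
}
```

Exactly one of the three definitions is compiled for any given target, so callers see a single function regardless of architecture. Only the first variant pays for a runtime feature check.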
read_input_to_buf(ptr: *mut pyo3_ffi::PyObject) -> Result<&'static [u8], DeserializeError<'static>>
**Purpose:** Extracts a UTF-8 validated byte slice from a given Python object pointer. Supports multiple Python types that can contain UTF-8 data.
**Parameters:**
`ptr: *mut pyo3_ffi::PyObject`: raw pointer to a Python object.
**Returns:**
`Ok(&'static [u8])`: a validated UTF-8 byte slice referencing the Python object's internal buffer.
`Err(DeserializeError)`: if the input is invalid, not UTF-8, or unsupported.
**Supported Python Types:**
| Python Type | Rust Type Constant | Validation Details |
|---|---|---|
| `bytes` | `BYTES_TYPE` | Checks UTF-8 validity on bytes content buffer. |
| `str` | `STR_TYPE` | Converts Python `str` to Rust UTF-8 slice directly. |
| `memoryview` | `MEMORYVIEW_TYPE` | Checks C-contiguousness, then validates UTF-8. |
| `bytearray` | `BYTEARRAY_TYPE` | Checks UTF-8 validity on bytearray buffer. |
**Error Cases:**
Input is not one of the supported types.
UTF-8 validation fails.
Memoryview buffer is not C-contiguous.
Input is empty (zero-length).
**Usage Example:**
```rust
// Assuming `py_obj` is a pointer to some Python object known at runtime
match read_input_to_buf(py_obj) {
    Ok(buf) => {
        // buf is a validated UTF-8 byte slice, safe to use for deserialization
        println!("Input UTF-8 bytes length: {}", buf.len());
    }
    Err(e) => {
        eprintln!("Failed to read input: {:?}", e);
    }
}
```
**Implementation Details:**
Uses unsafe Rust code to cast raw Python buffers to byte slices.
Uses custom macros/functions like `is_type!`, `ob_type!`, and `ffi!` to interact with Python C-API types and functions.
Returns descriptive `DeserializeError` values with borrowed static error messages on failure.
Checks for empty input explicitly to avoid processing empty documents.
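The control flow listed above can be modeled without any Python FFI. The following is a sketch under stated assumptions: the `Input` enum, the `SketchError` type, the `validate` helper, and every error message are hypothetical and exist only to make the branching visible. The real function inspects a `*mut pyo3_ffi::PyObject` and returns `DeserializeError`.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the real error type; only the message is modeled.
#[derive(Debug, PartialEq)]
pub struct SketchError<'a> {
    pub message: Cow<'a, str>,
}

/// Hypothetical model of the four supported inputs plus "anything else".
pub enum Input<'a> {
    Bytes(&'a [u8]),
    Str(&'a str),
    MemoryView { data: &'a [u8], c_contiguous: bool },
    ByteArray(&'a [u8]),
    Other,
}

/// Build an error that borrows a static message instead of allocating.
fn err(message: &'static str) -> SketchError<'static> {
    SketchError { message: Cow::Borrowed(message) }
}

/// Validate raw bytes as UTF-8 (std stand-in for the SIMD validator).
fn validate(buf: &[u8]) -> Result<&[u8], SketchError<'static>> {
    if std::str::from_utf8(buf).is_ok() {
        Ok(buf)
    } else {
        Err(err("invalid UTF-8"))
    }
}

pub fn read_input_sketch<'a>(input: Input<'a>) -> Result<&'a [u8], SketchError<'static>> {
    let buffer = match input {
        // Byte containers must be validated before use.
        Input::Bytes(b) | Input::ByteArray(b) => validate(b)?,
        // A `str` is already UTF-8, so no validation pass is needed here.
        Input::Str(s) => s.as_bytes(),
        // A memoryview must be C-contiguous before its bytes are validated.
        Input::MemoryView { data, c_contiguous } => {
            if !c_contiguous {
                return Err(err("memoryview is not C-contiguous"));
            }
            validate(data)?
        }
        Input::Other => return Err(err("unsupported input type")),
    };
    // The empty check runs last and applies to every input type.
    if buffer.is_empty() {
        Err(err("input is empty"))
    } else {
        Ok(buffer)
    }
}
```

In this sketch the returned slice carries the lifetime of the input, whereas the documented signature returns `&'static [u8]`.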
Important Implementation Details
Architecture-Specific UTF-8 Validation:
The file uses conditional compilation and runtime CPU feature detection to choose between SIMD accelerated UTF-8 validation and fallback methods. This design optimizes performance on modern CPUs while maintaining compatibility.
Unsafe Buffer Access:
Access to Python internal buffers is done via raw pointers and unsafe code blocks, necessary for performance and direct memory access. Proper validation and error handling guard against invalid inputs.
Error Handling:
The function returns well-defined `DeserializeError` variants with clear messages, facilitating debugging and robust deserialization logic upstream.
Type Checking:
The file uses internal macros to check Python type pointers against known type constants (`BYTES_TYPE`, `STR_TYPE`, etc.) to determine the correct extraction method.
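To make the unsafe buffer access concrete, here is a minimal sketch of turning a raw pointer and a signed length into a byte slice. It assumes nothing about the Python C API: `buffer_from_raw` and `isize_to_usize_sketch` are hypothetical names, the latter standing in for the `isize_to_usize` utility this document mentions, whose real implementation is not shown here.

```rust
use std::slice;

/// Hypothetical stand-in for a size-conversion helper: C APIs often report
/// lengths as signed integers, while Rust slices need `usize`.
fn isize_to_usize_sketch(len: isize) -> usize {
    debug_assert!(len >= 0, "buffer lengths are expected to be non-negative");
    len as usize
}

/// Reborrow `len` bytes starting at `ptr` as a slice.
///
/// # Safety
/// `ptr` must be non-null and valid for reads of `len` bytes, and the memory
/// must stay alive and unmodified for as long as the returned slice is used.
pub unsafe fn buffer_from_raw<'a>(ptr: *const u8, len: isize) -> &'a [u8] {
    unsafe { slice::from_raw_parts(ptr, isize_to_usize_sketch(len)) }
}
```

The compiler cannot check the safety contract, which is why the caller is responsible for keeping the owning object alive while the slice is in use.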
Interactions With Other Parts of the System
Deserialization Module:
This file supports the deserialization pipeline by providing the initial step of reading UTF-8 input from Python objects. The returned byte slices are consumed by parsers or other deserialization components.
Python FFI Layer (`pyo3_ffi`):
Interacts heavily with Python C API via FFI, using pointers and Python internal structures like `PyBytes`, `PyStr`, `PyMemoryView`, and `PyByteArray`.
Utility Modules:
Uses utility functions like `isize_to_usize` for safe size conversions and constants such as `INVALID_STR` for error messages.
String Handling (`PyStr`):
Converts Python `str` objects to Rust UTF-8 slices via the `PyStr` abstraction.
Visual Diagram
The following flowchart depicts the main function `read_input_to_buf` and its decision logic based on the input Python object type, including UTF-8 validation steps:
```mermaid
flowchart TD
    A[Start: Python object pointer] --> B{Determine Python Type}
    B -->|bytes| C[Extract bytes buffer]
    B -->|str| D[Extract str buffer]
    B -->|memoryview| E{C-contiguous?}
    B -->|bytearray| F[Extract bytearray buffer]
    B -->|other| G[Return Error: Unsupported type]
    C --> H{Validate UTF-8}
    D --> I[Get UTF-8 slice from PyStr]
    E -->|Yes| J{Validate UTF-8}
    E -->|No| T[Return Error: Not C-contiguous]
    F --> K{Validate UTF-8}
    H --> L{Valid?}
    J --> M{Valid?}
    K --> N{Valid?}
    L -->|Yes| O{Buffer empty?}
    L -->|No| P[Return Error: Invalid UTF-8]
    M -->|Yes| O
    M -->|No| P
    N -->|Yes| O
    N -->|No| P
    I --> O
    O -->|Empty| Q[Return Error: Empty input]
    O -->|Non-empty| R["Return Ok(&[u8])"]
    G --> S["Return Err(DeserializeError)"]
    P --> S
    Q --> S
    T --> S
```
Summary
The `utf8.rs` file converts supported Python objects (`bytes`, `str`, `bytearray`, `memoryview`) into validated, non-empty UTF-8 byte slices. It uses SIMD-accelerated validation where the target CPU allows and `std::str::from_utf8` elsewhere, and it reports unsupported types, invalid UTF-8, non-contiguous memoryviews, and empty input as `DeserializeError` values. It is the first step of the deserialization pipeline.