utf8.rs

Overview

The `utf8.rs` file provides utilities for validating and reading UTF-8 encoded data from several Python object types in a Rust extension module. Its primary job is to safely extract a UTF-8 byte slice (`&[u8]`) from Python objects that hold textual or binary data, ensuring the data is valid UTF-8 before it is passed on for deserialization.

Specifically, it supports the following input types:

- `bytes`
- `str`
- `memoryview`
- `bytearray`

The file includes architecture-specific optimizations for UTF-8 validation that use SIMD instructions on supported CPUs (x86_64 with AVX2, aarch64 with NEON). Inputs that are not valid UTF-8, and `memoryview` objects that are not C-contiguous, are rejected with a descriptive error.

This file plays a critical role in the deserialization subsystem by providing a fundamental primitive: converting Python objects into validated UTF-8 byte slices for further processing.


Detailed Documentation

Functions


is_valid_utf8(buf: &[u8]) -> bool

**Purpose:** Performs UTF-8 validation on a byte slice, choosing the most efficient method available depending on the target CPU architecture and features.

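**Parameters:**

- `buf: &[u8]`: the byte slice to validate.

**Returns:**

- `bool`: `true` if `buf` is valid UTF-8, `false` otherwise.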

**Description:**

**Usage Example:**

```rust
let buf = b"hello";
assert!(is_valid_utf8(buf));

let invalid_buf = b"\xFF\xFF";
assert!(!is_valid_utf8(invalid_buf));
```

**Implementation Details:** This function is conditionally compiled with `#[cfg]` attributes for different architectures and CPU features. It leverages the `simdutf8` crate for SIMD-based validation when possible, providing high-performance UTF-8 checks.
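The compile-time selection described here can be sketched with mutually exclusive `#[cfg]` attributes. This is a minimal illustration, not the file's actual code: the name `is_valid_utf8_sketch` is hypothetical, the exact `#[cfg]` predicates are an assumption, and every branch uses `std::str::from_utf8` as a stand-in where the real implementation would call into `simdutf8`.

```rust
// Hypothetical sketch: exactly one of the three definitions below is compiled.
// `std::str::from_utf8` stands in for the SIMD validator in every branch so
// that the example builds without external crates.

/// Compiled when the target is x86_64 with AVX2 enabled at build time.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
pub fn is_valid_utf8_sketch(buf: &[u8]) -> bool {
    // An AVX2-based validator would be called here.
    std::str::from_utf8(buf).is_ok()
}

/// Compiled when the target is aarch64 with NEON enabled at build time.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub fn is_valid_utf8_sketch(buf: &[u8]) -> bool {
    // A NEON-based validator would be called here.
    std::str::from_utf8(buf).is_ok()
}

/// Portable fallback for every other target.
#[cfg(not(any(
    all(target_arch = "x86_64", target_feature = "avx2"),
    all(target_arch = "aarch64", target_feature = "neon")
)))]
pub fn is_valid_utf8_sketch(buf: &[u8]) -> bool {
    std::str::from_utf8(buf).is_ok()
}
```

Because the three predicates are exhaustive and disjoint, callers see a single function regardless of the build target, and the choice costs nothing at run time. Whether the real file selects its path at compile time, at run time, or both is not stated here.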


read_input_to_buf(ptr: *mut pyo3_ffi::PyObject) -> Result<&'static [u8], DeserializeError<'static>>

**Purpose:** Extracts a UTF-8 validated byte slice from a given Python object pointer. Supports multiple Python types that can contain UTF-8 data.

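**Parameters:**

- `ptr: *mut pyo3_ffi::PyObject`: pointer to the Python object to read from.

**Returns:**

- `Ok(&'static [u8])`: the validated UTF-8 bytes of the input.
- `Err(DeserializeError<'static>)`: the input was rejected.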

**Supported Python Types:**

| Python Type | Rust Type Constant | Validation Details |
| --- | --- | --- |
| `bytes` | `BYTES_TYPE` | Checks UTF-8 validity on bytes content buffer. |
| `str` | `STR_TYPE` | Converts Python `str` to Rust UTF-8 slice directly. |
| `memoryview` | `MEMORYVIEW_TYPE` | Checks C-contiguousness, then validates UTF-8. |
| `bytearray` | `BYTEARRAY_TYPE` | Checks UTF-8 validity on bytearray buffer. |

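**Error Cases:**

- The object is not one of the supported types.
- A `memoryview` is not C-contiguous.
- A `bytes`, `memoryview`, or `bytearray` buffer contains invalid UTF-8.
- The input is empty.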

**Usage Example:**

```rust
// Assuming `py_obj` is a pointer to some Python object known at runtime
match read_input_to_buf(py_obj) {
    Ok(buf) => {
        // buf is a validated UTF-8 byte slice, safe to use for deserialization
        println!("Input UTF-8 bytes length: {}", buf.len());
    }
    Err(e) => {
        eprintln!("Failed to read input: {:?}", e);
    }
}
```

**Implementation Details:**


Important Implementation Details


Interactions With Other Parts of the System


Visual Diagram

The following flowchart depicts the main function `read_input_to_buf` and its decision logic based on the input Python object type, including UTF-8 validation steps:

```mermaid
flowchart TD
    A[Start: Python object pointer] --> B{Determine Python Type}
    B -->|bytes| C[Extract bytes buffer]
    B -->|str| D[Extract str buffer]
    B -->|memoryview| E[Check C-contiguity]
    B -->|bytearray| F[Extract bytearray buffer]
    B -->|other| G[Return Error: Unsupported type]

    C --> H{Validate UTF-8}
    D --> I[Get UTF-8 slice from PyStr]
    E --> J{Validate UTF-8}
    F --> K{Validate UTF-8}

    H --> L{Valid?}
    J --> M{Valid?}
    K --> N{Valid?}

    L -->|Yes| O[Check empty buffer]
    L -->|No| P[Return Error: Invalid UTF-8]

    M -->|Yes| O
    M -->|No| P

    N -->|Yes| O
    N -->|No| P

    I --> O

    O -->|Empty| Q[Return Error: Empty input]
    O -->|Non-empty| R["Return Ok(&[u8])"]

    G --> S["Return Err(DeserializeError)"]
    P --> S
    Q --> S
```
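
The decision logic in the flowchart can be modeled in plain Rust without the Python C API. The sketch below is an illustration under stated assumptions, not the file's code: `Input`, `ReadError`, `validated`, and `read_input_sketch` are hypothetical names, the enum replaces the runtime type check on the object pointer, and the `NotContiguous` branch assumes that a failed C-contiguity check is reported as an error.

```rust
/// Hypothetical stand-in for the Python object kinds the flowchart branches on.
pub enum Input<'a> {
    Bytes(&'a [u8]),
    Str(&'a str),
    MemoryView { data: &'a [u8], c_contiguous: bool },
    ByteArray(&'a [u8]),
    Other,
}

/// Hypothetical error type; the real function returns `DeserializeError`.
#[derive(Debug, PartialEq)]
pub enum ReadError {
    UnsupportedType,
    NotContiguous,
    InvalidUtf8,
    EmptyInput,
}

/// Standard-library validation stands in for `is_valid_utf8`.
fn validated(buf: &[u8]) -> Result<&[u8], ReadError> {
    std::str::from_utf8(buf)
        .map(|s| s.as_bytes())
        .map_err(|_| ReadError::InvalidUtf8)
}

pub fn read_input_sketch<'a>(input: Input<'a>) -> Result<&'a [u8], ReadError> {
    let buf = match input {
        // Raw byte containers must be validated before use.
        Input::Bytes(b) | Input::ByteArray(b) => validated(b)?,
        // A Rust `&str` is always valid UTF-8, so no second check is needed.
        Input::Str(s) => s.as_bytes(),
        // Contiguity is checked first, then the bytes are validated.
        Input::MemoryView { data, c_contiguous } => {
            if !c_contiguous {
                return Err(ReadError::NotContiguous);
            }
            validated(data)?
        }
        Input::Other => return Err(ReadError::UnsupportedType),
    };
    // The empty-buffer check is shared by every successful branch.
    if buf.is_empty() {
        Err(ReadError::EmptyInput)
    } else {
        Ok(buf)
    }
}
```

Placing the empty-buffer check after the `match` mirrors the single "Check empty buffer" node in the diagram: each type-specific branch only has to produce a validated slice, and the final check is written once.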

Summary

The `utf8.rs` file converts Python objects into validated UTF-8 byte slices, using architecture-optimized validation and careful handling of Python C API types. It serves as a foundational building block of the deserialization workflow's data input pipeline.