scalar.rs

Overview

`scalar.rs` is a Rust source file responsible for converting Rust string slices (`&str`) into Python Unicode objects (`PyObject*`) at the FFI (Foreign Function Interface) boundary. This conversion is essential for integrating Rust string data with the Python runtime in a performant and memory-efficient manner.

The core functionality determines the internal Unicode kind (ASCII, one-byte, two-byte, or four-byte) required by the input string, then invokes the matching Python Unicode constructor function from the `pyunicode_new` module. The resulting Python Unicode object therefore uses a storage width wide enough for every character in the Rust string.
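To make the four kinds concrete, here is a minimal sketch of how a kind could be chosen by fully decoding the string and looking at its largest code point. The ranges follow CPython's flexible string representation. The names `Kind` and `kind_by_decoding` are hypothetical and are not part of `scalar.rs`, which avoids a full decode.

```rust
/// Hypothetical stand-in for the four storage kinds named above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Kind {
    Ascii,    // every code point <= U+007F
    OneByte,  // largest code point <= U+00FF
    TwoByte,  // largest code point <= U+FFFF
    FourByte, // some code point >= U+10000
}

/// Reference classification: decode every char and inspect the maximum.
/// An empty string has no chars and is treated as ASCII.
pub fn kind_by_decoding(s: &str) -> Kind {
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    match max {
        0..=0x7F => Kind::Ascii,
        0x80..=0xFF => Kind::OneByte,
        0x100..=0xFFFF => Kind::TwoByte,
        _ => Kind::FourByte,
    }
}
```

Under these assumptions, `kind_by_decoding("café")` yields `Kind::OneByte` and `kind_by_decoding("Hello, 世界!")` yields `Kind::TwoByte`. This version is easy to read but decodes every character, which is the cost the byte-level approach in this file is meant to avoid.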


Detailed Documentation

Imports

use crate::str::pyunicode_new::{
    pyunicode_ascii, pyunicode_fourbyte, pyunicode_onebyte, pyunicode_twobyte,
};

Function: str_impl_kind_scalar

#[inline(never)]
pub(crate) unsafe fn str_impl_kind_scalar(buf: &str) -> *mut pyo3_ffi::PyObject

Description

Converts a Rust string slice (`&str`) to a Python Unicode object pointer (`*mut PyObject`) using a scalar (non-SIMD) approach. It selects the Unicode kind representation based on the string's content.

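Parameters

  • buf: &str - the Rust string slice to convert.

Returns

  • *mut pyo3_ffi::PyObject - pointer to the Python Unicode object produced by the selected pyunicode_* constructor.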

Safety

Implementation Details

  1. Count Characters:
    Uses bytecount::num_chars to count the number of Unicode scalar values (characters) in the byte slice derived from buf.

  2. ASCII Shortcut:
    If the byte length equals the character count (meaning all characters are single-byte ASCII), it immediately calls pyunicode_ascii.

  3. Non-ASCII Handling:
    If not ASCII, the function:

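    • States via the assume! macro that the string length is greater than zero. This always holds here, because an empty string has equal byte and character counts and has already taken the ASCII path.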

    • Checks the first byte; if it is greater than 239, it immediately treats the string as containing four-byte characters, returning pyunicode_fourbyte.

    • Otherwise, iterates over all bytes in buf:

      • Sets is_four if any byte is > 239.

      • Sets not_latin if any byte is > 195.

    • Returns the corresponding Unicode object based on these flags:

      • pyunicode_fourbyte if is_four is true.

      • pyunicode_twobyte if not_latin is true.

      • Otherwise, pyunicode_onebyte.
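The steps above can be sketched in safe, standard-library-only Rust. This is an illustration under stated assumptions, not the file's implementation. The `Chosen` enum stands in for the four `pyunicode_*` constructors, and `count_chars` is a hypothetical substitute for `bytecount::num_chars`, assumed to return the same count for valid UTF-8.

```rust
/// Hypothetical label for the constructor that would be called.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Chosen {
    Ascii,
    OneByte,
    TwoByte,
    FourByte,
}

/// Counts chars as the bytes that are not UTF-8 continuation bytes
/// (0b10xxxxxx). Stand-in for `bytecount::num_chars`.
fn count_chars(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

pub fn choose_kind(buf: &str) -> Chosen {
    let bytes = buf.as_bytes();
    // Steps 1 and 2: equal byte and char counts mean every char is one byte.
    if bytes.len() == count_chars(bytes) {
        return Chosen::Ascii;
    }
    // Step 3: the buffer is non-empty here (an empty one is "ASCII" above),
    // so reading the first byte cannot go out of bounds.
    if bytes[0] > 239 {
        return Chosen::FourByte;
    }
    // Branch-free accumulation of both flags over every byte.
    let mut is_four = false;
    let mut not_latin = false;
    for &b in bytes {
        is_four |= b > 239;
        not_latin |= b > 195;
    }
    if is_four {
        Chosen::FourByte
    } else if not_latin {
        Chosen::TwoByte
    } else {
        Chosen::OneByte
    }
}
```

For example, `choose_kind("café")` returns `Chosen::OneByte` because `é` encodes as the bytes 195 and 169, neither of which exceeds 195. The loop does not exit early; accumulating flags with `|=` keeps the body free of branches.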

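This check selects the Unicode kind without fully decoding the UTF-8 string, using byte value thresholds to infer character widths. It relies on UTF-8 lead-byte ranges: continuation bytes always fall in 128 to 191, so only a lead byte can exceed either threshold. A byte greater than 195 begins a character at or above U+0100, which does not fit the one-byte kind, and a byte greater than 239 begins a character at or above U+10000, which requires the four-byte kind.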
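The thresholds 195 and 239 can be checked against the UTF-8 lead bytes at each kind boundary. The helpers `lead_byte` and `is_continuation` below are hypothetical names for illustration and use only the standard library's `char::encode_utf8`.

```rust
/// Returns the first (lead) byte of a char's UTF-8 encoding.
pub fn lead_byte(c: char) -> u8 {
    let mut tmp = [0u8; 4];
    c.encode_utf8(&mut tmp).as_bytes()[0]
}

/// True if the byte is a UTF-8 continuation byte (0x80..=0xBF).
pub fn is_continuation(b: u8) -> bool {
    (b & 0xC0) == 0x80
}
```

U+00FF has lead byte 195 and U+0100 has lead byte 196, so `> 195` separates the one-byte kind from the wider kinds. U+FFFF has lead byte 239 and U+10000 has lead byte 240, so `> 239` marks the four-byte kind. Continuation bytes never exceed 191, so they cannot trigger either comparison.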

Usage Example

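// Assumption: the Python interpreter is initialized and the GIL is held,
// as is generally required when creating Python objects through the C API.
unsafe {
    let rust_str = "Hello, 世界!";
    // '世' and '界' encode as three-byte sequences with lead bytes 0xE4 and
    // 0xE7 (above 195, not above 239), so pyunicode_twobyte is selected.
    let py_unicode_ptr = str_impl_kind_scalar(rust_str);
    // py_unicode_ptr can now be used in Python FFI calls
}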

Important Implementation Notes


Interaction with Other Parts of the System


Visual Diagram

flowchart TD
    A[Input Rust &str] --> B["Count UTF-8 chars (bytecount::num_chars)"]
    B --> C{Is ASCII?}
    C -- Yes --> D[pyunicode_ascii]
    C -- No --> E{First byte > 239?}
    E -- Yes --> F[pyunicode_fourbyte]
    E -- No --> G[Iterate bytes]
    G --> H{Any byte > 239?}
    H -- Yes --> F[pyunicode_fourbyte]
    H -- No --> I{Any byte > 195?}
    I -- Yes --> J[pyunicode_twobyte]
    I -- No --> K[pyunicode_onebyte]
    D --> L[Return PyObject*]
    F --> L
    J --> L
    K --> L

Summary

`scalar.rs` provides `str_impl_kind_scalar`, which converts Rust UTF-8 strings into Python Unicode objects by selecting the appropriate internal Unicode representation. It does so with byte-level threshold checks rather than a full decode, then hands the string to the matching constructor from the `pyunicode_new` module at the Python FFI layer.