scalar.rs
Overview
`scalar.rs` is a Rust source file responsible for converting Rust string slices (`&str`) into Python Unicode objects (`PyObject*`) at the FFI (Foreign Function Interface) boundary. This conversion is essential for integrating Rust string data with the Python runtime in a performant and memory-efficient manner.
The core functionality determines the internal Unicode kind (ASCII, one-byte, two-byte, or four-byte) required by the input string, then invokes the matching Python Unicode constructor from the `pyunicode_new` module. Selecting the narrowest kind that can hold every character keeps the resulting Python string as compact as its contents allow.
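As a point of reference, the four kinds named above can be tied to the largest code point in the string. The sketch below is an illustration only: the `Kind` enum and `kind_by_max_code_point` are hypothetical names that do not appear in `scalar.rs`, and the function decodes the whole string, which the real implementation avoids.

```rust
/// Hypothetical stand-in for the four internal representations.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Kind {
    Ascii,
    OneByte,
    TwoByte,
    FourByte,
}

/// Picks the narrowest kind by decoding every character and looking at
/// the largest code point. An empty string is treated as ASCII.
pub fn kind_by_max_code_point(s: &str) -> Kind {
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    match max {
        0..=0x7F => Kind::Ascii,        // fits in 7 bits
        0x80..=0xFF => Kind::OneByte,   // Latin-1 range
        0x100..=0xFFFF => Kind::TwoByte, // rest of the Basic Multilingual Plane
        _ => Kind::FourByte,            // supplementary planes
    }
}
```

A full decode like this is the straightforward way to get the answer; the function documented below reaches the same classification from raw byte values instead.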
Detailed Documentation
Imports
use crate::str::pyunicode_new::{
    pyunicode_ascii, pyunicode_fourbyte, pyunicode_onebyte, pyunicode_twobyte,
};
Imports specific Unicode constructors that create Python Unicode objects of different internal representations.
Function: str_impl_kind_scalar
#[inline(never)]
pub(crate) unsafe fn str_impl_kind_scalar(buf: &str) -> *mut pyo3_ffi::PyObject
Description
Converts a Rust string slice (`&str`) to a Python Unicode object pointer (`*mut PyObject`) using a scalar (non-SIMD) code path. It selects the narrowest Unicode kind that can represent the string's content.
Parameters
buf: &str
The Rust string slice to convert.
Returns
*mut pyo3_ffi::PyObject
A raw pointer to a Python Unicode object representing the input string.
Safety
The function is marked `unsafe` because it returns a raw Python object pointer and uses unsafe operations such as raw pointer dereferencing.
The caller must ensure the GIL (Global Interpreter Lock) is held and must manage the lifetime of the returned Python object correctly.
Implementation Details
Count Characters:
Uses `bytecount::num_chars` to count the number of Unicode scalar values (characters) in the byte slice derived from `buf`.
ASCII Shortcut:
If the byte length equals the character count (meaning every character is a single-byte ASCII character), it immediately calls `pyunicode_ascii`.
Non-ASCII Handling:
If the string is not ASCII, the function proceeds as follows.
It uses the `assume!` macro to state that the string length is greater than zero (a non-ASCII string cannot be empty).
It checks the first byte; if that byte is greater than 239, it immediately treats the string as containing four-byte characters and returns the result of `pyunicode_fourbyte`.
Otherwise, it iterates over all bytes in `buf`, setting `is_four` if any byte is > 239 and `not_latin` if any byte is > 195.
It then returns the Unicode object selected by these flags: `pyunicode_fourbyte` if `is_four` is true, `pyunicode_twobyte` if `not_latin` is true, and `pyunicode_onebyte` otherwise.
This heuristic optimizes the Unicode kind selection without fully decoding the UTF-8 string, leveraging byte value thresholds to infer character widths.
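The decision logic described above can be sketched without any Python dependency. This is a minimal illustration under stated assumptions, not the file's actual code: `Constructor` and `classify_scalar` are hypothetical names, the enum stands in for the call to the corresponding `pyunicode_*` function, and `chars().count()` from the standard library replaces `bytecount::num_chars`.

```rust
/// Hypothetical marker for which `pyunicode_*` constructor would be called.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Constructor {
    Ascii,    // pyunicode_ascii
    OneByte,  // pyunicode_onebyte
    TwoByte,  // pyunicode_twobyte
    FourByte, // pyunicode_fourbyte
}

/// Safe sketch of the byte-threshold heuristic.
pub fn classify_scalar(buf: &str) -> Constructor {
    let bytes = buf.as_bytes();
    // Standard-library stand-in for bytecount::num_chars.
    let num_chars = buf.chars().count();

    // One byte per character means every character is ASCII.
    if bytes.len() == num_chars {
        return Constructor::Ascii;
    }

    // A non-ASCII string has at least one byte, so indexing is in bounds.
    if bytes[0] > 239 {
        return Constructor::FourByte;
    }

    let mut is_four = false;
    let mut not_latin = false;
    for &b in bytes {
        is_four |= b > 239; // leading byte of a 4-byte sequence
        not_latin |= b > 195; // leading byte of a character above U+00FF
    }

    if is_four {
        Constructor::FourByte
    } else if not_latin {
        Constructor::TwoByte
    } else {
        Constructor::OneByte
    }
}
```

Under these assumptions, `classify_scalar("café")` yields `Constructor::OneByte`, because the leading byte of `é` is exactly 195 and does not exceed the threshold, while `classify_scalar("Hello, 世界!")` yields `Constructor::TwoByte`.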
Usage Example
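unsafe {
    // The caller must hold the GIL for the duration of this call.
    let rust_str = "Hello, 世界!";
    // Non-ASCII input with no four-byte characters: expected to take the pyunicode_twobyte path.
    let py_unicode_ptr = str_impl_kind_scalar(rust_str);
    // py_unicode_ptr can now be used in Python FFI calls
}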
Important Implementation Notes
The function uses raw pointer access and byte inspection for performance reasons.
The `unlikely!` macro hints to the compiler that certain branches are unlikely to be taken, optimizing branch prediction.
The function is marked `#[inline(never)]` to prevent inlining, possibly to reduce code bloat or to isolate unsafe code.
The function assumes that the input string is valid UTF-8 (guaranteed by the Rust `&str` type).
The byte comparisons (`> 239`, `> 195`) correspond to UTF-8 leading-byte ranges:
Bytes > 239 (0xF0 and above) are the leading bytes of 4-byte UTF-8 sequences (characters outside the Basic Multilingual Plane).
Bytes > 195 (0xC4 and above) are the leading bytes of characters above U+00FF, which do not fit the one-byte (Latin-1) representation. The two-byte leaders 0xC2 and 0xC3 encode U+0080 through U+00FF and stay at or below the threshold.
Continuation bytes lie in the range 128 to 191, so they never exceed either threshold.
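To make the two thresholds concrete, the following sketch prints nothing and talks to no Python API; it only exposes the leading UTF-8 byte of a character so the boundary cases can be checked. `lead_byte` is a hypothetical helper introduced for illustration.

```rust
/// Returns the first (leading) byte of the UTF-8 encoding of `c`.
pub fn lead_byte(c: char) -> u8 {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes()[0]
}

/// Boundary characters on either side of each threshold.
pub fn threshold_examples() -> [(char, u8); 4] {
    [
        ('\u{00FF}', lead_byte('\u{00FF}')),   // 195: last Latin-1 character
        ('\u{0100}', lead_byte('\u{0100}')),   // 196: first character needing two-byte storage
        ('\u{FFFF}', lead_byte('\u{FFFF}')),   // 239: last Basic Multilingual Plane code point
        ('\u{10000}', lead_byte('\u{10000}')), // 240: first supplementary-plane code point
    ]
}
```

The pairs show why the comparisons are strict: 195 and 239 are the last leading bytes that still belong to the narrower representation.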
Interaction with Other Parts of the System
`pyunicode_new` Module:
This function calls into the `pyunicode_new` module's functions to create Python Unicode objects. These functions wrap the Python C API calls that allocate and initialize Unicode objects with different internal memory layouts.
Python FFI Layer:
The returned `*mut pyo3_ffi::PyObject` is a raw pointer to a Python object, used by the PyO3 library or other parts of the Rust-Python bridge to pass strings into the Python runtime.
String Handling Pipeline:
This function is a lower-level utility that may be invoked by higher-level string conversion functions when bridging Rust to Python, particularly when no SIMD optimizations are applicable (hence "scalar").
Byte Counting Dependency:
The function relies on the `bytecount` crate's `num_chars` function to quickly count UTF-8 characters without full decoding.
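Counting characters without decoding is possible because every UTF-8 character contributes exactly one byte that is not a continuation byte. The sketch below shows that general idea using only the standard library; `count_chars` is a hypothetical name, and this is not a claim about how `bytecount` is implemented internally.

```rust
/// Counts UTF-8 characters by counting bytes that are not continuation
/// bytes (continuation bytes have the bit pattern 10xxxxxx).
/// Assumes `bytes` is valid UTF-8, as `str::as_bytes` guarantees.
pub fn count_chars(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}
```

For `"Hello, 世界!"` this gives 10 characters from 14 bytes, and the mismatch between the two numbers is what rules out the ASCII shortcut.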
Visual Diagram
flowchart TD
A["Input Rust &str"] --> B["Count UTF-8 chars (bytecount::num_chars)"]
B --> C{Is ASCII?}
C -- Yes --> D[pyunicode_ascii]
C -- No --> E{First byte > 239?}
E -- Yes --> F[pyunicode_fourbyte]
E -- No --> G[Iterate bytes]
G --> H{Any byte > 239?}
H -- Yes --> F
H -- No --> I{Any byte > 195?}
I -- Yes --> J[pyunicode_twobyte]
I -- No --> K[pyunicode_onebyte]
D --> L[Return PyObject*]
F --> L
J --> L
K --> L
Summary
`scalar.rs` provides `str_impl_kind_scalar`, which converts Rust UTF-8 strings into Python Unicode objects by determining the narrowest suitable internal representation from byte-level thresholds rather than full decoding. It integrates directly with the Python FFI layer and supports string interoperability between Rust and Python in the broader application.