scalar.rs

Overview

`scalar.rs` is a Rust source file that converts Rust string slices (`&str`) into Python Unicode objects (`PyObject*`) at the FFI (Foreign Function Interface) boundary. The conversion lets Rust string data be handed to the Python runtime using the narrowest storage the string's contents allow.

The core logic determines the internal Unicode kind (ASCII, one-byte, two-byte, or four-byte) of the input string, then invokes the matching Python Unicode constructor from the `pyunicode_new` module. This ensures the Python Unicode object is created with a per-character width large enough to hold every code point in the Rust string, and no larger.
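
As a point of reference, the mapping from string contents to kind can be sketched by decoding every character and looking at the largest code point. This is only an illustration of the rule, not the approach taken in `scalar.rs`; the `Width` enum and `width_by_decoding` function are hypothetical names introduced here.

```rust
/// Storage widths a Python Unicode object can use.
/// Hypothetical enum for illustration; not a type from `scalar.rs`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Width {
    Ascii,
    OneByte,
    TwoByte,
    FourByte,
}

/// Pick the narrowest width by decoding the string and taking the
/// largest code point. An empty string is treated as ASCII.
pub fn width_by_decoding(s: &str) -> Width {
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    match max {
        0..=0x7F => Width::Ascii,       // fits in 7 bits
        0x80..=0xFF => Width::OneByte,  // Latin-1 range
        0x100..=0xFFFF => Width::TwoByte, // Basic Multilingual Plane
        _ => Width::FourByte,           // supplementary planes
    }
}
```

For example, `width_by_decoding("caf\u{E9}")` yields `Width::OneByte`, while a string containing U+1F980 yields `Width::FourByte`.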

Detailed Documentation

Imports

use crate::str::pyunicode_new::{
    pyunicode_ascii, pyunicode_fourbyte, pyunicode_onebyte, pyunicode_twobyte,
};

Imports specific Unicode constructors that create Python Unicode objects of different internal representations.

Function: `str_impl_kind_scalar`

#[inline(never)]
pub(crate) unsafe fn str_impl_kind_scalar(buf: &str) -> *mut pyo3_ffi::PyObject

Description

Converts a Rust string slice (`&str`) to a Python Unicode object pointer (`*mut PyObject`) using a scalar (non-SIMD, one byte at a time) approach. It selects the narrowest Unicode kind that can represent the string's content.

Parameters

buf: &str
The Rust string slice to convert.

Returns

*mut pyo3_ffi::PyObject
A raw pointer to a Python Unicode object representing the input string.

Safety

The function is marked unsafe because it calls into the Python C API through FFI and performs unsafe operations such as raw pointer dereferencing; returning a raw pointer is not by itself unsafe.
The caller must ensure the GIL (Global Interpreter Lock) is held and must manage the lifetime of the returned Python object correctly.

Implementation Details

Count Characters:
Uses bytecount::num_chars to count the number of Unicode scalar values (characters) in the byte slice derived from buf.
ASCII Shortcut:
If the byte length equals the character count (meaning all characters are single-byte ASCII), it immediately calls pyunicode_ascii.
Non-ASCII Handling:
If not ASCII, the function:
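- States that the string length is greater than zero via the assume! macro. This is presumably a hint to the optimizer rather than a runtime branch; a non-ASCII string is necessarily non-empty.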
- Checks the first byte; if it is greater than 239, it immediately treats the string as containing four-byte characters, returning pyunicode_fourbyte.
- Otherwise, iterates over all bytes in buf:
  - Sets is_four if any byte is > 239.
  - Sets not_latin if any byte is > 195.
- Returns the corresponding Unicode object based on these flags:
  - pyunicode_fourbyte if is_four is true.
  - pyunicode_twobyte if not_latin is true.
  - Otherwise, pyunicode_onebyte.

For valid UTF-8, this check selects the Unicode kind without decoding the string: the largest lead byte bounds the largest code point, so comparing raw bytes against two thresholds is enough to infer the required character width.
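
The steps above can be sketched in safe Rust without any Python dependency. This is a minimal sketch under stated assumptions: the `Kind` enum and `classify_scalar` are hypothetical names that return a label instead of calling the `pyunicode_*` constructors, and `buf.chars().count()` stands in for `bytecount::num_chars`.

```rust
/// Label for the constructor that would be chosen.
/// Hypothetical enum for illustration; not part of `scalar.rs`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Kind {
    Ascii,
    OneByte,
    TwoByte,
    FourByte,
}

/// Mirror the described selection logic on raw UTF-8 bytes.
pub fn classify_scalar(buf: &str) -> Kind {
    let bytes = buf.as_bytes();

    // Stand-in for `bytecount::num_chars` (assumption: same result
    // on valid UTF-8, which `&str` guarantees).
    let num_chars = buf.chars().count();

    // ASCII shortcut: one byte per character means every byte is < 128.
    // The empty string also takes this path.
    if bytes.len() == num_chars {
        return Kind::Ascii;
    }

    // Non-ASCII implies at least one byte, so indexing [0] cannot panic.
    if bytes[0] > 239 {
        return Kind::FourByte;
    }

    // Single pass over all bytes, accumulating both flags.
    let mut is_four = false;
    let mut not_latin = false;
    for &b in bytes {
        is_four |= b > 239;
        not_latin |= b > 195;
    }

    if is_four {
        Kind::FourByte
    } else if not_latin {
        Kind::TwoByte
    } else {
        Kind::OneByte
    }
}
```

With this sketch, `classify_scalar("Hello, 世界!")` returns `Kind::TwoByte`: the string is not ASCII, no byte exceeds 239, and the lead bytes of both CJK characters exceed 195.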

Usage Example

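unsafe {
    // The caller must hold the GIL for the duration of this call.
    let rust_str = "Hello, 世界!";
    // Not ASCII; lead bytes exceed 195 but none exceeds 239,
    // so the two-byte constructor is selected.
    let py_unicode_ptr = str_impl_kind_scalar(rust_str);
    // py_unicode_ptr can now be used in Python FFI calls
}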

Important Implementation Notes

The function inspects raw bytes directly, including reads through a raw pointer, for performance reasons.
The unlikely! macro hints to the compiler that a branch is rarely taken, so the common path can be laid out first.
The function is marked #[inline(never)] to prevent inlining, possibly to reduce code bloat or to isolate unsafe code.
The function assumes that the input string is valid UTF-8 (guaranteed by Rust &str type).
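The byte comparisons (> 239, > 195) correspond to UTF-8 lead-byte ranges, which bound the code point a sequence can encode:
- Bytes > 239 (0xF0 and above) start 4-byte UTF-8 sequences, which encode characters outside the Basic Multilingual Plane (U+10000 and above).
- Bytes > 195 (0xC4 and above) start sequences that encode U+0100 and above, which do not fit in the one-byte (Latin-1) representation. The lead bytes 0xC2 and 0xC3 (194 and 195) encode U+0080 to U+00FF, and continuation bytes (128 to 191) never exceed 195, so neither sets a flag.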
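
The two thresholds can be checked against the UTF-8 encoding of code points on either side of each boundary. The helper below is a hypothetical illustration (`lead_byte` is not part of `scalar.rs`); it uses only the standard library's `char::encode_utf8`.

```rust
/// Return the first (lead) byte of a character's UTF-8 encoding.
/// Hypothetical helper for illustration only.
pub fn lead_byte(c: char) -> u8 {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes()[0]
}
```

U+00FF has lead byte 0xC3 (195) and U+0100 has lead byte 0xC4 (196), which is the Latin-1 boundary. U+FFFF has lead byte 0xEF (239) and U+10000 has lead byte 0xF0 (240), which is the boundary of the Basic Multilingual Plane.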

Interaction with Other Parts of the System

pyunicode_new Module:
This function calls into the pyunicode_new module’s functions to create Python Unicode objects. These functions wrap the Python C API calls that allocate and initialize Unicode objects with different internal memory layouts.
Python FFI Layer:
The returned *mut pyo3_ffi::PyObject is a raw pointer to a Python object, used by the PyO3 library or other parts of the Rust-Python bridge to pass strings into the Python runtime.
String Handling Pipeline:
This function is a lower-level utility that may be invoked by higher-level string conversion functions when bridging Rust to Python, particularly when no SIMD optimizations are applicable (hence "scalar").
Byte Counting Dependency:
The function relies on the bytecount crate's num_chars function to count the UTF-8 characters in a byte slice quickly, without decoding them.
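
Counting characters without decoding is possible because every UTF-8 character has exactly one byte that is not a continuation byte. The following is a naive scalar sketch of that idea, not the bytecount crate's implementation, which is more heavily optimized; `count_chars_scalar` is a hypothetical name.

```rust
/// Count UTF-8 characters by counting bytes that are NOT continuation
/// bytes. Continuation bytes have the bit pattern 10xxxxxx.
/// Assumes the input is valid UTF-8.
pub fn count_chars_scalar(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}
```

For `"Hello, 世界!"` this counts 10 characters across 14 bytes, so the byte length and the character count differ and the ASCII shortcut is not taken.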

Visual Diagram

flowchart TD
    A["Input Rust &str"] --> B["Count UTF-8 chars (bytecount::num_chars)"]
    B --> C{Is ASCII?}
    C -- Yes --> D[pyunicode_ascii]
    C -- No --> E{First byte > 239?}
    E -- Yes --> F[pyunicode_fourbyte]
    E -- No --> G[Iterate bytes]
    G --> H{Any byte > 239?}
    H -- Yes --> F[pyunicode_fourbyte]
    H -- No --> I{Any byte > 195?}
    I -- Yes --> J[pyunicode_twobyte]
    I -- No --> K[pyunicode_onebyte]
    D --> L[Return PyObject*]
    F --> L
    J --> L
    K --> L

Summary

`scalar.rs` provides one utility function, `str_impl_kind_scalar`, that converts Rust UTF-8 strings into Python Unicode objects by determining the appropriate internal Unicode representation from byte-level checks rather than full decoding. It depends on the `pyunicode_new` constructors and the bytecount crate, and serves as the non-SIMD path for passing strings from Rust to Python.