wcwidth.py

Overview

The `wcwidth.py` module provides utility functions to measure the display width of Unicode characters and strings in terminal environments. This is critical for applications that require precise alignment of text output, such as command-line interfaces, terminal user interfaces, and text editors.

The module exposes two main functions:

- `wcwidth(c: str) -> int`: computes the column width of a single Unicode character.
- `wcswidth(s: str) -> int`: computes the total column width of a Unicode string.

Widths are returned according to terminal display conventions:

- -1 if the character is a non-printable control character, or if the string contains one.
- 0 for zero-width characters (e.g., combining marks).
- 1 for most printable characters.
- 2 for East Asian wide/fullwidth characters.

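The module determines these widths from Unicode character properties, read through the standard-library `unicodedata` module, and caches per-character results with `functools.lru_cache`.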

Detailed Explanation

Function: `wcwidth`

@lru_cache(100)
def wcwidth(c: str) -> int:
    """Determine how many columns are needed to display a character in a terminal.

    Returns -1 if the character is not printable.
    Returns 0, 1 or 2 for other characters.
    """

**Purpose:** Calculates the number of terminal column cells required to display the character `c`.

**Parameters:**

c (str): A single Unicode character (string of length 1).

**Returns:**

int: The display width of c in columns.
- -1 if the character is non-printable/control.
- 0 for zero-width characters (e.g., combining marks).
- 1 for regular width characters.
- 2 for East Asian fullwidth or wide characters.

**Implementation details:**

- Uses `ord(c)` to get the Unicode code point.
- Takes a fast path for printable ASCII characters (code points 0x20 to 0x7E), returning width 1.
- Checks a set of special zero-width characters explicitly by code point range, returning width 0.
- Uses `unicodedata.category` to identify control characters (`Cc`, width -1) and combining marks (`Me`, `Mn`, width 0).
- Uses `unicodedata.east_asian_width` to identify East Asian fullwidth (`F`) and wide (`W`) characters, which take two columns.
- Caches results with `functools.lru_cache` (cache size 100).
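
The checks listed above can be sketched as follows. This is a minimal illustration, not the module's actual source: the name `sketch_wcwidth` and the specific zero-width ranges are assumptions, since the text names only U+200B as an example of the explicitly checked code points.

```python
import unicodedata
from functools import lru_cache

# Assumed zero-width ranges, for illustration only; the documentation
# names U+200B as an example but does not list the full set.
ASSUMED_ZERO_WIDTH_RANGES = (
    (0x200B, 0x200F),  # zero-width space/joiners and directional marks
    (0x2060, 0x2063),  # word joiner and invisible operators
)


@lru_cache(100)
def sketch_wcwidth(c: str) -> int:
    """Hypothetical re-creation of the per-character width check."""
    o = ord(c)  # raises TypeError unless c is exactly one character

    # Fast path: printable ASCII always occupies one column.
    if 0x20 <= o <= 0x7E:
        return 1

    # Explicit zero-width code points (assumed ranges, see above).
    if any(lo <= o <= hi for lo, hi in ASSUMED_ZERO_WIDTH_RANGES):
        return 0

    category = unicodedata.category(c)
    if category == "Cc":          # control characters are not printable
        return -1
    if category in ("Me", "Mn"):  # enclosing and nonspacing combining marks
        return 0

    # East Asian Fullwidth (F) and Wide (W) characters take two columns.
    if unicodedata.east_asian_width(c) in ("F", "W"):
        return 2
    return 1
```

The order of the checks matters in this sketch: the explicit range test runs before the category lookup, so a code point in those ranges is reported as zero-width whatever its general category is.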

**Usage example:**

print(wcwidth('a'))     # Output: 1
print(wcwidth('あ'))    # Output: 2 (Hiragana character)
print(wcwidth('\u0301')) # Output: 0 (Combining acute accent)
print(wcwidth('\x07'))   # Output: -1 (Bell control character)

Function: `wcswidth`

def wcswidth(s: str) -> int:
    """Determine how many columns are needed to display a string in a terminal.

    Returns -1 if the string contains non-printable characters.
    """

**Purpose:** Computes the total terminal column width required to display the entire string `s`.

**Parameters:**

s (str): A Unicode string.

**Returns:**

int: Total display width in columns.
- Returns -1 if any character in the string is non-printable (width -1).
- Otherwise, sum of widths of all characters.

**Implementation details:**

Normalizes the string to Unicode Normalization Form C (NFC), so that a base character followed by combining marks is composed into a single code point wherever a precomposed form exists.
Iterates over each character in the normalized string.
Calls wcwidth on each character.
If any character has width -1, returns -1.
Otherwise, accumulates widths and returns the total.
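
A minimal sketch of the loop described above, under stated assumptions: `sketch_char_width` is a hypothetical, simplified stand-in for `wcwidth` (it omits the explicit zero-width code point ranges), and `sketch_wcswidth` is an illustrative name rather than the module's own function.

```python
import unicodedata


def sketch_char_width(c: str) -> int:
    """Simplified stand-in for wcwidth (no explicit zero-width ranges)."""
    if 0x20 <= ord(c) <= 0x7E:
        return 1
    category = unicodedata.category(c)
    if category == "Cc":
        return -1
    if category in ("Me", "Mn"):
        return 0
    return 2 if unicodedata.east_asian_width(c) in ("F", "W") else 1


def sketch_wcswidth(s: str) -> int:
    """Sum per-character widths after NFC normalization.

    Returns -1 as soon as any character reports width -1.
    """
    total = 0
    for c in unicodedata.normalize("NFC", s):
        width = sketch_char_width(c)
        if width < 0:
            return -1  # one non-printable character invalidates the string
        total += width
    return total
```

Returning early on the first width of -1 means the rest of the string is never inspected, which matches the "any character" rule in the Returns section.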

**Usage example:**

print(wcswidth("hello"))          # Output: 5
print(wcswidth("コンニチハ"))       # Output: 10 (Each character width 2)
print(wcswidth("a\u0301"))        # Output: 1 ('a' + combining acute accent)
print(wcswidth("hello\x07world")) # Output: -1 (Bell character inside string)

Important Implementation Details and Algorithms

Caching: `@lru_cache(100)` on `wcwidth` lets repeated characters skip the Unicode property lookups. The cache holds only the 100 most recently used characters, so the benefit is largest for text drawn from a small set of distinct characters.
Unicode Property Checks:
The module relies on Python's built-in unicodedata module for:
- Character category (unicodedata.category), e.g., control characters (Cc), combining marks (Me, Mn).
- East Asian Width (unicodedata.east_asian_width), identifying wide/fullwidth characters needing two columns.
Special Zero-Width Characters:
Explicit code point checks handle certain zero-width formatting characters, such as the zero-width space (U+200B).
Normalization:
The wcswidth function normalizes strings to NFC form to ensure consistency in display width calculations, especially for composed characters.
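
The caching and normalization behaviour described above can be observed directly with the standard library. In this small sketch, `cached_category` is a hypothetical function standing in for a cached width lookup; only `functools.lru_cache` and `unicodedata` are real APIs.

```python
import unicodedata
from functools import lru_cache


@lru_cache(100)
def cached_category(c: str) -> str:
    """Hypothetical cached lookup standing in for a cached width function."""
    return unicodedata.category(c)


# Four calls over two distinct characters: 'a' misses once and then hits twice,
# 'b' misses once.
for ch in "aaab":
    cached_category(ch)

info = cached_category.cache_info()
print(info.hits, info.misses, info.currsize)  # 2 2 2

# NFC composes 'a' + U+0301 into the single code point U+00E1.
composed = unicodedata.normalize("NFC", "a\u0301")
print(len(composed), composed == "\u00e1")  # 1 True

# No precomposed 'q with acute' exists, so the sequence stays two code points;
# the combining mark then contributes width 0 on its own.
uncomposed = unicodedata.normalize("NFC", "q\u0301")
print(len(uncomposed))  # 2
```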

Interaction with Other System Components

This module is a utility component typically used by terminal-based applications, such as:
- Text editors that need to align columns or cursor positions.
- Command-line interfaces (CLIs) that display tabular or formatted data.
- Terminal UI frameworks for layout calculation.
The module is standalone: it does not depend on or modify any global state other than its own cache, so it can be imported as a helper wherever precise control over text display width is required.
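
As an illustration of the column-alignment use case, the sketch below pads strings to a fixed number of terminal cells. Both helpers are hypothetical: `approx_width` is a simplified stand-in for `wcswidth` (it ignores control characters and the special zero-width ranges) so that the example runs on its own, and `pad_to_columns` is not part of the module.

```python
import unicodedata


def approx_width(s: str) -> int:
    """Simplified stand-in for wcswidth; no control-character handling."""
    total = 0
    for c in unicodedata.normalize("NFC", s):
        if unicodedata.category(c) in ("Me", "Mn"):
            continue  # combining marks occupy no column
        total += 2 if unicodedata.east_asian_width(c) in ("F", "W") else 1
    return total


def pad_to_columns(s: str, columns: int) -> str:
    """Right-pad s with spaces so that it occupies `columns` terminal cells."""
    return s + " " * max(0, columns - approx_width(s))


# Both rows place the separator in the same terminal column.
for name, qty in [("apple", 3), ("りんご", 12)]:
    print(pad_to_columns(name, 10) + "|" + str(qty))
```

`str.ljust` would not work here because it counts code points rather than terminal cells: it would pad "りんご" with seven spaces instead of four.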

Mermaid Diagram

Below is a flowchart representing the main functions and their relationships in `wcwidth.py`.

flowchart TD
    A[Start: Input character or string]

    subgraph Single Character Width
        direction TB
        B["wcwidth(c)"]
        B --> C{Is c ASCII printable?}
        C -- Yes --> D[Return 1]
        C -- No --> E{Is c zero-width special char?}
        E -- Yes --> F[Return 0]
        E -- No --> G{Is c control character?}
        G -- Yes --> H[Return -1]
        G -- No --> I{Is c combining mark?}
        I -- Yes --> F
        I -- No --> J{Is c East Asian Wide/Fullwidth?}
        J -- Yes --> K[Return 2]
        J -- No --> D
    end

    subgraph String Width Calculation
        direction TB
        L["wcswidth(s)"]
        L --> M[Normalize s with NFC]
        M --> N[For each character c in s]
        N --> B
        B --> O{"wcwidth(c) >= 0?"}
        O -- No --> P[Return -1]
        O -- Yes --> Q[Accumulate total width]
        Q --> R{More characters?}
        R -- Yes --> N
        R -- No --> S[Return total width]
    end

Summary

The `wcwidth.py` module provides Unicode-aware functions that measure the number of terminal columns required to display characters and strings. By combining Unicode property lookups with a small per-character cache, it produces the width measurements needed for terminal text layout and alignment. Its two-function interface and reliance on the Python standard library alone make it easy to integrate into terminal-based applications that require precise text formatting.