Migrate from Python re to regex module to improve unicode compatibility #19977

@LeonarddeR

Related issues, PRs or discussions

What is the current state of the codebase?

Current Implementation:

  • NVDA speech dictionaries use Python's built-in re module for text pattern matching
  • Located in: nvda/speech/dictionaries/ and related speech processing modules
  • Current usage patterns:
    • Symbol substitution in symbols.dic processing
    • Complex punctuation handling across multiple languages
    • Unicode text normalization and matching

Limitations of current re module:

  • Limited Unicode property support (e.g., \p{L} not available)
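  • Named capture groups and \g<name> backreferences in replacements work, but a group name cannot be reused within a pattern and only the last capture of a repeated group is kept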
  • Inconsistent behavior across Unicode categories
  • No support for recursive patterns
  • Lookbehind assertions must be fixed-width (lookahead assertions are not restricted)
  • Performance issues with complex patterns on large text
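
Several of the limitations above can be observed with the standard library alone. The following is a minimal sketch; the pattern table and the helper name rejected_by_re are illustrative and not part of NVDA. It assumes a CPython version in which re does not accept these constructs, which is the case at the time of writing.

```python
import re

# Patterns using constructs that the stdlib re module does not accept.
# The dictionary and its keys are illustrative only.
UNSUPPORTED_BY_RE = {
    "unicode property": r"\p{L}+",
    "variable-width lookbehind": r"(?<=a+)b",
    "recursive pattern": r"\((?:[^()]|(?R))*\)",
}


def rejected_by_re(pattern):
    """Return True if the stdlib re module refuses to compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return True
    return False


results = {name: rejected_by_re(p) for name, p in UNSUPPORTED_BY_RE.items()}

for name, rejected in results.items():
    print(f"{name}: {'rejected' if rejected else 'accepted'} by re")
```

A variable-width lookahead such as a(?=b+) compiles without error under re, so only lookbehind is affected by the fixed-width rule.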

Current impact on users:

  • Improper handling of non-Latin scripts (Arabic, Cyrillic, CJK)
  • Limited symbol substitution flexibility
  • Reduced effectiveness of speech dictionary rules

Why are changes required?

  1. Multilingual Support: Enhanced Unicode property support improves handling of diverse languages
  2. Rule Expressiveness: Dictionary maintainers can write more powerful and maintainable rules
  3. Performance: The regex module offers optimizations for complex patterns
  4. Maintenance: Cleaner regex syntax reduces dictionary maintenance burden

What technical changes are required?

  1. Add the regex module as a dependency: uv add regex
  2. Use it as a largely drop-in replacement: import regex as re. Alternatively, migrate explicitly by importing regex under its own name and updating each call site.
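
The import swap described above could look roughly like the sketch below. The fallback to the standard library, the HAVE_REGEX flag, and the letter_runs helper are assumptions made for illustration; the proposal itself does not specify a fallback. The stdlib character class used in the fallback only approximates \p{L}.

```python
import re as _stdlib_re

try:
    # Third-party module proposed in this issue (installed via: uv add regex).
    import regex as re
    HAVE_REGEX = True
except ImportError:
    # Hypothetical fallback so the sketch still runs without the dependency.
    re = _stdlib_re
    HAVE_REGEX = False

# \p{L} matches any Unicode letter under regex. The stdlib alternative
# [^\W\d_] ("word character that is neither a digit nor underscore")
# is only an approximation of the same idea.
LETTER_RUN = re.compile(r"\p{L}+" if HAVE_REGEX else r"[^\W\d_]+")


def letter_runs(text):
    """Return the runs of letters found in text (illustrative helper)."""
    return LETTER_RUN.findall(text)


print(letter_runs("Привет, мир 42!"))
```

Keeping the alias re means existing call sites stay untouched, at the cost of hiding which engine is in use. An explicit import regex makes the dependency visible at each call site.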

Are the proposed technical changes API breaking?

  1. Regex Syntax Changes

    • Some advanced regex features may behave differently
  2. Error Messages

    • Different error messages and exception types in edge cases
    • Could impact error handling in try/except blocks relying on specific messages
  3. Unicode Behavior

    • re.UNICODE flag behavior differs subtly
    • Some Unicode edge cases handled differently (usually better)
    • This is likely beneficial for non-English speech output
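
To illustrate the exception-type concern: to my understanding the regex module raises its own regex.error class, so an existing except re.error clause may not catch it. A defensive sketch follows, assuming a hypothetical helper named compile_or_none. It catches both types and runs whether or not regex is installed.

```python
import re

try:
    import regex  # third-party; optional for this sketch
    PATTERN_ERRORS = (re.error, regex.error)
except ImportError:
    regex = None
    PATTERN_ERRORS = (re.error,)


def compile_or_none(pattern, engine=None):
    """Compile pattern with the given engine, or return None if it is invalid.

    engine defaults to regex when available, otherwise the stdlib re.
    This helper is illustrative and not an existing NVDA function.
    """
    engine = engine or regex or re
    try:
        return engine.compile(pattern)
    except PATTERN_ERRORS:
        return None


print(compile_or_none("(") is None)      # unbalanced parenthesis is invalid
print(compile_or_none(r"a+") is not None)
```

Code that inspects the text of the error message rather than its type would still need to be reviewed case by case, since the wording differs between the two modules.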

Are there potential risks or issues with the proposed implementation?

Distribution/Testing Risk (Medium)

  • Issue: Requires testing across:
    • Multiple locales and languages
  • Mitigation:
    • Add multilingual test corpus

Behavior Divergence Risk (Low-Medium)

  • Issue: Subtle differences in Unicode handling could affect output
  • Mitigation:
    • Clear changelog of behavior changes
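
Both mitigations could be supported by a small divergence check that runs the same patterns through the current and the candidate engine and records any differing results. The corpus entries and the function name find_divergences below are placeholders chosen for illustration, not an existing NVDA test suite. A real corpus would be drawn from the shipped speech dictionaries.

```python
import re

try:
    import regex  # third-party candidate engine; optional for this sketch
except ImportError:
    regex = None

# Placeholder corpus of (pattern, text) pairs covering several scripts.
CORPUS = [
    (r"\w+", "Привет, мир"),
    (r"\w+", "日本語 テキスト"),
    (r"\d+", "order 456, item 7"),
    (r"\s+", "a  b\tc"),
]


def find_divergences(corpus, candidate):
    """Return entries where candidate.findall differs from the stdlib re."""
    diffs = []
    for pattern, text in corpus:
        expected = re.findall(pattern, text)
        actual = candidate.findall(pattern, text)
        if expected != actual:
            diffs.append((pattern, text, expected, actual))
    return diffs


if regex is not None:
    for pattern, text, expected, actual in find_divergences(CORPUS, regex):
        print(f"{pattern!r} on {text!r}: re={expected} regex={actual}")
```

Each reported difference is a candidate entry for the changelog of behavior changes, and the corpus can grow as translators report problem cases.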
