Related issues, PRs or discussions
What is the current state of the codebase?
Current Implementation:
- NVDA speech dictionaries use Python's built-in re module for text pattern matching
- Located in: nvda/speech/dictionaries/ and related speech processing modules
- Current usage patterns:
  - Symbol substitution in symbols.dic processing
  - Complex punctuation handling across multiple languages
  - Unicode text normalization and matching
Limitations of current re module:
- Limited Unicode property support (e.g., \p{L} not available)
- Named capture groups accept only the (?P<name>...) syntax, with \g<name> backreferences in replacements; the shorter (?<name>...) form is not supported
- Inconsistent behavior across Unicode categories
- No support for recursive patterns
- Lookbehind assertions must be fixed-width (lookahead is not restricted)
- Performance issues with complex patterns on large text
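A minimal sketch, using only the standard library, that probes some of the limitations listed above. The helper name compiles and the probe patterns are hypothetical illustrations and are not taken from NVDA's dictionaries.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Illustrative probes, one per limitation; none come from NVDA itself.
PROBES = {
    "unicode_property": r"\p{L}+",        # Unicode property class
    "variable_lookbehind": r"(?<=a+)b",   # variable-width lookbehind
    "recursion": r"\((?:[^()]|(?R))*\)",  # recursive pattern
    "named_group": r"(?P<word>\w+)",      # Python-style named group
}

support = {name: compiles(pattern) for name, pattern in PROBES.items()}

for name, ok in support.items():
    print(f"{name}: {'accepted' if ok else 'rejected'} by re")
```

On current CPython releases only the named_group probe is expected to compile. The regex module documents support for Unicode properties, variable-width lookbehind, and recursion, so the same probes could be rerun against it once the dependency is available.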
Current impact on users:
- Improper handling of non-Latin scripts (Arabic, Cyrillic, CJK)
- Limited symbol substitution flexibility
- Reduced effectiveness of speech dictionary rules
Why are changes required?
- Multilingual Support: Enhanced Unicode property support improves handling of diverse languages
- Rule Expressiveness: Dictionary maintainers can write more powerful and maintainable rules
- Performance: The regex module offers optimizations for complex patterns
- Maintenance: Cleaner regex syntax reduces dictionary maintenance burden
What technical changes are required?
- Add the regex module as a dependency: uv add regex
- Use it as a drop-in replacement: import regex as re. Alternatively, migrate explicitly by importing regex under its own name and updating call sites one at a time.
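The import swap described above might look like the following sketch. The fallback to the standard library, the HAVE_REGEX flag, and the letter_runs helper are assumptions added so the example runs with or without the dependency installed; the proposal itself makes regex a hard dependency.

```python
try:
    import regex as re  # third-party package, added with `uv add regex`
    HAVE_REGEX = True
except ImportError:
    import re  # standard-library fallback so this sketch still runs
    HAVE_REGEX = False

# \p{L} needs the regex module; [^\W\d_] is only a rough stdlib approximation.
LETTER_RUN = re.compile(r"\p{L}+" if HAVE_REGEX else r"[^\W\d_]+")


def letter_runs(text: str) -> list[str]:
    """Return each maximal run of letters in text, in order."""
    return LETTER_RUN.findall(text)


print(letter_runs("Привет, мир 42"))
```

Because the module is bound to the name re in both branches, existing call sites such as re.compile and re.sub keep working unchanged, which is the point of the drop-in approach.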
Are the proposed technical changes API breaking?
- Regex Syntax Changes
  - Some advanced regex features may behave differently
- Error Messages
  - Different error messages and exception types in edge cases
  - Could impact error handling in try/except blocks relying on specific messages
- Unicode Behavior
  - re.UNICODE flag behavior differs subtly
  - Some Unicode edge cases handled differently (usually better)
  - This is likely beneficial for non-English speech output
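One way to reduce the error-handling risk above is to catch the engine's exception class instead of comparing message text. The sketch below assumes a hypothetical validate_pattern helper; it is not an existing NVDA function. Both re and regex expose compile and an error exception class, so either module can be passed as engine.

```python
import re
from types import ModuleType


def validate_pattern(pattern: str, engine: ModuleType = re) -> tuple[bool, str]:
    """Compile pattern with the given engine and report success or failure.

    The exception class is looked up on the engine itself, so the check
    does not depend on the wording of any particular error message.
    """
    try:
        engine.compile(pattern)
    except engine.error as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


ok, message = validate_pattern("(unclosed")
print(ok, message)
```

Code written this way should keep working after the switch, whereas a check that compares the message string against a fixed text may not.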
Are there potential risks or issues with the proposed implementation?
Distribution/Testing Risk (Medium)
- Issue: Requires testing across:
  - Multiple locales and languages
- Mitigation:
  - Add multilingual test corpus
Behavior Divergence Risk (Low-Medium)
- Issue: Subtle differences in Unicode handling could affect output
- Mitigation:
  - Clear changelog of behavior changes
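The two mitigations above could be supported by a small comparison harness along these lines. The corpus samples, the RULES list, and the function names are hypothetical placeholders rather than real NVDA dictionary entries; in practice new_sub would be regex.sub, and any reported divergence would feed the changelog.

```python
import re
from typing import Callable

# A substitution function with the same call shape as re.sub(pattern, repl, text).
SubFn = Callable[[str, str, str], str]

# Tiny illustrative corpus; a real one would hold many samples per locale.
CORPUS = {
    "en": "Hello,   world!",
    "ru": "Привет,  мир!",
    "ar": "مرحبا  بالعالم",
    "zh": "你好，世界",
}

# Placeholder rules standing in for speech dictionary entries.
RULES = [(r"\s+", " "), (r"[!?]+", ".")]


def apply_rules(sub: SubFn, text: str) -> str:
    """Apply every rule in order using the given substitution function."""
    for pattern, replacement in RULES:
        text = sub(pattern, replacement, text)
    return text


def find_divergences(old_sub: SubFn, new_sub: SubFn) -> list[tuple[str, str, str]]:
    """Return (locale, old_output, new_output) for samples whose outputs differ."""
    diffs = []
    for locale, sample in CORPUS.items():
        old_out = apply_rules(old_sub, sample)
        new_out = apply_rules(new_sub, sample)
        if old_out != new_out:
            diffs.append((locale, old_out, new_out))
    return diffs


# Comparing an engine with itself should report nothing.
print(find_divergences(re.sub, re.sub))
```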