Migrate from Python `re` to `regex` module to improve Unicode compatibility

### Related issues, PRs or discussions
- #19517 
- Python's `re` module has limited Unicode support and lacks advanced regex features
- Speech dictionaries require robust pattern matching for multilingual text processing
- Enhancement request: improved regex capabilities for speech symbol processing (#13174)

### What is the current state of the codebase?

**Current Implementation:**
- NVDA speech dictionaries use Python's built-in `re` module for text pattern matching
- Located in: `nvda/speech/dictionaries/` and related speech processing modules
- Current usage patterns:
  - Symbol substitution in `symbols.dic` processing
  - Complex punctuation handling across multiple languages
  - Unicode text normalization and matching

**Limitations of current `re` module:**
- Limited Unicode property support (e.g., `\p{L}` not available)
- Named groups and `\g<name>` replacement backreferences work only with the `(?P<name>...)` spelling; the `(?<name>...)` form is rejected
- Inconsistent behavior across Unicode categories
- No support for recursive patterns
- Lookbehind assertions must be fixed-width (lookahead has no such restriction)
- Performance issues with complex patterns on large text
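A minimal sketch of how some of these limitations show up, using only the standard library. The helper name `compiles_with_stdlib_re` and the sample strings are made up for illustration and are not taken from the NVDA codebase.

```python
import re


def compiles_with_stdlib_re(pattern: str) -> bool:
    """Return True if the built-in `re` module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Unicode property escape: rejected by `re` as a bad escape.
print(compiles_with_stdlib_re(r"\p{L}+"))

# Variable-width lookbehind: rejected by `re`.
print(compiles_with_stdlib_re(r"(?<=\d+)px"))

# Fixed-width lookbehind: accepted by `re`.
print(compiles_with_stdlib_re(r"(?<=\d)px"))

# A common stdlib workaround for "any letter": \w minus digits and underscore.
LETTERS = re.compile(r"[^\W\d_]+")
print(LETTERS.findall("Привет мир 123"))
```

The workaround on the last lines is the kind of rule that a `\p{L}` property escape would make easier to read and maintain.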

**Current impact on users:**
- Improper handling of non-Latin scripts (Arabic, Cyrillic, CJK)
- Limited symbol substitution flexibility
- Reduced effectiveness of speech dictionary rules

### Why are changes required?

1. **Multilingual Support**: Enhanced Unicode property support improves handling of diverse languages
2. **Rule Expressiveness**: Dictionary maintainers can write more powerful and maintainable rules
3. **Performance**: The `regex` module may handle some complex patterns more efficiently; this should be confirmed with benchmarks on real dictionary rules
4. **Maintenance**: Cleaner regex syntax reduces dictionary maintenance burden

### What technical changes are required?

1. Add the `regex` module as a dependency: `uv add regex`
2. Use it as a drop-in replacement: `import regex as re`. Alternatively, migrate explicitly by importing `regex` under its own name and updating call sites one module at a time.
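One possible shape for the explicit route is a single helper that all dictionary code calls to compile patterns, so the engine can be swapped in one place. This is only a sketch: `compile_dictionary_pattern` and its `ignore_case` parameter are hypothetical names, and the fallback to `re` exists only so the snippet still runs where `regex` is not installed.

```python
import re as stdlib_re

try:
    import regex  # third-party dependency proposed in this issue
except ImportError:
    regex = None

# Pick the engine once; everything else goes through the helper below.
ENGINE = regex if regex is not None else stdlib_re


def compile_dictionary_pattern(pattern: str, ignore_case: bool = False):
    """Compile a dictionary rule with whichever engine is available.

    The flag is looked up on the chosen engine by name instead of being
    passed in as a number, so call sites never mix flag constants from
    the two modules.
    """
    flags = ENGINE.IGNORECASE if ignore_case else 0
    return ENGINE.compile(pattern, flags)


rule = compile_dictionary_pattern(r"\bnvda\b", ignore_case=True)
print(rule.sub("N V D A", "NVDA and nvda"))
```

The `regex` documentation describes a `VERSION0` mode, intended to be compatible with `re`, as the default, so a helper like this should not need version flags for existing rules.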

### Are the proposed technical changes API breaking?

1. **Regex Syntax Changes**
   - Some advanced regex features may behave differently

2. **Error Messages**
   - Different error messages and exception types in edge cases
   - Could impact error handling in `try/except` blocks relying on specific messages
   
3. **Unicode Behavior**
   - `re.UNICODE` flag behavior differs subtly
   - Some Unicode edge cases handled differently (usually better)
   - This is likely **beneficial** for non-English speech output
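To illustrate the error-handling point: as far as I can tell, `regex.error` is its own exception class, not `re.error`, so an existing `except re.error` would not catch it. Here is a sketch of validation code that tolerates both during a transition. `is_valid_pattern` is a hypothetical helper, not existing NVDA code.

```python
import re

try:
    import regex  # third-party; assumed to be installed after the migration
except ImportError:
    regex = None

# Catch the error types of whichever engines are importable.
PATTERN_ERRORS = (re.error,) if regex is None else (re.error, regex.error)
ENGINE = re if regex is None else regex


def is_valid_pattern(pattern: str) -> bool:
    """Return True if the active engine can compile the pattern."""
    try:
        ENGINE.compile(pattern)
    except PATTERN_ERRORS:
        return False
    return True


print(is_valid_pattern(r"\d+"))
print(is_valid_pattern("("))
```

Code that inspects the text of the error message, instead of only its type, would still need a manual review.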

### Are there potential risks or issues with the proposed implementation?

#### **Distribution/Testing Risk** (Medium)
- **Issue**: Requires testing across:
  - Multiple locales and languages
- **Mitigation**:
  - Add multilingual test corpus

#### **Behavior Divergence Risk** (Low-Medium)
- **Issue**: Subtle differences in Unicode handling could affect output
- **Mitigation**:
  - Clear changelog of behavior changes


Migrate from Python `re` to `regex` module to improve unicode compatibility #19977
