fix: limit analyzer output token positions during indexing #146804
salvatore-campagna wants to merge 6 commits into elastic:main from
Conversation
Add LimitTokenPositionAnalyzer that wraps the index analyzer and limits the number of token positions emitted per field value. Tokens at positions beyond the limit are silently discarded, preserving complete coverage per position. Applied at the analyzer level in IndexShard.buildIndexAnalyzer so it covers all tokenizers and token filters. Controlled by index.max_indexed_token_count (default MAX_VALUE, effectively no limit). Serverless can override the default to a lower value.
Remove the conditional check and always wrap the index analyzer with LimitTokenPositionAnalyzer. With the default MAX_VALUE the wrapper passes everything through, so every index uses the same code path.
| return new LimitTokenPositionAnalyzer(analyzer, mapperService.getIndexSettings().getMaxIndexedTokenCount()); |
The analyzer is always wrapped with LimitTokenPositionAnalyzer even when the limit is Integer.MAX_VALUE. This keeps every index on the same code path. An alternative would be to conditionally wrap only when the limit is below MAX_VALUE to avoid the per-token overhead of the filter's incrementToken call. Happy to change this if there are performance concerns.
Summary
This PR prevents OOM from excessive token generation during indexing by limiting the number of token positions an analyzer can produce per field value. Unlike PR #146497 and PR #146617, which target n-gram filters specifically, this PR applies the limit at the analyzer level, protecting against excessive tokens from any source.
The limit is based on token positions, preserving complete coverage per position (e.g., all n-grams of a word are either fully present or fully absent).
Problem
Any analyzer can produce a large number of tokens from a large input field, causing Lucene's in-memory posting structures to grow unbounded, potentially leading to OOM. Previous approaches in PR #146497 and PR #146617 targeted n-gram filters specifically, but the problem is more general.
Breaking change
When `index.max_indexed_token_count` is set below the default, tokens at positions beyond the limit are silently discarded and some text content may not be fully searchable. For stateful clusters, the default is no limit, so no existing behavior changes; users can opt in when they want the protection. For serverless, the default must be overridden to prevent OOM, which will affect users whose fields produce more tokens than the chosen limit.
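For illustration, opting in on an existing index could look like the following, assuming the setting updates like other dynamic index settings (the index name `my-index` and the value `10000` are made up for the example, not recommendations):

```console
PUT /my-index/_settings
{
  "index": {
    "max_indexed_token_count": 10000
  }
}
```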
How it works
`LimitTokenPositionAnalyzer` wraps the index analyzer in `IndexShard.buildIndexAnalyzer`. It counts input token boundaries using the `PositionIncrementAttribute` (each `positionIncrement > 0` is one boundary, regardless of the increment value). Tokens sharing a position (such as n-gram expansions or synonyms) pass through as long as the source position is within the limit.

Lucene provides `LimitTokenPositionFilter`, but it tracks cumulative position values. When stop filters remove tokens, they increase the position increment to preserve gaps, so those removed tokens still consume budget and the limit kicks in earlier than expected. Since the goal is to limit what gets stored in Lucene's posting structures, and stop words are already removed before reaching those structures, counting them wastes capacity. This wrapper counts actual token boundaries instead.

Changes
- Add `LimitTokenPositionAnalyzer`, which wraps any analyzer with position-based token limiting
- Add the `index.max_indexed_token_count` setting (default no limit, dynamic, index-scoped)
- Wrap the index analyzer in `IndexShard.buildIndexAnalyzer`

Backwards compatibility
No existing behavior changes on stateful clusters. Serverless can override the default in a separate PR.
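For illustration, the position-counting behavior described under "How it works" can be sketched in plain Java. This is a simplified model, not the actual Elasticsearch/Lucene implementation: the `Token` record, `limitByPosition`, and the sample n-gram stream are invented for the example, standing in for Lucene's `TokenStream` and `PositionIncrementAttribute`.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionLimitSketch {

    // A token plus its position increment: an increment > 0 starts a new
    // position; an increment of 0 stacks the token on the previous position
    // (as n-gram expansions or synonyms do).
    record Token(String term, int positionIncrement) {}

    // Keep tokens only while the number of position boundaries seen so far is
    // within the limit. Because the counter only advances on a boundary, all
    // tokens sharing a position are kept or dropped together.
    static List<String> limitByPosition(List<Token> tokens, int maxPositions) {
        List<String> kept = new ArrayList<>();
        int positions = 0;
        for (Token t : tokens) {
            if (t.positionIncrement() > 0) {
                positions++; // one boundary, regardless of the gap size
            }
            if (positions <= maxPositions) {
                kept.add(t.term());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // 2-grams of "quick brown", stacked at each word's position
        List<Token> tokens = List.of(
            new Token("qu", 1), new Token("ui", 0), new Token("ic", 0), new Token("ck", 0),
            new Token("br", 1), new Token("ro", 0), new Token("ow", 0), new Token("wn", 0)
        );
        // Limit of 1: every n-gram of "quick" survives, every n-gram of "brown" is dropped
        System.out.println(limitByPosition(tokens, 1)); // [qu, ui, ic, ck]
        // Limit of 2: all eight n-grams survive
        System.out.println(limitByPosition(tokens, 2));
    }
}
```

Note how a limit of 1 never splits the n-grams of a single word, matching the "fully present or fully absent" coverage property described in the summary.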
Test plan