
fix: limit analyzer output token positions during indexing #146804

Draft

salvatore-campagna wants to merge 6 commits into elastic:main from salvatore-campagna:fix/es-14629-analyzer-token-limit

Conversation

@salvatore-campagna salvatore-campagna commented Apr 20, 2026

Summary

This PR prevents OOM from excessive token generation during indexing by limiting the number of token positions an analyzer can produce per field value. Unlike PR #146497 and PR #146617, which target n-gram filters specifically, this PR applies the limit at the analyzer level, protecting against excessive token production from any source.

The limit is based on token positions, preserving complete coverage per position (e.g., all n-grams of a word are either fully present or fully absent).
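The per-position behavior can be illustrated with a small, self-contained sketch (not the actual Elasticsearch code; Token, posIncrement, and limitByPosition are hypothetical names). Tokens with a position increment of 0 stack onto the previous position, the way n-gram expansions or synonyms do, so they survive or drop as a group:

```java
import java.util.ArrayList;
import java.util.List;

public class PositionLimitDemo {
    // A token as emitted by an analyzer: term text plus position increment
    // (1 starts a new position, 0 stacks onto the previous one).
    record Token(String term, int posIncrement) {}

    // Keep only tokens whose position is within the limit; stacked tokens
    // at the same position are kept or dropped together.
    static List<String> limitByPosition(List<Token> tokens, int maxPositions) {
        List<String> kept = new ArrayList<>();
        int positions = 0;
        for (Token t : tokens) {
            if (t.posIncrement() > 0) {
                positions++; // one boundary per increment, whatever its value
            }
            if (positions <= maxPositions) {
                kept.add(t.term());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "quick brown" expanded into stacked 2-grams at each word position.
        List<Token> tokens = List.of(
            new Token("qu", 1), new Token("ui", 0), new Token("ic", 0), new Token("ck", 0),
            new Token("br", 1), new Token("ro", 0), new Token("ow", 0), new Token("wn", 0)
        );
        // With a limit of 1 position, every n-gram of "quick" survives and
        // every n-gram of "brown" is dropped.
        System.out.println(limitByPosition(tokens, 1)); // prints [qu, ui, ic, ck]
    }
}
```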

Problem

Any analyzer can produce a large number of tokens from a large input field, causing Lucene's in-memory posting structures to grow unbounded, potentially leading to OOM. Previous approaches in PR #146497 and PR #146617 targeted n-gram filters specifically, but the problem is more general.

Breaking change

When index.max_indexed_token_count is set below the default, tokens at positions beyond the limit are silently discarded, so some text content may not be fully searchable.

For stateful clusters, the default is no limit, so no existing behavior changes. Users can opt in when they want the protection. For serverless, the default must be overridden to prevent OOM, which will affect users whose fields produce more tokens than the chosen limit.
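Because the setting is dynamic, opting in on an existing index could look like the following (a sketch: the index name and the value 1000000 are arbitrary illustrations, not recommended defaults):

```shell
# Set a token-position limit on an existing index; the setting is dynamic,
# so no close/reopen is needed. The value here is purely illustrative.
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.max_indexed_token_count": 1000000 }'
```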

How it works

LimitTokenPositionAnalyzer wraps the index analyzer in IndexShard.buildIndexAnalyzer. It counts input token boundaries using the PositionIncrementAttribute (each token with positionIncrement > 0 marks one boundary, regardless of the increment value). Tokens sharing a position (such as n-gram expansions or synonyms) pass through as long as the source position is within the limit.

Lucene provides LimitTokenPositionFilter, but it tracks cumulative position values. When a stop filter removes tokens, it increases the next token's position increment to preserve the gap, so the removed tokens still consume budget and the limit triggers earlier than expected. Since the goal is to limit what gets stored in Lucene's posting structures, and stop words are already removed before reaching those structures, counting them wastes capacity. This wrapper counts actual token boundaries instead.
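The difference between the two counting strategies can be sketched as follows (illustrative only, not the Lucene or Elasticsearch source; Token and the method names are hypothetical). After a stop filter removes both occurrences of "the" from "the quick the fox", the surviving tokens carry increments of 2 to preserve the gaps:

```java
import java.util.List;

public class CountingDemo {
    record Token(String term, int posIncrement) {}

    // Lucene-style cumulative counting: gaps left by removed stop words
    // still advance the position and consume budget.
    static int finalCumulativePosition(List<Token> tokens) {
        int pos = 0;
        for (Token t : tokens) {
            pos += t.posIncrement();
        }
        return pos;
    }

    // Boundary counting: each increment > 0 is one boundary and the size
    // of the gap is ignored, so only stored tokens consume budget.
    static int tokenBoundaries(List<Token> tokens) {
        int count = 0;
        for (Token t : tokens) {
            if (t.posIncrement() > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "the quick the fox" after a stop filter removed both "the"s.
        List<Token> stream = List.of(new Token("quick", 2), new Token("fox", 2));
        System.out.println(finalCumulativePosition(stream)); // prints 4
        System.out.println(tokenBoundaries(stream));         // prints 2
        // With a limit of 2, cumulative counting would drop "fox" (it lands
        // at position 4), while boundary counting keeps it (second boundary).
    }
}
```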

Changes

  • Add LimitTokenPositionAnalyzer that wraps any analyzer with position-based token limiting
  • Add index.max_indexed_token_count setting (default no limit, dynamic, index-scoped)
  • Always wrap the index analyzer in IndexShard.buildIndexAnalyzer
  • Classify the setting as non-replicated in CCR
  • Add docs

Backwards compatibility

No existing behavior changes on stateful clusters. Serverless can override the default in a separate PR.

Test plan

./gradlew :server:test --tests "*LimitTokenPositionAnalyzerTests*"
./gradlew :modules:analysis-common:internalClusterTest --tests "*MaxIndexedTokenCountIT*"
./gradlew :modules:analysis-common:yamlRestTest --tests "*.CommonAnalysisClientYamlTestSuiteIT.test {yaml=analysis-common/40_token_filters/max_indexed*}"
./gradlew :x-pack:plugin:ccr:test --tests "*TransportResumeFollowActionTests.testDynamicIndexSettingsAreClassified*"

Add LimitTokenPositionAnalyzer that wraps the index analyzer and
limits the number of token positions emitted per field value.
Tokens at positions beyond the limit are silently discarded,
preserving complete coverage per position. Applied at the
analyzer level in IndexShard.buildIndexAnalyzer so it covers all
tokenizers and token filters.

Controlled by index.max_indexed_token_count (default MAX_VALUE,
effectively no limit). Serverless can override the default to a
lower value.

Remove the conditional check and always wrap the index analyzer
with LimitTokenPositionAnalyzer. With the default MAX_VALUE the
wrapper passes everything through, so every index uses the same
code path.

github-actions bot commented Apr 20, 2026

🔍 Preview links for changed docs


github-actions bot commented Apr 20, 2026

✅ Vale Linting Results

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide.

To use Vale locally or report issues, refer to Elastic style guide for Vale.

@github-actions

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.


When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level


On this line in IndexShard.buildIndexAnalyzer:

return new LimitTokenPositionAnalyzer(analyzer, mapperService.getIndexSettings().getMaxIndexedTokenCount());

The analyzer is always wrapped with LimitTokenPositionAnalyzer even when the limit is Integer.MAX_VALUE. This keeps every index on the same code path. An alternative would be to conditionally wrap only when the limit is below MAX_VALUE to avoid the per-token overhead of the filter's incrementToken call. Happy to change this if there are performance concerns.
