fix: limit analyzer output token positions during indexing #146804
salvatore-campagna wants to merge 6 commits into elastic:main from
Conversation
Add LimitTokenPositionAnalyzer that wraps the index analyzer and limits the number of token positions emitted per field value. Tokens at positions beyond the limit are silently discarded, preserving complete coverage per position. Applied at the analyzer level in IndexShard.buildIndexAnalyzer so it covers all tokenizers and token filters. Controlled by index.max_indexed_token_count (default MAX_VALUE, effectively no limit). Serverless can override the default to a lower value.
Remove the conditional check and always wrap the index analyzer with LimitTokenPositionAnalyzer. With the default MAX_VALUE the wrapper passes everything through, so every index uses the same code path.
| return new LimitTokenPositionAnalyzer(analyzer, mapperService.getIndexSettings().getMaxIndexedTokenCount()); |
The analyzer is always wrapped with LimitTokenPositionAnalyzer even when the limit is Integer.MAX_VALUE. This keeps every index on the same code path. An alternative would be to conditionally wrap only when the limit is below MAX_VALUE to avoid the per-token overhead of the filter's incrementToken call. Happy to change this if there are performance concerns.
Summary
This PR prevents OOM from excessive token generation during indexing by limiting the number of token positions an analyzer can produce per field value. Unlike PR #146497 and PR #146617, which target n-gram filters specifically, this PR applies the limit at the analyzer level, protecting against excessive tokens from any source.
The limit is based on token positions, preserving complete coverage per position (e.g., all n-grams of a word are either fully present or fully absent).
Problem
Any analyzer can produce a large number of tokens from a large input field, causing Lucene's in-memory posting structures to grow unbounded, potentially leading to OOM. Previous approaches in PR #146497 and PR #146617 targeted n-gram filters specifically, but the problem is more general.
Breaking change
When `index.max_indexed_token_count` is set below the default, tokens at positions beyond the limit are silently discarded and some text content may not be fully searchable. For stateful clusters, the default is no limit, so no existing behavior changes; users can opt in when they want the protection. For serverless, the default must be overridden to prevent OOM, which will affect users whose fields produce more tokens than the chosen limit.
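For illustration, opting in on an existing index could look like the following, assuming the setting updates like other dynamic index settings (the index name `my-index` and the value `10000` are made up for the example, not recommendations):

```console
PUT /my-index/_settings
{
  "index": {
    "max_indexed_token_count": 10000
  }
}
```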
How it works
`LimitTokenPositionAnalyzer` wraps the index analyzer in `IndexShard.buildIndexAnalyzer`. It counts input token boundaries using the `PositionIncrementAttribute` (each `positionIncrement > 0` is one boundary, regardless of the increment value). Tokens sharing a position (such as n-gram expansions or synonyms) pass through as long as the source position is within the limit.

Lucene provides `LimitTokenPositionFilter`, but it tracks cumulative position values. When stop filters remove tokens, they increase the position increment to preserve gaps, so those removed tokens still consume budget and the limit kicks in earlier than expected. Since the goal is to limit what gets stored in Lucene's posting structures, and stop words are already removed before reaching those structures, counting them wastes capacity. This wrapper counts actual token boundaries instead.

Changes
- Add `LimitTokenPositionAnalyzer`, which wraps any analyzer with position-based token limiting
- Add the `index.max_indexed_token_count` setting (default no limit, dynamic, index-scoped)
- Wrap the index analyzer in `IndexShard.buildIndexAnalyzer`

Backwards compatibility
No existing behavior changes on stateful clusters. Serverless can override the default in a separate PR.
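For illustration, the position-counting behavior described under "How it works" can be sketched in plain Java. This is a simplified model, not the actual Elasticsearch/Lucene implementation: the `Token` record, `limitByPosition`, and the sample n-gram stream are invented for the example, standing in for Lucene's `TokenStream` and `PositionIncrementAttribute`.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionLimitSketch {

    // A token plus its position increment: an increment > 0 starts a new
    // position; an increment of 0 stacks the token on the previous position
    // (as n-gram expansions or synonyms do).
    record Token(String term, int positionIncrement) {}

    // Keep tokens only while the number of position boundaries seen so far is
    // within the limit. Because the counter only advances on a boundary, all
    // tokens sharing a position are kept or dropped together.
    static List<String> limitByPosition(List<Token> tokens, int maxPositions) {
        List<String> kept = new ArrayList<>();
        int positions = 0;
        for (Token t : tokens) {
            if (t.positionIncrement() > 0) {
                positions++; // one boundary, regardless of the gap size
            }
            if (positions <= maxPositions) {
                kept.add(t.term());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // 2-grams of "quick brown", stacked at each word's position
        List<Token> tokens = List.of(
            new Token("qu", 1), new Token("ui", 0), new Token("ic", 0), new Token("ck", 0),
            new Token("br", 1), new Token("ro", 0), new Token("ow", 0), new Token("wn", 0)
        );
        // Limit of 1: every n-gram of "quick" survives, every n-gram of "brown" is dropped
        System.out.println(limitByPosition(tokens, 1)); // [qu, ui, ic, ck]
        // Limit of 2: all eight n-grams survive
        System.out.println(limitByPosition(tokens, 2));
    }
}
```

Note how a limit of 1 never splits the n-grams of a single word, matching the "fully present or fully absent" coverage property described in the summary.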
Test plan