# Retrieval Quality Benchmarks

Tracks retrieval quality and latency across the three built-in model profiles (light, medium, full). Run with `scripts/benchmark_profiles.py`.


## Latest results — 2026-03-08 (Apple M2, MacBook Air 2022)

- Hardware: Apple M2 (8-core CPU, 8-core GPU), MacBook Air 2022
- Python: 3.12
- Query set: 12 queries across all five documentation sources (see Query set below)

Note: The full profile (BAAI/bge-m3, ~1.5 GB weights) was intentionally skipped on this machine. During indexing, bge-m3 materialises intermediate attention tensors in float32 on CPU (MPS lacks full bge-m3 op support), temporarily consuming 5–10× the model's static size and pushing the M2's unified memory into heavy swap. The benchmark already shows medium outperforms full on this English-only corpus, so there is no quality benefit to running it here.

## Summary

| Profile | Embed model | Dim | HR@5 | MRR@5 | Avg latency | Est. RAM |
|---|---|---|---|---|---|---|
| light | BAAI/bge-small-en-v1.5 | 384 | 100% | 0.933 | 1.25 s | ~200 MB |
| medium | BAAI/bge-base-en-v1.5 | 768 | 100% | 0.938 | 1.31 s | ~600 MB |
| full | BAAI/bge-m3 | 1024 | — | — | — | ~1 800 MB |

full was skipped on this machine (OOM risk on the M2; see note above). Both runnable profiles use the reranker cross-encoder/ms-marco-MiniLM-L-6-v2.

## Observations

  1. Both runnable profiles hit 100% HR@5. Same result as the CUDA machine — the index quality is hardware-independent.

  2. medium again has the best MRR@5 (0.938). Matches the CUDA result exactly.

  3. Latency is ~1.25–1.31 s per query on M2. This is comparable to the CUDA baseline (1.04–1.19 s) despite being CPU-only; the M2's unified memory bandwidth keeps query latency competitive for the light and medium models.

  4. full should be run on the Windows machine (4070 Super) where VRAM handles tensor expansion off the main memory bus. The lancedb_full index should live there; the M2 only needs lancedb_medium.

## Recommendation (M2 / Apple Silicon)

| Scenario | Recommended profile |
|---|---|
| Memory-constrained (< 1 GB unified memory available) | light |
| Standard usage on M2 | medium (best MRR, safe memory footprint) |
| Non-English docs or multilingual queries | full, on a CUDA machine only |

## Previous results — 2026-03-08 (NVIDIA CUDA)

- Hardware: NVIDIA GPU (CUDA)
- Python: 3.12.10
- Query set: 12 queries across all five documentation sources (see Query set below)

## Summary

| Profile | Embed model | Dim | HR@5 | MRR@5 | Avg latency | Est. RAM |
|---|---|---|---|---|---|---|
| light | BAAI/bge-small-en-v1.5 | 384 | 100% | 0.933 | 1.04 s | ~200 MB |
| medium | BAAI/bge-base-en-v1.5 | 768 | 100% | 0.938 | 1.19 s | ~600 MB |
| full | BAAI/bge-m3 | 1024 | 100% | 0.889 | 4.58 s | ~1 800 MB |

Each profile pairs its embedder with a reranker: cross-encoder/ms-marco-MiniLM-L-6-v2 for light and medium, BAAI/bge-reranker-base for full.
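The retrieve-then-rerank flow these profiles share can be sketched as follows. The scoring function below is a stand-in for illustration only; the real benchmark loads a cross-encoder such as ms-marco-MiniLM-L-6-v2, which scores each (query, chunk) pair jointly.

```python
# Sketch of the two-stage pipeline: an ANN search returns candidate chunks,
# then a reranker re-scores (query, chunk) pairs and reorders them.

def rerank(query, candidates, score_fn, top_k=5):
    """Re-score ANN candidates with a cross-encoder-style pair scorer."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

def overlap_score(query, text):
    """Stand-in scorer: token overlap (a real cross-encoder scores jointly)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)

candidates = [
    "Use plesk bin subscription --create to add a subscription.",
    "TLS certificates are managed under SSL/TLS settings.",
    "The subscription add command creates a new subscription via cli",
]
top = rerank("create a new subscription via CLI", candidates, overlap_score, top_k=2)
print(top[0])  # the third candidate overlaps the query most, so it ranks first
```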


## Metrics

| Metric | Definition |
|---|---|
| HR@5 (Hit Rate at 5) | Fraction of queries where at least one relevant chunk appears in the top-5 results. A query is a "hit" if any of its relevant substrings appear (case-insensitive) in any result text. |
| MRR@5 (Mean Reciprocal Rank at 5) | Average of 1/(rank of first hit) across queries. A first hit at rank 1 scores 1.0; at rank 2, 0.5; no hit scores 0.0. Measures how high up relevant content appears, not just whether it appears. |
| Avg latency | Wall-clock time per query, including ANN search and reranking, measured on the benchmark host. |
| Est. RAM | Approximate resident-set-size increase from loading the embedding model and reranker, as reported by the model profile definition. |
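Both quality metrics follow directly from these definitions; a minimal sketch:

```python
# HR@5 and MRR@5 exactly as defined above: a "hit" is a result whose text
# contains any relevant substring (case-insensitive); ranks are 1-based.

def first_hit_rank(results, relevant, k=5):
    for rank, text in enumerate(results[:k], start=1):
        if any(sub.lower() in text.lower() for sub in relevant):
            return rank
    return None  # no hit in the top k

def hr_and_mrr(queries, k=5):
    """queries: list of (results, relevant_substrings) pairs."""
    ranks = [first_hit_rank(results, relevant, k) for results, relevant in queries]
    hr = sum(r is not None for r in ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks if r is not None) / len(ranks)
    return hr, mrr

queries = [
    (["pm_Config stores settings", "other text"], ["pm_Config"]),  # hit at rank 1
    (["misc", "use plesk repair to restart"], ["plesk repair"]),   # hit at rank 2
    (["nothing relevant here"], ["registerPage"]),                 # no hit
]
print(hr_and_mrr(queries))  # HR = 2/3; MRR = (1 + 1/2 + 0) / 3 = 0.5
```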

## Observations

  1. All three profiles hit 100% HR@5. Every query has at least one relevant document in its top-5 results, so the index covers the corpus well regardless of profile choice.

  2. medium has the best MRR@5 (0.938). It ranks relevant results higher on average than both light and full, while adding only ~0.15 s per query over light.

  3. full (bge-m3) scores the lowest MRR (0.889) despite being the largest model. bge-m3 is a multilingual model; the Plesk documentation corpus is English-only, which appears to disadvantage it against the English-specialized bge-base-en-v1.5 used by medium. If your corpus includes non-English documentation, full may recover its quality advantage.

  4. full is ~3.8–4.4× slower than light/medium on CUDA (4.58 s vs. 1.04–1.19 s per query). The latency gap would be larger on CPU.

  5. The RSS delta is reported as 0 MB because psutil is not installed in the benchmark environment; the Est. RAM column above therefore uses each profile's static estimate instead.
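A hedged sketch of how such an RSS-delta measurement can work, assuming the benchmark samples resident set size around model load and falls back to 0 when psutil is absent:

```python
# Sample RSS before and after a load step and report the difference in MB.
# psutil is an optional dependency; without it the delta degrades to 0,
# matching the behaviour described above. This is illustrative, not the
# benchmark's actual implementation.

def rss_mb():
    try:
        import psutil  # optional dependency
    except ImportError:
        return None
    return psutil.Process().memory_info().rss / 2**20

def rss_delta_mb(load_fn):
    before = rss_mb()
    load_fn()
    after = rss_mb()
    if before is None or after is None:
        return 0.0  # psutil missing: the column shows 0
    return after - before

# With a no-op "model load", the delta is approximately zero either way.
print(rss_delta_mb(lambda: None))
```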

## Recommendation

| Scenario | Recommended profile |
|---|---|
| Memory-constrained host (< 1 GB) | light |
| Most production deployments | medium (best MRR, moderate latency) |
| Non-English docs or multilingual queries | full |

## Query set

The benchmark uses 12 hand-labelled queries spread across all five sources. Each query has a list of keyword substrings that must appear in at least one top-5 result to count as a hit.

| # | Query | Category | Relevant keywords |
|---|---|---|---|
| 1 | how to define default config settings for a Plesk extension | php-stubs | ConfigDefaults, getDefaults |
| 2 | retrieve extension configuration values | php-stubs | pm_Config, getDefaults |
| 3 | hook interface for Plesk modules | php-stubs | pm_Hook_Interface, Hook |
| 4 | restart Plesk service from command line | cli | plesk repair, restart |
| 5 | create a new subscription via CLI | cli | subscription, add |
| 6 | list all domains via Plesk REST API | api | GET /domains, /api/v2/domains |
| 7 | authenticate with Plesk API using secret key | api | X-API-Key, secret_key, Authorization |
| 8 | add a custom button to Plesk panel | guide | button, custom_buttons, addButton |
| 9 | package a Plesk extension for distribution | guide | plesk ext, package, .zip |
| 10 | register a new page in Plesk JS SDK | js-sdk | registerPage, router |
| 11 | SSL certificate management | (all) | certificate, SSL, TLS |
| 12 | backup and restore Plesk | (all) | backup, restore |

The built-in query sets live in plesk_unified/benchmark_suites.py. You can provide your own queries with --queries my_queries.json (see the script docstring for format).
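For illustration, a custom query file can be generated in the same shape as the built-in set, pairing each query with its hit substrings. The field names below ("query", "relevant") are an assumption; the authoritative JSON schema is in the script docstring.

```python
# Write an illustrative my_queries.json. The schema here is assumed from the
# built-in query table, not taken from the script docstring.
import json

my_queries = [
    {"query": "restart Plesk service from command line",
     "relevant": ["plesk repair", "restart"]},
    {"query": "authenticate with Plesk API using secret key",
     "relevant": ["X-API-Key", "secret_key"]},
]

with open("my_queries.json", "w", encoding="utf-8") as fh:
    json.dump(my_queries, fh, indent=2)
```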


## Index statistics (at time of benchmark)

| Source | Files | Approx. chunks |
|---|---|---|
| php-stubs | 124 | ~139 |
| js-sdk | 53 | ~80 |
| api | 466 | ~1 139 |
| cli | 81 | ~582 |
| guide | 105 | ~281 |
| Total | 829 | ~2 221 |

## How to reproduce

```bash
# Activate the virtual environment
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Full benchmark — re-indexes every profile then runs retrieval queries
python scripts/benchmark_profiles.py --refresh

# Single profile only
python scripts/benchmark_profiles.py --refresh --profiles medium

# Custom query file
python scripts/benchmark_profiles.py --queries my_queries.json

# Save full JSON results
python scripts/benchmark_profiles.py --refresh --output results.json
```

To reproduce the exact conditions of the table above, run with --refresh on a freshly cloned repository so each profile indexes from scratch. Omit --refresh for fast re-runs against existing indexes.

Note: RSS delta requires psutil (pip install psutil). Without it the column shows 0.

## Experimental PageIndex-style pilot

The benchmark runner now also supports a structure-aware pilot engine that reranks the baseline candidates using title and breadcrumb signals:

```bash
python scripts/benchmark_profiles.py --engine pageindex-pilot --profile medium
python scripts/benchmark_profiles.py --autoresearch --repeat 3 --output pilot_runs.json
```

This is benchmark-only. It does not change the MCP runtime search path.
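The reranking idea can be sketched as follows: keep the baseline candidate score and add a bonus when query tokens appear in a chunk's title or breadcrumb. The weights and field names here are illustrative, not the pilot's actual implementation.

```python
# Plausible sketch of structure-aware reranking: boost baseline scores by
# title/breadcrumb token overlap with the query. Weights are illustrative.

def structural_boost(query, candidate, title_w=0.3, crumb_w=0.15):
    q = set(query.lower().split())
    title_hits = q & set(candidate["title"].lower().split())
    crumb_hits = q & set(candidate["breadcrumb"].lower().split())
    return candidate["base_score"] + title_w * len(title_hits) + crumb_w * len(crumb_hits)

candidates = [
    {"title": "Backups", "breadcrumb": "guide / admin / backups", "base_score": 0.52},
    {"title": "Backup and restore", "breadcrumb": "cli / backup", "base_score": 0.50},
]
query = "backup and restore plesk"
ranked = sorted(candidates, key=lambda c: structural_boost(query, c), reverse=True)
print(ranked[0]["title"])  # the title-matching chunk overtakes the higher base score
```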

## PageIndex benchmark suites

Use these when you want to test whether PageIndex is worth adding for specific query shapes:

```bash
# 1. Structural navigation queries
python scripts/benchmark_profiles.py --suite structural --profile medium --engine pageindex-pilot

# 2. Long-document QA queries
python scripts/benchmark_profiles.py --suite long-doc --profile medium --engine pageindex-pilot

# 3. Multi-hop retrieval queries
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --engine pageindex-pilot
```

Current suite sizes:

  • structural: 4 queries
  • long-doc: 3 queries
  • multi-hop: 26 queries (expanded pack)

Recommended interpretation:

  1. structural tells you whether PageIndex improves heading-aware retrieval.
  2. long-doc tells you whether it helps on broad questions over longer pages.
  3. multi-hop tells you whether tree navigation helps with compound questions.

## Automatic query routing policies

The benchmark runner now supports per-query routing between baseline and pageindex-pilot:

```bash
# Baseline behavior (manual engine only)
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy baseline-only --engine baseline

# Adaptive routing: route multi-hop/structural intents to pageindex-pilot, keep lookup intents on baseline
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy adaptive --engine baseline

# Aggressive routing: send every query to pageindex-pilot
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy aggressive --engine baseline
```

Policy guidance:

  1. Use baseline-only for control runs and regression tracking.
  2. Use adaptive for realistic mixed-query evaluation.
  3. Use aggressive as an upper-bound stress test for PageIndex-style reranking.
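A hedged sketch of what an adaptive policy can look like: classify each query's intent from surface cues and route multi-hop/structural intents to the pilot engine, everything else to baseline. The cue lists are illustrative; the runner's real heuristics are not shown here.

```python
# Illustrative per-query routing by intent cues. Cue lists are assumptions,
# not the benchmark runner's actual heuristics.

MULTIHOP_CUES = ("and then", "both", "compare", "difference between")
STRUCTURAL_CUES = ("section", "chapter", "where in")

def route(query):
    q = query.lower()
    if any(cue in q for cue in MULTIHOP_CUES + STRUCTURAL_CUES):
        return "pageindex-pilot"
    return "baseline"

print(route("compare backup and restore options"))  # routed to the pilot
print(route("restart Plesk service"))               # stays on baseline
```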

Observed on the medium profile during the initial pilot:

| Suite | Baseline MRR@5 | PageIndex pilot MRR@5 | Takeaway |
|---|---|---|---|
| structural | 1.000 | 1.000 | No measurable gain; baseline already saturates this slice. |
| long-doc | 1.000 | 1.000 | No change on this small set; needs a harder long-form corpus to differentiate. |
| multi-hop | 0.750 | 1.000 | Best-looking slice for PageIndex-style navigation, but still small-N. |

Validation objective (completed):

Run the full control + multi-hop policy matrix over multiple repeats and make a default-routing decision from mean and variance, not from a single run.

## Final routing matrix — 2026-04-06 (medium profile, 3 repeats each)

Commands used:

```bash
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output /tmp/pageindex_matrix/control_baseline-only.json
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/control_adaptive.json
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/control_aggressive.json

python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output /tmp/pageindex_matrix/multi-hop_baseline-only.json
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/multi-hop_adaptive.json
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/multi-hop_aggressive.json
```

Aggregated results (mean ± std):

| Suite | Policy | HR@5 | MRR@5 | Avg latency (s) | Pilot share | ΔMRR vs baseline |
|---|---|---|---|---|---|---|
| control | baseline-only | 1.000 ± 0.000 | 0.938 ± 0.000 | 1.532 ± 0.046 | 0.000 | +0.000 |
| control | adaptive | 1.000 ± 0.000 | 0.938 ± 0.000 | 1.527 ± 0.119 | 0.333 | +0.000 |
| control | aggressive | 1.000 ± 0.000 | 0.917 ± 0.000 | 1.631 ± 0.124 | 1.000 | -0.021 |
| multi-hop | baseline-only | 1.000 ± 0.000 | 0.940 ± 0.000 | 1.419 ± 0.033 | 0.000 | +0.000 |
| multi-hop | adaptive | 1.000 ± 0.000 | 0.891 ± 0.000 | 1.434 ± 0.027 | 1.000 | -0.049 |
| multi-hop | aggressive | 1.000 ± 0.000 | 0.891 ± 0.000 | 1.392 ± 0.013 | 1.000 | -0.049 |
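The mean and std figures above can be reproduced from the per-repeat output JSONs; a minimal sketch of the aggregation, using the population std (sample std is the other common choice and gives the same result when repeats are identical):

```python
# Aggregate a metric across repeated runs into (mean, std). The per-repeat
# values would be read from the --output JSON files; the field layout of
# those files is not shown here.
from statistics import mean, pstdev

def aggregate(values):
    return mean(values), pstdev(values)

# e.g. three repeats of the control suite under baseline-only
mrr_repeats = [0.938, 0.938, 0.938]
m, s = aggregate(mrr_repeats)
print(f"{m:.3f} ± {s:.3f}")  # 0.938 ± 0.000
```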

Decision-gate outcomes:

  1. Multi-hop MRR does not improve with routing; it drops by 0.049 for both routed policies.
  2. Control-suite MRR regresses under aggressive (-0.021).
  3. adaptive routes all expanded multi-hop queries to the pilot path, so its behavior equals aggressive on this suite and keeps the same MRR regression.

Final recommendation from this matrix:

  1. Keep baseline-only as the default policy.
  2. Keep adaptive behind an experiment flag only while routing heuristics are redesigned and revalidated.
  3. Do not use aggressive in production.

## Rollout checklist

Use this checklist when rerunning the decision matrix after heuristic changes:

1. Run control with all three routing policies:

   ```bash
   python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output control_baseline.json
   python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output control_adaptive.json
   python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output control_aggressive.json
   ```

2. Run expanded multi-hop with all three routing policies:

   ```bash
   python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output multihop_baseline.json
   python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output multihop_adaptive.json
   python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output multihop_aggressive.json
   ```

3. Promote routing only if all gates pass:
   - multi-hop MRR remains consistently higher across repeats,
   - control-suite MRR/HR do not regress materially,
   - latency stays within the acceptable budget for your deployment target.
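The three promotion gates can be checked mechanically; the thresholds below (what counts as "materially" or an acceptable budget) are illustrative and should be set to match your deployment target.

```python
# Automatic check of the three promotion gates. Thresholds are assumptions,
# chosen only to make the gate logic concrete.

def gates_pass(multihop_delta_mrr, control_delta_mrr, latency_s,
               control_tolerance=0.005, latency_budget_s=2.0):
    return (
        multihop_delta_mrr > 0                        # multi-hop MRR higher
        and control_delta_mrr >= -control_tolerance   # control does not regress materially
        and latency_s <= latency_budget_s             # latency within budget
    )

# The 2026-04-06 adaptive run: multi-hop MRR dropped by 0.049, so the gate fails.
print(gates_pass(multihop_delta_mrr=-0.049, control_delta_mrr=0.0, latency_s=1.434))
```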