Tracks retrieval quality and latency across the three built-in model profiles
(light, medium, full). Run with scripts/benchmark_profiles.py.
Hardware: Apple M2 (8-core CPU, 8-core GPU), MacBook Air 2022
Python: 3.12
Query set: 12 queries across all five documentation sources (see Query set below)
Note: The `full` profile (BAAI/bge-m3, ~1.5 GB weights) was intentionally skipped on this machine. During indexing, bge-m3 materialises intermediate attention tensors in float32 on CPU (MPS lacks full bge-m3 op support), temporarily consuming 5–10× the model's static size and pushing the M2's unified memory into heavy swap. The benchmark already shows `medium` outperforms `full` on this English-only corpus, so there is no quality benefit to running it here.
| Profile | Embed model | Dim | HR@5 | MRR@5 | Avg latency | Est. RAM |
|---|---|---|---|---|---|---|
| light | BAAI/bge-small-en-v1.5 | 384 | 100% | 0.933 | 1.25 s | ~200 MB |
| medium | BAAI/bge-base-en-v1.5 | 768 | 100% | 0.938 | 1.31 s | ~600 MB |
| full | BAAI/bge-m3 | 1024 | — | — | — (skipped: OOM risk on M2) | ~1 800 MB |
Both runnable profiles (light, medium) use the reranker cross-encoder/ms-marco-MiniLM-L-6-v2.
- Both runnable profiles hit 100% HR@5. Same result as the CUDA machine — the index quality is hardware-independent.
- `medium` again has the best MRR@5 (0.938). Matches the CUDA result exactly.
- Latency is ~1.25–1.31 s per query on M2. This is comparable to the CUDA baseline (1.04–1.19 s) despite being CPU-only; the M2's unified memory bandwidth keeps query latency competitive for the `light` and `medium` models.
- `full` should be run on the Windows machine (4070 Super), where VRAM handles tensor expansion off the main memory bus. The `lancedb_full` index should live there; the M2 only needs `lancedb_medium`.
| Scenario | Recommended profile |
|---|---|
| Memory-constrained (< 1 GB unified memory available) | light |
| Standard usage on M2 | medium — best MRR, safe memory footprint |
| Non-English docs or multilingual queries | full on a CUDA machine only |
Hardware: NVIDIA GPU (CUDA)
Python: 3.12.10
Query set: 12 queries across all five documentation sources (see Query set below)
| Profile | Embed model | Dim | HR@5 | MRR@5 | Avg latency | Est. RAM |
|---|---|---|---|---|---|---|
light |
BAAI/bge-small-en-v1.5 | 384 | 100% | 0.933 | 1.04 s | ~200 MB |
medium |
BAAI/bge-base-en-v1.5 | 768 | 100% | 0.938 | 1.19 s | ~600 MB |
full |
BAAI/bge-m3 | 1024 | 100% | 0.889 | 4.58 s | ~1 800 MB |
All three profiles share the same reranker (cross-encoder/ms-marco-MiniLM-L-6-v2 for light/medium,
BAAI/bge-reranker-base for full).
| Metric | Definition |
|---|---|
| HR@5 (Hit Rate at 5) | Fraction of queries where at least one relevant chunk appears in the top-5 results. A query is a "hit" if any of its relevant substrings appear (case-insensitive) in any result text. |
| MRR@5 (Mean Reciprocal Rank at 5) | Average of 1/(rank of first hit) across queries. A first hit at rank 1 scores 1.0; at rank 2 scores 0.5; no hit scores 0.0. Measures how high up relevant content appears, not just whether it appears. |
| Avg latency | Wall-clock time per query including ANN search and reranking, measured on the benchmark host. |
| Est. RAM | Approximate resident-set-size increase from loading the embedding model and reranker, as reported by the model profile definition. |
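The two quality metrics follow directly from the definitions above. This is a minimal sketch, not the benchmark script's actual implementation; the `queries` data below is hypothetical:

```python
def first_hit_rank(results, keywords, k=5):
    """Rank (1-based) of the first top-k result containing any relevant
    keyword as a case-insensitive substring, or None if there is no hit."""
    for rank, text in enumerate(results[:k], start=1):
        if any(kw.lower() in text.lower() for kw in keywords):
            return rank
    return None

def hr_and_mrr(queries, k=5):
    """queries: list of (ranked_result_texts, relevant_keywords) pairs.
    Returns (HR@k, MRR@k)."""
    ranks = [first_hit_rank(results, kws, k) for results, kws in queries]
    hr = sum(r is not None for r in ranks) / len(ranks)
    mrr = sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)
    return hr, mrr

# A first hit at rank 1 scores 1.0, at rank 2 scores 0.5, a miss scores 0.0
queries = [
    (["X-API-Key header usage", "other text"], ["X-API-Key"]),  # hit at rank 1
    (["unrelated chunk", "plesk repair docs"], ["plesk repair"]),  # hit at rank 2
    (["nothing relevant here"], ["registerPage"]),  # miss
]
```

With this toy set, HR@5 is 2/3 (two of three queries hit) and MRR@5 is 0.5 ((1.0 + 0.5 + 0.0) / 3).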
- All three profiles hit 100% HR@5. Every query has at least one relevant document in its top-5 results, so the index covers the corpus well regardless of profile choice.
- `medium` has the best MRR@5 (0.938). It ranks relevant results higher on average than both `light` and `full`, while adding only ~0.15 s per query over `light`.
- `full` (bge-m3) scores the lowest MRR (0.889) despite being the largest model. bge-m3 is a multilingual model; the Plesk documentation corpus is English-only, which appears to disadvantage it against the English-specialized bge-base-en-v1.5 used by `medium`. If your corpus includes non-English documentation, `full` may recover its quality advantage.
- `full` is ~3.8–4.4× slower than `light`/`medium` on CUDA (4.58 s vs. 1.04–1.19 s per query). The latency gap would be larger on CPU.
- RSS delta is reported as 0 MB because `psutil` is not installed in the benchmark environment. The Est. RAM column above uses the profile's static estimate instead.
| Scenario | Recommended profile |
|---|---|
| Memory-constrained host (< 1 GB) | light |
| Most production deployments | medium — best MRR, moderate latency |
| Non-English docs or multilingual queries | full |
The benchmark uses 12 hand-labelled queries spread across all five sources. Each query has a list of keyword substrings that must appear in at least one top-5 result to count as a hit.
| # | Query | Category | Relevant keywords |
|---|---|---|---|
| 1 | how to define default config settings for a Plesk extension | php-stubs | ConfigDefaults, getDefaults |
| 2 | retrieve extension configuration values | php-stubs | pm_Config, getDefaults |
| 3 | hook interface for Plesk modules | php-stubs | pm_Hook_Interface, Hook |
| 4 | restart Plesk service from command line | cli | plesk repair, restart |
| 5 | create a new subscription via CLI | cli | subscription, add |
| 6 | list all domains via Plesk REST API | api | GET /domains, /api/v2/domains |
| 7 | authenticate with Plesk API using secret key | api | X-API-Key, secret_key, Authorization |
| 8 | add a custom button to Plesk panel | guide | button, custom_buttons, addButton |
| 9 | package a Plesk extension for distribution | guide | plesk ext, package, .zip |
| 10 | register a new page in Plesk JS SDK | js-sdk | registerPage, router |
| 11 | SSL certificate management | (all) | certificate, SSL, TLS |
| 12 | backup and restore Plesk | (all) | backup, restore |
The built-in query sets live in plesk_unified/benchmark_suites.py.
You can provide your own queries with --queries my_queries.json (see the script docstring for format).
| Source | Files | Approx chunks |
|---|---|---|
| php-stubs | 124 | ~139 |
| js-sdk | 53 | ~80 |
| api | 466 | ~1 139 |
| cli | 81 | ~582 |
| guide | 105 | ~281 |
| Total | 829 | ~2 221 |
```bash
# Activate the virtual environment
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Full benchmark — re-indexes every profile then runs retrieval queries
python scripts/benchmark_profiles.py --refresh

# Single profile only
python scripts/benchmark_profiles.py --refresh --profiles medium

# Custom query file
python scripts/benchmark_profiles.py --queries my_queries.json

# Save full JSON results
python scripts/benchmark_profiles.py --refresh --output results.json
```

To reproduce the exact conditions of the table above, run with --refresh on a freshly cloned repository so each profile indexes from scratch. Omit --refresh for fast re-runs against existing indexes.
Note: RSS delta requires `psutil` (`pip install psutil`). Without it the column shows 0.
The benchmark runner now also supports a structure-aware pilot engine that reranks the baseline candidates using title and breadcrumb signals:

```bash
python scripts/benchmark_profiles.py --engine pageindex-pilot --profile medium
python scripts/benchmark_profiles.py --autoresearch --repeat 3 --output pilot_runs.json
```

This is benchmark-only. It does not change the MCP runtime search path.
Use these when you want to test whether PageIndex is worth adding for specific query shapes:

```bash
# 1. Structural navigation queries
python scripts/benchmark_profiles.py --suite structural --profile medium --engine pageindex-pilot

# 2. Long-document QA queries
python scripts/benchmark_profiles.py --suite long-doc --profile medium --engine pageindex-pilot

# 3. Multi-hop retrieval queries
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --engine pageindex-pilot
```

Current suite sizes:
- `structural`: 4 queries
- `long-doc`: 3 queries
- `multi-hop`: 26 queries (expanded pack)
Recommended interpretation:

- `structural` tells you whether PageIndex improves heading-aware retrieval.
- `long-doc` tells you whether it helps on broad questions over longer pages.
- `multi-hop` tells you whether tree navigation helps with compound questions.
The benchmark runner now supports per-query routing between baseline and pageindex-pilot:

```bash
# Baseline behavior (manual engine only)
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy baseline-only --engine baseline

# Adaptive routing: route multi-hop/structural intents to pageindex-pilot, keep lookup intents on baseline
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy adaptive --engine baseline

# Aggressive routing: send every query to pageindex-pilot
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy aggressive --engine baseline
```

Policy guidance:

- Use `baseline-only` for control runs and regression tracking.
- Use `adaptive` for realistic mixed-query evaluation.
- Use `aggressive` as an upper-bound stress test for PageIndex-style reranking.
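The three policies reduce to a simple per-query decision. The sketch below is illustrative only; the real heuristics live in the benchmark runner, and the intent labels are assumptions:

```python
def choose_engine(intent: str, policy: str) -> str:
    """Pick the retrieval engine for one query under a routing policy."""
    if policy == "baseline-only":
        return "baseline"  # control runs and regression tracking
    if policy == "aggressive":
        return "pageindex-pilot"  # upper-bound stress test
    if policy == "adaptive":
        # Route navigation-heavy intents to the pilot, keep lookups on baseline
        return "pageindex-pilot" if intent in {"multi-hop", "structural"} else "baseline"
    raise ValueError(f"unknown policy: {policy}")
```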
Observed on the medium profile during the initial pilot:
| Suite | Baseline MRR@5 | PageIndex pilot MRR@5 | Takeaway |
|---|---|---|---|
| structural | 1.000 | 1.000 | No measurable gain; baseline already saturates this slice. |
| long-doc | 1.000 | 1.000 | No change on this small set; needs a harder long-form corpus to differentiate. |
| multi-hop | 0.750 | 1.000 | Best-looking slice for PageIndex-style navigation, but still small-N. |
Validation objective (completed): run the full control + multi-hop policy matrix over multiple repeats and make a default-routing decision from mean and variance, not from a single run.

Commands used:

```bash
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output /tmp/pageindex_matrix/control_baseline-only.json
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/control_adaptive.json
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/control_aggressive.json
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output /tmp/pageindex_matrix/multi-hop_baseline-only.json
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/multi-hop_adaptive.json
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/multi-hop_aggressive.json
```

Aggregated results (mean ± std):
| Suite | Policy | HR@5 | MRR@5 | Avg latency (s) | Pilot share | Delta MRR vs baseline |
|---|---|---|---|---|---|---|
| control | baseline-only | 1.000 ± 0.000 | 0.938 ± 0.000 | 1.532 ± 0.046 | 0.000 | +0.000 |
| control | adaptive | 1.000 ± 0.000 | 0.938 ± 0.000 | 1.527 ± 0.119 | 0.333 | +0.000 |
| control | aggressive | 1.000 ± 0.000 | 0.917 ± 0.000 | 1.631 ± 0.124 | 1.000 | -0.021 |
| multi-hop | baseline-only | 1.000 ± 0.000 | 0.940 ± 0.000 | 1.419 ± 0.033 | 0.000 | +0.000 |
| multi-hop | adaptive | 1.000 ± 0.000 | 0.891 ± 0.000 | 1.434 ± 0.027 | 1.000 | -0.049 |
| multi-hop | aggressive | 1.000 ± 0.000 | 0.891 ± 0.000 | 1.392 ± 0.013 | 1.000 | -0.049 |
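The mean ± std aggregation itself is standard-library work. This sketch assumes each repeat yields one scalar MRR@5 value; the actual schema of the `--output` JSON files may differ:

```python
import statistics

def summarize(values):
    """Mean and sample standard deviation across benchmark repeats."""
    return statistics.mean(values), statistics.stdev(values)

# e.g. three identical repeats, as seen for the deterministic MRR columns above
mrr_per_repeat = [0.891, 0.891, 0.891]
mean, std = summarize(mrr_per_repeat)  # 0.891 ± 0.000
```

Latency, by contrast, varies between repeats, which is why only that column shows non-zero std.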
Decision-gate outcomes:

- Multi-hop MRR does not improve with routing; it drops by 0.049 for both routed policies.
- Control-suite MRR regresses under `aggressive` (-0.021).
- `adaptive` routes all expanded multi-hop queries to the pilot path, so its behavior equals `aggressive` on this suite and keeps the same MRR regression.
Final recommendation from this matrix:

- Keep `baseline-only` as the default policy.
- Keep `adaptive` behind an experiment flag only while routing heuristics are redesigned and revalidated.
- Do not use `aggressive` in production.
Use this checklist when rerunning the decision matrix after heuristic changes:

- Run control with all three routing policies:

```bash
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output control_baseline.json
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output control_adaptive.json
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output control_aggressive.json
```

- Run expanded multi-hop with all three routing policies:

```bash
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output multihop_baseline.json
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output multihop_adaptive.json
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output multihop_aggressive.json
```

- Promote routing only if all gates pass:
  - multi-hop MRR remains consistently higher across repeats,
  - control-suite MRR/HR do not regress materially,
  - latency stays within the acceptable budget for your deployment target.
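The three gates can be encoded as a small promotion check. The tolerance and latency budget below are illustrative defaults, not project-mandated thresholds:

```python
def promote_routing(multihop_routed, multihop_baseline,
                    control_routed, control_baseline,
                    latencies_s, *, control_tol=0.005, latency_budget_s=2.0):
    """True only if every decision gate passes across all repeats.

    Each argument except the keyword options is a list of per-repeat values
    (MRR@5 for the first four, per-query latency in seconds for the last).
    """
    mrr_gain = all(r > b for r, b in zip(multihop_routed, multihop_baseline))
    no_control_regression = all(r >= b - control_tol
                                for r, b in zip(control_routed, control_baseline))
    within_budget = all(t <= latency_budget_s for t in latencies_s)
    return mrr_gain and no_control_regression and within_budget
```

Fed the adaptive numbers from the matrix above (multi-hop 0.891 routed vs. 0.940 baseline), the first gate fails, matching the keep-baseline decision.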