# Retrieval Quality Benchmarks

Tracks retrieval quality and latency across the three built-in model profiles (light, medium, full). Run with `scripts/benchmark_profiles.py`.


## Latest results — 2026-03-08 (Apple M2, MacBook Air 2022)

- Hardware: Apple M2 (8-core CPU, 8-core GPU), MacBook Air 2022
- Python: 3.12
- Query set: 12 queries across all five documentation sources (see Query set below)

Note: The full profile (BAAI/bge-m3, ~1.5 GB weights) was intentionally skipped on this machine. During indexing, bge-m3 materialises intermediate attention tensors in float32 on CPU (MPS lacks full bge-m3 op support), temporarily consuming 5–10× the model's static size and pushing the M2's unified memory into heavy swap. The benchmark already shows medium outperforms full on this English-only corpus, so there is no quality benefit to running it here.

## Summary

| Profile | Embed model | Dim | HR@5 | MRR@5 | Avg latency | Est. RAM |
|---|---|---|---|---|---|---|
| light | BAAI/bge-small-en-v1.5 | 384 | 100% | 0.933 | 1.25 s | ~200 MB |
| medium | BAAI/bge-base-en-v1.5 | 768 | 100% | 0.938 | 1.31 s | ~600 MB |
| full | BAAI/bge-m3 | 1024 | — | — | — | ~1 800 MB |

full was skipped on this machine (OOM risk on the M2; see note above). Both runnable profiles use the reranker cross-encoder/ms-marco-MiniLM-L-6-v2.

## Observations

  1. Both runnable profiles hit 100% HR@5. Same result as the CUDA machine — the index quality is hardware-independent.

  2. medium again has the best MRR@5 (0.938). Matches the CUDA result exactly.

  3. Latency is ~1.25–1.31 s per query on M2. This is comparable to the CUDA baseline (1.04–1.19 s) despite being CPU-only; the M2's unified memory bandwidth keeps query latency competitive for the light and medium models.

  4. full should be run on the Windows machine (4070 Super) where VRAM handles tensor expansion off the main memory bus. The lancedb_full index should live there; the M2 only needs lancedb_medium.

## Recommendation (M2 / Apple Silicon)

| Scenario | Recommended profile |
|---|---|
| Memory-constrained (< 1 GB unified memory available) | light |
| Standard usage on M2 | medium (best MRR, safe memory footprint) |
| Non-English docs or multilingual queries | full, on a CUDA machine only |

## Previous results — 2026-03-08 (NVIDIA CUDA)

- Hardware: NVIDIA GPU (CUDA)
- Python: 3.12.10
- Query set: 12 queries across all five documentation sources (see Query set below)

## Summary

| Profile | Embed model | Dim | HR@5 | MRR@5 | Avg latency | Est. RAM |
|---|---|---|---|---|---|---|
| light | BAAI/bge-small-en-v1.5 | 384 | 100% | 0.933 | 1.04 s | ~200 MB |
| medium | BAAI/bge-base-en-v1.5 | 768 | 100% | 0.938 | 1.19 s | ~600 MB |
| full | BAAI/bge-m3 | 1024 | 100% | 0.889 | 4.58 s | ~1 800 MB |

Each profile pairs its embedder with a reranker: cross-encoder/ms-marco-MiniLM-L-6-v2 for light and medium, BAAI/bge-reranker-base for full.
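The retrieve-then-rerank flow these profiles share can be sketched as follows. The scoring function below is a stand-in for illustration only; the real benchmark loads a cross-encoder such as ms-marco-MiniLM-L-6-v2, which scores each (query, chunk) pair jointly.

```python
# Sketch of the two-stage pipeline: an ANN search returns candidate chunks,
# then a reranker re-scores (query, chunk) pairs and reorders them.

def rerank(query, candidates, score_fn, top_k=5):
    """Re-score ANN candidates with a cross-encoder-style pair scorer."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

def overlap_score(query, text):
    """Stand-in scorer: token overlap (a real cross-encoder scores jointly)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)

candidates = [
    "Use plesk bin subscription --create to add a subscription.",
    "TLS certificates are managed under SSL/TLS settings.",
    "The subscription add command creates a new subscription via cli",
]
top = rerank("create a new subscription via CLI", candidates, overlap_score, top_k=2)
print(top[0])  # the third candidate overlaps the query most, so it ranks first
```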


## Metrics

| Metric | Definition |
|---|---|
| HR@5 (Hit Rate at 5) | Fraction of queries where at least one relevant chunk appears in the top-5 results. A query is a "hit" if any of its relevant substrings appear (case-insensitive) in any result text. |
| MRR@5 (Mean Reciprocal Rank at 5) | Average of 1/(rank of first hit) across queries. A first hit at rank 1 scores 1.0; at rank 2, 0.5; no hit scores 0.0. Measures how high up relevant content appears, not just whether it appears. |
| Avg latency | Wall-clock time per query, including ANN search and reranking, measured on the benchmark host. |
| Est. RAM | Approximate resident-set-size increase from loading the embedding model and reranker, as reported by the model profile definition. |
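Both quality metrics follow directly from these definitions; a minimal sketch:

```python
# HR@5 and MRR@5 exactly as defined above: a "hit" is a result whose text
# contains any relevant substring (case-insensitive); ranks are 1-based.

def first_hit_rank(results, relevant, k=5):
    for rank, text in enumerate(results[:k], start=1):
        if any(sub.lower() in text.lower() for sub in relevant):
            return rank
    return None  # no hit in the top k

def hr_and_mrr(queries, k=5):
    """queries: list of (results, relevant_substrings) pairs."""
    ranks = [first_hit_rank(results, relevant, k) for results, relevant in queries]
    hr = sum(r is not None for r in ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks if r is not None) / len(ranks)
    return hr, mrr

queries = [
    (["pm_Config stores settings", "other text"], ["pm_Config"]),  # hit at rank 1
    (["misc", "use plesk repair to restart"], ["plesk repair"]),   # hit at rank 2
    (["nothing relevant here"], ["registerPage"]),                 # no hit
]
print(hr_and_mrr(queries))  # HR = 2/3; MRR = (1 + 1/2 + 0) / 3 = 0.5
```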

## Observations

  1. All three profiles hit 100% HR@5. Every query has at least one relevant document in its top-5 results, so the index covers the corpus well regardless of profile choice.

  2. medium has the best MRR@5 (0.938). It ranks relevant results higher on average than both light and full, while adding only ~0.15 s per query over light.

  3. full (bge-m3) scores the lowest MRR (0.889) despite being the largest model. bge-m3 is a multilingual model; the Plesk documentation corpus is English-only, which appears to disadvantage it against the English-specialized bge-base-en-v1.5 used by medium. If your corpus includes non-English documentation, full may recover its quality advantage.

  4. full is ~3.8–4.4× slower than light/medium on CUDA (4.58 s vs. 1.04–1.19 s per query). The latency gap would be larger on CPU.

  5. The RSS delta is reported as 0 MB because psutil is not installed in the benchmark environment; the Est. RAM column above therefore uses each profile's static estimate instead.
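A hedged sketch of how such an RSS-delta measurement can work, assuming the benchmark samples resident set size around model load and falls back to 0 when psutil is absent:

```python
# Sample RSS before and after a load step and report the difference in MB.
# psutil is an optional dependency; without it the delta degrades to 0,
# matching the behaviour described above. This is illustrative, not the
# benchmark's actual implementation.

def rss_mb():
    try:
        import psutil  # optional dependency
    except ImportError:
        return None
    return psutil.Process().memory_info().rss / 2**20

def rss_delta_mb(load_fn):
    before = rss_mb()
    load_fn()
    after = rss_mb()
    if before is None or after is None:
        return 0.0  # psutil missing: the column shows 0
    return after - before

# With a no-op "model load", the delta is approximately zero either way.
print(rss_delta_mb(lambda: None))
```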

## Recommendation

| Scenario | Recommended profile |
|---|---|
| Memory-constrained host (< 1 GB) | light |
| Most production deployments | medium (best MRR, moderate latency) |
| Non-English docs or multilingual queries | full |

## Query set

The benchmark uses 12 hand-labelled queries spread across all five sources. Each query has a list of keyword substrings that must appear in at least one top-5 result to count as a hit.

| # | Query | Category | Relevant keywords |
|---|---|---|---|
| 1 | how to define default config settings for a Plesk extension | php-stubs | ConfigDefaults, getDefaults |
| 2 | retrieve extension configuration values | php-stubs | pm_Config, getDefaults |
| 3 | hook interface for Plesk modules | php-stubs | pm_Hook_Interface, Hook |
| 4 | restart Plesk service from command line | cli | plesk repair, restart |
| 5 | create a new subscription via CLI | cli | subscription, add |
| 6 | list all domains via Plesk REST API | api | GET /domains, /api/v2/domains |
| 7 | authenticate with Plesk API using secret key | api | X-API-Key, secret_key, Authorization |
| 8 | add a custom button to Plesk panel | guide | button, custom_buttons, addButton |
| 9 | package a Plesk extension for distribution | guide | plesk ext, package, .zip |
| 10 | register a new page in Plesk JS SDK | js-sdk | registerPage, router |
| 11 | SSL certificate management | (all) | certificate, SSL, TLS |
| 12 | backup and restore Plesk | (all) | backup, restore |

The built-in query sets live in plesk_unified/benchmark_suites.py. You can provide your own queries with --queries my_queries.json (see the script docstring for format).
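For illustration, a custom query file can be generated in the same shape as the built-in set, pairing each query with its hit substrings. The field names below ("query", "relevant") are an assumption; the authoritative JSON schema is in the script docstring.

```python
# Write an illustrative my_queries.json. The schema here is assumed from the
# built-in query table, not taken from the script docstring.
import json

my_queries = [
    {"query": "restart Plesk service from command line",
     "relevant": ["plesk repair", "restart"]},
    {"query": "authenticate with Plesk API using secret key",
     "relevant": ["X-API-Key", "secret_key"]},
]

with open("my_queries.json", "w", encoding="utf-8") as fh:
    json.dump(my_queries, fh, indent=2)
```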


## Index statistics (at time of benchmark)

| Source | Files | Approx. chunks |
|---|---|---|
| php-stubs | 124 | ~139 |
| js-sdk | 53 | ~80 |
| api | 466 | ~1 139 |
| cli | 81 | ~582 |
| guide | 105 | ~281 |
| Total | 829 | ~2 221 |

## How to reproduce

```bash
# Activate the virtual environment
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Full benchmark — re-indexes every profile then runs retrieval queries
python scripts/benchmark_profiles.py --refresh

# Single profile only
python scripts/benchmark_profiles.py --refresh --profiles medium

# Custom query file
python scripts/benchmark_profiles.py --queries my_queries.json

# Save full JSON results
python scripts/benchmark_profiles.py --refresh --output results.json
```

To reproduce the exact conditions of the table above, run with --refresh on a freshly cloned repository so each profile indexes from scratch. Omit --refresh for fast re-runs against existing indexes.

Note: RSS delta requires psutil (pip install psutil). Without it the column shows 0.

## Experimental PageIndex-style pilot

The benchmark runner now also supports a structure-aware pilot engine that reranks the baseline candidates using title and breadcrumb signals:

```bash
python scripts/benchmark_profiles.py --engine pageindex-pilot --profile medium
python scripts/benchmark_profiles.py --autoresearch --repeat 3 --output pilot_runs.json
```

This is benchmark-only. It does not change the MCP runtime search path.
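The reranking idea can be sketched as follows: keep the baseline candidate score and add a bonus when query tokens appear in a chunk's title or breadcrumb. The weights and field names here are illustrative, not the pilot's actual implementation.

```python
# Plausible sketch of structure-aware reranking: boost baseline scores by
# title/breadcrumb token overlap with the query. Weights are illustrative.

def structural_boost(query, candidate, title_w=0.3, crumb_w=0.15):
    q = set(query.lower().split())
    title_hits = q & set(candidate["title"].lower().split())
    crumb_hits = q & set(candidate["breadcrumb"].lower().split())
    return candidate["base_score"] + title_w * len(title_hits) + crumb_w * len(crumb_hits)

candidates = [
    {"title": "Backups", "breadcrumb": "guide / admin / backups", "base_score": 0.52},
    {"title": "Backup and restore", "breadcrumb": "cli / backup", "base_score": 0.50},
]
query = "backup and restore plesk"
ranked = sorted(candidates, key=lambda c: structural_boost(query, c), reverse=True)
print(ranked[0]["title"])  # the title-matching chunk overtakes the higher base score
```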

## PageIndex benchmark suites

Use these when you want to test whether PageIndex is worth adding for specific query shapes:

```bash
# 1. Structural navigation queries
python scripts/benchmark_profiles.py --suite structural --profile medium --engine pageindex-pilot

# 2. Long-document QA queries
python scripts/benchmark_profiles.py --suite long-doc --profile medium --engine pageindex-pilot

# 3. Multi-hop retrieval queries
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --engine pageindex-pilot
```

Current suite sizes:

  • structural: 4 queries
  • long-doc: 3 queries
  • multi-hop: 26 queries (expanded pack)

Recommended interpretation:

  1. structural tells you whether PageIndex improves heading-aware retrieval.
  2. long-doc tells you whether it helps on broad questions over longer pages.
  3. multi-hop tells you whether tree navigation helps with compound questions.

## Automatic query routing policies

The benchmark runner now supports per-query routing between baseline and pageindex-pilot:

```bash
# Baseline behavior (manual engine only)
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy baseline-only --engine baseline

# Adaptive routing: route multi-hop/structural intents to pageindex-pilot, keep lookup intents on baseline
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy adaptive --engine baseline

# Aggressive routing: send every query to pageindex-pilot
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy aggressive --engine baseline
```

Policy guidance:

  1. Use baseline-only for control runs and regression tracking.
  2. Use adaptive for realistic mixed-query evaluation.
  3. Use aggressive as an upper-bound stress test for PageIndex-style reranking.
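A hedged sketch of what an adaptive policy can look like: classify each query's intent from surface cues and route multi-hop/structural intents to the pilot engine, everything else to baseline. The cue lists are illustrative; the runner's real heuristics are not shown here.

```python
# Illustrative per-query routing by intent cues. Cue lists are assumptions,
# not the benchmark runner's actual heuristics.

MULTIHOP_CUES = ("and then", "both", "compare", "difference between")
STRUCTURAL_CUES = ("section", "chapter", "where in")

def route(query):
    q = query.lower()
    if any(cue in q for cue in MULTIHOP_CUES + STRUCTURAL_CUES):
        return "pageindex-pilot"
    return "baseline"

print(route("compare backup and restore options"))  # routed to the pilot
print(route("restart Plesk service"))               # stays on baseline
```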

Observed on the medium profile during the initial pilot:

| Suite | Baseline MRR@5 | PageIndex pilot MRR@5 | Takeaway |
|---|---|---|---|
| structural | 1.000 | 1.000 | No measurable gain; baseline already saturates this slice. |
| long-doc | 1.000 | 1.000 | No change on this small set; needs a harder long-form corpus to differentiate. |
| multi-hop | 0.750 | 1.000 | Best-looking slice for PageIndex-style navigation, but still small-N. |

Validation objective (completed):

Run the full control + multi-hop policy matrix over multiple repeats and make a default-routing decision from mean and variance, not from a single run.

## Final routing matrix — 2026-04-06 (medium profile, 3 repeats each)

Commands used:

```bash
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output /tmp/pageindex_matrix/control_baseline-only.json
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/control_adaptive.json
python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/control_aggressive.json

python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output /tmp/pageindex_matrix/multi-hop_baseline-only.json
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/multi-hop_adaptive.json
python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output /tmp/pageindex_matrix/multi-hop_aggressive.json
```

Aggregated results (mean ± std):

| Suite | Policy | HR@5 | MRR@5 | Avg latency (s) | Pilot share | ΔMRR vs baseline |
|---|---|---|---|---|---|---|
| control | baseline-only | 1.000 ± 0.000 | 0.938 ± 0.000 | 1.532 ± 0.046 | 0.000 | +0.000 |
| control | adaptive | 1.000 ± 0.000 | 0.938 ± 0.000 | 1.527 ± 0.119 | 0.333 | +0.000 |
| control | aggressive | 1.000 ± 0.000 | 0.917 ± 0.000 | 1.631 ± 0.124 | 1.000 | -0.021 |
| multi-hop | baseline-only | 1.000 ± 0.000 | 0.940 ± 0.000 | 1.419 ± 0.033 | 0.000 | +0.000 |
| multi-hop | adaptive | 1.000 ± 0.000 | 0.891 ± 0.000 | 1.434 ± 0.027 | 1.000 | -0.049 |
| multi-hop | aggressive | 1.000 ± 0.000 | 0.891 ± 0.000 | 1.392 ± 0.013 | 1.000 | -0.049 |
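The mean and std figures above can be reproduced from the per-repeat output JSONs; a minimal sketch of the aggregation, using the population std (sample std is the other common choice and gives the same result when repeats are identical):

```python
# Aggregate a metric across repeated runs into (mean, std). The per-repeat
# values would be read from the --output JSON files; the field layout of
# those files is not shown here.
from statistics import mean, pstdev

def aggregate(values):
    return mean(values), pstdev(values)

# e.g. three repeats of the control suite under baseline-only
mrr_repeats = [0.938, 0.938, 0.938]
m, s = aggregate(mrr_repeats)
print(f"{m:.3f} ± {s:.3f}")  # 0.938 ± 0.000
```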

Decision-gate outcomes:

  1. Multi-hop MRR does not improve with routing; it drops by 0.049 for both routed policies.
  2. Control-suite MRR regresses under aggressive (-0.021).
  3. adaptive routes all expanded multi-hop queries to the pilot path, so its behavior equals aggressive on this suite and keeps the same MRR regression.

Final recommendation from this matrix:

  1. Keep baseline-only as the default policy.
  2. Keep adaptive behind an experiment flag only while routing heuristics are redesigned and revalidated.
  3. Do not use aggressive in production.

## Rollout checklist

Use this checklist when rerunning the decision matrix after heuristic changes:

1. Run control with all three routing policies:

   ```bash
   python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output control_baseline.json
   python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output control_adaptive.json
   python scripts/benchmark_profiles.py --suite control --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output control_aggressive.json
   ```

2. Run expanded multi-hop with all three routing policies:

   ```bash
   python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy baseline-only --engine baseline --repeat 3 --output multihop_baseline.json
   python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy adaptive --engine baseline --repeat 3 --output multihop_adaptive.json
   python scripts/benchmark_profiles.py --suite multi-hop --profile medium --routing-policy aggressive --engine baseline --repeat 3 --output multihop_aggressive.json
   ```

3. Promote routing only if all gates pass:
   - multi-hop MRR remains consistently higher across repeats,
   - control-suite MRR/HR do not regress materially,
   - latency stays within the acceptable budget for your deployment target.
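The three promotion gates can be checked mechanically; the thresholds below (what counts as "materially" or an acceptable budget) are illustrative and should be set to match your deployment target.

```python
# Automatic check of the three promotion gates. Thresholds are assumptions,
# chosen only to make the gate logic concrete.

def gates_pass(multihop_delta_mrr, control_delta_mrr, latency_s,
               control_tolerance=0.005, latency_budget_s=2.0):
    return (
        multihop_delta_mrr > 0                        # multi-hop MRR higher
        and control_delta_mrr >= -control_tolerance   # control does not regress materially
        and latency_s <= latency_budget_s             # latency within budget
    )

# The 2026-04-06 adaptive run: multi-hop MRR dropped by 0.049, so the gate fails.
print(gates_pass(multihop_delta_mrr=-0.049, control_delta_mrr=0.0, latency_s=1.434))
```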