Routing supervision gold labels for workloads that actually burn tokens.
Chinese version: README.zh.md
CommonRouterBench provides per-step LLM routing gold labels for scenarios where routing actually matters: multi-step agents, long-context RAG, and vibecoding loops. Instead of relying on flawed LLM-as-judge scores or single-shot questions, our labels are generated by analyzing successful traces and performing a sequential downgrade search to find the cheapest capability tier that still passes the rigorous task checks.
If you are building routers for high-consumption, multi-turn scenarios, you can use this dataset to evaluate your router or train tier predictors using realistic prefixes (conversation + tool outputs + diffs).
1. Install the package
```shell
# From this directory (editable, for development):
pip install -e .
# Or from PyPI (once published):
pip install CommonRouterBench
```

Note: the PyPI distribution is `CommonRouterBench`, but you `import main` in your code.
2. Evaluate a router
```python
from main.eval import FunctionPredictor, run_question_bank_eval

# Example: an oracle that always predicts the gold tier
oracle = FunctionPredictor(lambda row: row["target_tier_id"])
summary = run_question_bank_eval(oracle, n=20, seed=1)

# Headline: four component scores plus their arithmetic mean.
print(summary["scores_v2"])
```

- What is shipped: this directory is the only part of the repository intended for open source. Published artifacts ship the `main` package, the `data/` question bank, and documentation. Private test harnesses are excluded.
- Version: `0.1.0` (see CHANGELOG.md).
- Dependencies: the core package depends on `requests` (HTTP helpers in `main.router_llm`), `tiktoken` (fallback token counting), and `tokenizers` (HuggingFace — native vendor tokenizers for `main.tokenizer`).
- Local tests: you may keep a `tests/` directory beside `pyproject.toml` for pytest; it is `.gitignore`d.
| Path | Purpose |
|---|---|
| `main/` | Published Python package (`import main`). |
| `data/` | `question_bank.jsonl` and `manifest.json` (bundled in wheels when present at build time). |
Regenerating data/ from private benchmark exports is out of scope for the published package; keep your own merge tooling outside this tree if needed.
Artifacts under data/:
- `data/question_bank.jsonl` — all routing-step records in one file (no per-benchmark subdirectories).
- `data/manifest.json` — per-source line counts and schema metadata.
Each line includes a string field `benchmark` (e.g. `swebench`, `mtrag`) for filtering.
The question bank does not include fields such as `optimal_model` or `baseline_model`; supervision is by capability tier only, using English labels and a numeric id.
| `target_tier` (string) | `target_tier_id` (int) | Tier (plain English) |
|---|---|---|
| `low` | 0 | Low |
| `mid` | 1 | Medium |
| `mid_high` | 2 | Medium-high |
| `high` | 3 | High |
Chinese tier label equivalents for the English target_tier strings are summarized in README.zh.md.
Each line includes at least: `id`, `benchmark`, `scenario`, `instance_id`, `step_index`, `total_steps`, `messages`, `target_tier`, `target_tier_id`.
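To make the schema concrete, here is an illustrative record built from the fields listed above. All values are hypothetical; real rows come from `data/question_bank.jsonl`.

```python
# Illustrative routing-step record (values invented for illustration;
# structure follows the documented schema).
example_row = {
    "id": "swebench-00042-step-3",
    "benchmark": "swebench",
    "scenario": "multi_step_agent",
    "instance_id": "swebench-00042",
    "step_index": 3,
    "total_steps": 7,
    "messages": [
        {"role": "user", "content": "Fix the failing test in utils.py."},
        {"role": "assistant", "content": "Let me inspect the traceback first."},
    ],
    "target_tier": "high",
    "target_tier_id": 3,
}

# A router only ever has to map such a row to a tier id in {0, 1, 2, 3}.
assert example_row["target_tier_id"] in {0, 1, 2, 3}
assert 0 <= example_row["step_index"] < example_row["total_steps"]
```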
The following counts match the data/question_bank.jsonl and data/manifest.json shipped in this repository (970 routing-step rows). Rebuilding the bank from private exports may change these figures.
For BFCL, the public corpus now includes both single-turn and multi-turn routing supervision rows.
| `benchmark` | Rows | Share of bank |
|---|---|---|
| `swebench` | 336 | 34.6% |
| `bfcl` | 248 | 25.6% |
| `mtrag` | 193 | 19.9% |
| `qmsum` | 145 | 14.9% |
| `pinchbench` | 48 | 4.9% |
| Total | 970 | 100% |
| `target_tier` | `target_tier_id` | Rows | Share |
|---|---|---|---|
| `low` | 0 | 689 | 71.0% |
| `mid` | 1 | 62 | 6.4% |
| `mid_high` | 2 | 49 | 5.1% |
| `high` | 3 | 170 | 17.5% |
| Total | — | 970 | 100% |
| `benchmark` | Rows | `low` | `mid` | `mid_high` | `high` |
|---|---|---|---|---|---|
| `bfcl` | 248 | 239 | 8 | 1 | 0 |
| `mtrag` | 193 | 183 | 8 | 1 | 1 |
| `pinchbench` | 48 | 41 | 3 | 3 | 1 |
| `qmsum` | 145 | 132 | 10 | 3 | 0 |
| `swebench` | 336 | 94 | 33 | 41 | 168 |
Authoritative values live in main.pricing: TIER_OUTPUT_USD_PER_1M, TIER_INPUT_USD_PER_1M, TIER_CACHE_READ_USD_PER_1M, and TIER_CACHE_WRITE_USD_PER_1M. Legacy section_11 / step_nominal_cost_usd use output pricing only (TIER_OUTPUT_USD_PER_1M). The tables below mirror the shipped constants.
| Public `target_tier` | USD / 1M output tokens |
|---|---|
| `low` | 0.5 |
| `mid` | 2.0 |
| `mid_high` | 5.0 |
| `high` | 25.0 |
| Public `target_tier` | USD / 1M input | USD / 1M cache read | USD / 1M cache write |
|---|---|---|---|
| `low` | 0.26 | 0.13 | 0.26 |
| `mid` | 0.30 | 0.059 | 0.30 |
| `mid_high` | 0.50 | 0.05 | 0.08333 |
| `high` | 5.0 | 0.50 | 6.25 |
For tiers without a published cache-write price (low, mid), we conservatively assume cache write = base input price.
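A step's full bill combines the four categories above. The sketch below re-derives that combination from the tabled per-1M prices; the authoritative constants and cost code live in `main.pricing` / `main.eval`, and the function here (`step_cost_usd`) is a simplified stand-in, not the library API.

```python
# Simplified four-category step bill from the per-1M prices tabled above.
TIER_INPUT_USD_PER_1M = {"low": 0.26, "mid": 0.30, "mid_high": 0.50, "high": 5.0}
TIER_CACHE_READ_USD_PER_1M = {"low": 0.13, "mid": 0.059, "mid_high": 0.05, "high": 0.50}
TIER_CACHE_WRITE_USD_PER_1M = {"low": 0.26, "mid": 0.30, "mid_high": 0.08333, "high": 6.25}
TIER_OUTPUT_USD_PER_1M = {"low": 0.5, "mid": 2.0, "mid_high": 5.0, "high": 25.0}

def step_cost_usd(tier, input_tokens, cache_read_tokens, cache_write_tokens, output_tokens):
    """Full cost of one routing step at `tier`, in USD."""
    return (
        input_tokens * TIER_INPUT_USD_PER_1M[tier]
        + cache_read_tokens * TIER_CACHE_READ_USD_PER_1M[tier]
        + cache_write_tokens * TIER_CACHE_WRITE_USD_PER_1M[tier]
        + output_tokens * TIER_OUTPUT_USD_PER_1M[tier]
    ) / 1_000_000

# e.g. a high-tier step: 10k cached-read tokens, 2k cache-write, 500 output
cost = step_cost_usd("high", 0, 10_000, 2_000, 500)  # 0.03 USD
```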
When computing costs from concrete model endpoints inside your harness, this library maps known model ids to these tiers and raises `ValueError` on unknown ids. That mapping lives in code only, not in the open JSONL.
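The shape of that mapping can be sketched as follows. The model ids below are placeholders, not the shipped table; only the fail-loudly behavior on unknown ids mirrors the library.

```python
# Hypothetical model-id -> tier-id table (the real one ships in code, not JSONL).
MODEL_TO_TIER_ID = {
    "example/cheap-model": 0,
    "example/mid-model": 1,
    "example/frontier-model": 3,
}

def tier_id_for_model(model_id: str) -> int:
    try:
        return MODEL_TO_TIER_ID[model_id]
    except KeyError:
        # Mirrors the library's behavior: unknown ids fail loudly.
        raise ValueError(f"unknown model id: {model_id!r}")
```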
Each line in data/question_bank.jsonl is one routing supervision step: a conversation prefix (messages) and a gold capability tier (target_tier / target_tier_id). Any router you plug in must produce a tier id in {0,1,2,3} for that step. The library scores predictions against gold using the rules below.
- Full bank — `run_question_bank_eval(..., n=None)`: every row, file order (~970 steps in the public build).
- Fixed size, stratified by source — pass `--n N` (CLI) or `n=N` (API): largest-remainder quotas by `data/manifest.json` `sources.*.line_count`, then one-pass reservoir sampling per benchmark stratum (`--seed` fixes RNG). This keeps the five logical benchmarks (`swebench`, `pinchbench`, `mtrag`, `qmsum`, `bfcl`) in roughly the same ratio as the full corpus.
Report sample_mode, benchmark_counts, and by_benchmark from the eval JSON so others can reproduce your split.
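The stratified mode described above can be sketched in a few lines. This is a pure-Python illustration of largest-remainder allocation plus Algorithm R reservoir sampling, not the library's `manifest_proportional_quotas` / `proportional_reservoir_sample` implementations, so exact quotas and samples may differ.

```python
import random

def largest_remainder_quotas(line_counts: dict, n: int) -> dict:
    """Allocate n samples across strata proportionally to line_counts."""
    total = sum(line_counts.values())
    exact = {b: n * c / total for b, c in line_counts.items()}
    quotas = {b: int(x) for b, x in exact.items()}
    leftovers = n - sum(quotas.values())
    # Hand remaining slots to the strata with the largest fractional remainders.
    for b in sorted(exact, key=lambda b: exact[b] - quotas[b], reverse=True)[:leftovers]:
        quotas[b] += 1
    return quotas

def reservoir_sample(stream, k, rng):
    """One-pass reservoir sampling (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive upper bound
            if j < k:
                reservoir[j] = item
    return reservoir

counts = {"swebench": 336, "bfcl": 248, "mtrag": 193, "qmsum": 145, "pinchbench": 48}
quotas = largest_remainder_quotas(counts, 20)
assert sum(quotas.values()) == 20
sample = reservoir_sample(range(336), quotas["swebench"], random.Random(1))
```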
For teams that choose to call a chat model behind an OpenAI-compatible HTTP API, this package exposes a digit-tier contract via OpenAICompatRouterClassifier and LlmDigitClassifierPredictor. That is a reference integration only—not a recommendation that LLM-based routing is preferable to rules, classical ML, or other designs.
The contract is:
- Linearize the row's `messages` into one user string (`question_bank_messages_to_classifier_prompt`).
- Send one chat completion per row; the assistant message must be parseable as a single digit `0`–`3` (optional surrounding whitespace; no extra lines or prose — see `parse_tier_response_to_id`).
- Call `run_question_bank_eval` / `evaluate_question_bank_rows` from `main.eval` from your own driver (load rows, call the predictor, aggregate JSON).
Implement a function `f(row: dict) -> int` that returns `target_tier_id` in 0..3 from the raw row (you may ignore `messages` or engineer features from them). Wrap it with `FunctionPredictor` and pass it to `run_question_bank_eval` or `evaluate_question_bank_rows`. No HTTP and no chat template are required; the same JSON summary and `by_benchmark` breakdown apply.
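A toy predictor satisfying that contract might route on prompt length alone. The character thresholds below are invented for illustration, not tuned on the bank.

```python
# Toy tier heuristic satisfying the f(row) -> int contract.
# Thresholds are placeholders, not tuned values.
def length_heuristic(row: dict) -> int:
    chars = sum(len(str(m.get("content", ""))) for m in row["messages"])
    if chars < 2_000:
        return 0
    if chars < 10_000:
        return 1
    if chars < 50_000:
        return 2
    return 3

# In real use, wrap and evaluate:
#   from main.eval import FunctionPredictor, run_question_bank_eval
#   summary = run_question_bank_eval(FunctionPredictor(length_heuristic), n=100, seed=1)
row = {"messages": [{"role": "user", "content": "x" * 3000}]}
assert length_heuristic(row) == 1
```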
These metrics are computed by main.eval.
scores_v2 (top-level field in the eval summary, computed by compute_v2_scores) is the recommended headline: four orthogonal dimension scores plus their arithmetic mean. section_11 (legacy cost_savings_score, one step per row, output-token-only nominal cost) and router_accounting (trajectory-level D = Σ(baseline − gold) with N gated on trajectory pass/fail) are still emitted for backward compatibility but are superseded by scores_v2. Neither path requires running full benchmark tasks to completion.
| # | Field | Denominator | Definition |
|---|---|---|---|
| 1 | `case_pass_rate_percent` | total rows | `#{pred_tier_id >= gold_tier_id}` over all rows (rows with `error` count as failures). |
| 2 | `case_exact_match_percent` | total rows | `#{pred_tier_id == gold_tier_id}` over all rows. |
| 3 | `trajectory_pass_rate_percent` | total rows | A row counts toward the numerator iff its entire trajectory passes (every step `pred_tier_id >= gold_tier_id`, no `error`). The row-weighted denominator makes this directly comparable to metric 1 and guarantees trajectory_pass_rate ≤ case_pass_rate. |
| 4 | `cost_savings_score_percent` | USD ratio | Full-cost savings under the trajectory-level natural-accounting user-bill model (failed trajectory = router's whole chain wasted + one full always-high re-run of the chain; see "Cost savings formula" below). Macro-weighted across benchmarks by total row count. Range (−∞, 100], normally [0, 100]. |
| 5 | `combined_score_percent` | — | Arithmetic mean of 1–4; NaN if any component is NaN. |
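The first three metrics can be reproduced from per-row records in a few lines. The record shape below (`pred` / `gold` / `error` / `instance_id` dicts) is a simplified stand-in for the library's `per_row` records, chosen for the sketch.

```python
from collections import defaultdict

def headline_rates(rows):
    """Metrics 1-3 over simplified per-row records:
    {"instance_id": str, "pred": int, "gold": int, "error": bool}."""
    total = len(rows)
    step_ok = lambda r: not r["error"] and r["pred"] >= r["gold"]
    case_pass = sum(step_ok(r) for r in rows)
    exact = sum(not r["error"] and r["pred"] == r["gold"] for r in rows)
    by_traj = defaultdict(list)
    for r in rows:
        by_traj[r["instance_id"]].append(r)
    # Rows in fully-passing trajectories, row-weighted (comparable to metric 1).
    traj_rows = sum(
        len(steps) for steps in by_traj.values() if all(step_ok(r) for r in steps)
    )
    return (100 * case_pass / total, 100 * exact / total, 100 * traj_rows / total)

rows = [
    {"instance_id": "a", "pred": 3, "gold": 1, "error": False},
    {"instance_id": "a", "pred": 0, "gold": 2, "error": False},  # fails trajectory "a"
    {"instance_id": "b", "pred": 1, "gold": 1, "error": False},
]
p, e, t = headline_rates(rows)
```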
All gold tiers are included. The underlying physical model is a trajectory-level natural-accounting user bill:
- Passed trajectory (no `error` AND every step `pred_tier_id >= gold_tier_id`): user bill = `Σ pred_cost`. Savings vs always-high = `Σ (baseline − pred)`.
- Failed trajectory (any `error` OR any step `pred_tier_id < gold_tier_id`): the router's entire chain is wasted and the whole trajectory has to be re-run with always-high. User bill = `Σ pred_cost` (router's full original chain) + `Σ baseline_cost` (one full-high retry of the whole chain). Savings vs always-high reduce to exactly `−Σ pred_cost`.
Per evaluable step of benchmark b, using the same full four-category cost model as router_accounting (step_full_cost_usd, see "Nominal pricing"), accumulation is driven by trajectory-level pass/fail:
```
D_b += baseline_cost                 # baseline = always-high step bill
if trajectory_passed:                # no error AND every step pred >= gold
    N_b += baseline_cost - pred_cost # credit this step's savings
else:                                # trajectory failed — no step gets credit
    N_b -= pred_cost                 # implicit: the Σ baseline full-high retry
                                     # is already covered by D = Σ baseline
```
Note: inside a failed trajectory, steps that individually pass (pred >= gold) are not credited with step-level savings — the whole chain has to be re-run, so individual step correctness doesn't rescue the trajectory. There is no additional -Σ baseline retry penalty on top: the one physical full-high retry is already captured by the denominator.
Across benchmarks the score is macro-weighted by total row count (same scope as metric 1):
```
cost_savings_score_percent = Σ_b (rows_b / total_rows) × (100 × N_b / D_b)
```
The scores_v2.by_benchmark.<b> block reports row_count, step_count, failed_trajectory_count, failed_retry_baseline_usd (informational: Σ baseline over failed-trajectory evaluable steps), D_usd, N_usd, cost_savings_score_percent, and weight_in_global_cost_savings for each benchmark.
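The accumulation and macro-weighting above can be written out as a runnable sketch. The step shape (`pred` / `gold` / `error` / `baseline_cost` / `pred_cost`) is a simplified stand-in for the library's records, and costs are supplied directly rather than derived from tokens.

```python
def cost_savings_score(trajectories):
    """Trajectory-level natural-accounting savings, macro-weighted by row count.
    `trajectories` maps benchmark -> list of trajectories; each trajectory is a
    list of steps {"pred", "gold", "error", "baseline_cost", "pred_cost"}."""
    total_rows = sum(len(t) for ts in trajectories.values() for t in ts)
    score = 0.0
    for ts in trajectories.values():
        D = N = 0.0
        rows_b = sum(len(t) for t in ts)
        for t in ts:
            passed = all(not s["error"] and s["pred"] >= s["gold"] for s in t)
            for s in t:
                if s["error"]:
                    continue  # skipped steps never enter D or N
                D += s["baseline_cost"]
                if passed:
                    N += s["baseline_cost"] - s["pred_cost"]
                else:
                    N -= s["pred_cost"]  # whole failed chain counts against savings
        score += (rows_b / total_rows) * (100 * N / D)
    return score

# One passed 2-step trajectory (saves 0.8 USD/step) and one failed 1-step one.
trajs = {"bench": [
    [{"pred": 0, "gold": 0, "error": False, "baseline_cost": 1.0, "pred_cost": 0.2},
     {"pred": 3, "gold": 1, "error": False, "baseline_cost": 1.0, "pred_cost": 0.2}],
    [{"pred": 0, "gold": 2, "error": False, "baseline_cost": 1.0, "pred_cost": 0.5}],
]}
score = cost_savings_score(trajs)  # D = 3.0, N = 1.6 - 0.5 = 1.1
```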
Still emitted in the eval summary for backward compatibility. New consumers should prefer the scores_v2 table above.
| Metric | Definition |
|---|---|
| `tier_match_accuracy` | Fraction of evaluable rows (no `error`) where `pred_tier_id == gold_tier_id`. Skipped rows are excluded from the denominator. Step-level (one row = one step). |
| `valid_response_rate` | Fraction of rows with a usable prediction (no recorded `error`). |
| Pass (`passed`) | `pred_tier_id >= gold_tier_id` (predicted tier is at least as capable as gold). Rows with `error` are not passed. Per row / step in summaries built from `evaluate_question_bank_rows`. |
| `pass_rate` | `passed / sampled` over all rows (same per-step semantics as `passed`). |
| `cost_savings_score` (legacy `section_11`) | Baseline = always high (tier id 3). For each passed row with gold strictly below high, the nominal step cost uses output tokens only and a uniform positive completion length `assumed_completion_tokens_per_routing_step` (default 1_000_000) per row: `cost(tier) = T × (output USD/1M for tier) / 10^6`. Then `save_gt = cost(high) − cost(gold)`, `save_test = cost(high) − cost(pred)`. Score = `100 × Σ save_test / Σ save_gt` over passed rows with `save_gt > 0`. |
Relation to task-level benchmarks: A task pass rate (e.g. whether a SWE-Bench instance is resolved) needs an end-to-end harness with executed trajectories. The question-bank eval here is the routing-supervision slice: it measures whether your router’s tier choice is sufficient (pass_rate) and how much nominal money it saves versus always using the highest tier under the stated assumptions (cost_savings_score and/or router_accounting).
Rows with the same instance_id form one trajectory (multi-turn supervision). step_index / total_steps order steps within that trajectory. Single-turn rows typically have total_steps == 1 and still carry an instance_id.
For router_accounting, evaluate_question_bank_rows and external merge tools (e.g. ClawRouter score_with_crb.py) should attach instance_id, step_index, total_steps, and messages to each per_row record so costs can be computed from the same prefixes the router saw.
Per-step costs for router_accounting count tokens from each row’s messages using native vendor tokenizers where available. The tokenizer for each tier is loaded by _load_tier_encoder (cached per tier):
| Tier | Tokenizer | Source |
|---|---|---|
| `high` | Anthropic native | Bundled JSON (`main/tokenizer_data/anthropic_tokenizer.json`) |
| `mid_high` | `cl100k_base` | Gemini has no offline tokenizer; tiktoken fallback |
| `mid` | MiniMax native | HuggingFace `MiniMaxAI/MiniMax-Text-01` |
| `low` | DeepSeek native | HuggingFace `deepseek-ai/DeepSeek-V3` |
If the tokenizers package is not installed, all tiers fall back to tiktoken cl100k_base. Per-message overhead (+4 tokens per message, +2 priming) is applied uniformly to approximate chat-format bookkeeping costs.
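The per-message bookkeeping can be sketched as follows. The `encode` stub here splits on whitespace purely for illustration; in the library, content tokens come from the tier's vendor tokenizer or the tiktoken fallback.

```python
def count_prompt_tokens(messages, encode=lambda s: s.split()):
    """Chat-format bookkeeping: content tokens plus +4 per message and
    +2 priming, as described above. `encode` is a whitespace stub standing
    in for the real tier tokenizer."""
    total = 2  # priming overhead
    for m in messages:
        total += 4  # per-message framing overhead
        total += len(encode(str(m.get("content", ""))))
    return total

msgs = [
    {"role": "user", "content": "hello there"},
    {"role": "assistant", "content": "hi"},
]
n = count_prompt_tokens(msgs)  # 2 + (4 + 2) + (4 + 1) = 13
```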
- Semantic prefix check: consecutive-step `messages` are compared on `role`, `content` (string or list-of-blocks; `cache_control` inside blocks is ignored), `tool_calls`, `tool_call_id`, and `name`, so harmless serialization differences from upstream log export do not break cache accounting.
- Prompt split (per path: baseline / gold / pred): baseline tier is always `high`. A cold start (first step, tier switch, cache TTL exceeded, or prefix mismatch) bills the full prompt as cache write. If the tier is unchanged, the cache has not expired, and the previous `messages` are a semantic prefix of the current ones, the prefix is cache read and the delta is cache write.
- Cache TTL: if the same tier was last called more than 3 global steps ago, the cache is considered expired and a full cache write is triggered. This models realistic prompt-cache expiry in multi-step agent traces where steps may be interspersed with other tiers.
- Output tokens: for step i with a following step, estimated from the `messages` delta (assistant role only, including `tool_calls` JSON), using that step's gold-tier tokenizer. The last step in a trajectory uses the trajectory's average of those estimates when available, else `fallback_output_tokens` (see the `router_accounting` JSON field).
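The semantic prefix check from the first bullet can be sketched like this. It is a simplified re-implementation for illustration; the package's own comparison is authoritative.

```python
def _norm_content(content):
    # Strip cache_control from list-of-blocks content; strings pass through.
    if isinstance(content, list):
        return [{k: v for k, v in b.items() if k != "cache_control"} for b in content]
    return content

def _msg_key(m):
    # Only the fields named above participate in the comparison.
    return (
        m.get("role"),
        m.get("tool_calls"),
        m.get("tool_call_id"),
        m.get("name"),
        _norm_content(m.get("content")),
    )

def is_semantic_prefix(prev_messages, curr_messages):
    """True if prev is a prefix of curr under the compared-field view."""
    if len(prev_messages) > len(curr_messages):
        return False
    return all(
        _msg_key(p) == _msg_key(c) for p, c in zip(prev_messages, curr_messages)
    )
```

With this check, a `cache_control` annotation added by one exporter but not another does not force a spurious cold start.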
Still emitted in the eval summary for backward compatibility. Superseded by scores_v2 (which keeps trajectory-level pass/fail but switches to D = Σ baseline and uses a trajectory-level natural-accounting numerator where a failed trajectory loses all step-level savings credit — an implicit one-time full-high retry of the whole chain, no additional penalty coefficient).
Computed in compute_router_accounting_metrics (main.eval.section11). Steps with error are excluded from evaluable_step_count and from D_usd / N_usd (they never enter the per-step cost loop). Any trajectory that contains error on at least one step is failed for pass_rate_percent and exact_match_rate_percent (those steps still set has_error and clear trajectory pass/exact flags).
Trajectory pass: no step has error, and every evaluable step satisfies pred_tier_id >= gold_tier_id.
Trajectory exact: trajectory pass and every evaluable step has pred_tier_id == gold_tier_id.
Savings numerator N_usd and denominator D_usd: summed over evaluable steps only (steps without error, with int pred_tier_id / gold_tier_id). For each such step, baseline_cost, gold_cost, and pred_cost are the full four-category USD costs at the respective tiers (see Nominal pricing). D_usd += baseline_cost − gold_cost. If the trajectory passes, N_usd += baseline_cost − pred_cost on each evaluable step; if the trajectory fails (any error or any pred < gold), N_usd -= pred_cost on every evaluable step in that trajectory (all predicted routing spend counts against N).
| Field | Definition |
|---|---|
| `total_trajectories` | Count of distinct `instance_id` groups in the scored row list. |
| `passed_trajectories` / `exact_match_trajectories` | Trajectory-level pass / all-step-exact counts. |
| `evaluable_step_count` | Steps without `error` with int tier ids (contribute to `D_usd` / `N_usd`). |
| `skipped_step_count` | Rows with `error`. |
| `D_usd` | Σ (baseline_cost − gold_cost) over evaluable steps. |
| `N_usd` | As above (pass vs fail trajectory rules). |
| `pass_rate_percent` | 100 × passed_trajectories / total_trajectories. NaN if no trajectories. |
| `exact_match_rate_percent` | 100 × exact_match_trajectories / total_trajectories. Not the same as top-level `tier_match_accuracy` (which remains the step-level exact rate). NaN if no trajectories. |
| `accounting_savings_score_percent` | 100 × N_usd / D_usd when D_usd > 0; NaN if D_usd == 0 or there are no trajectories. |
| `overall_score_percent` | Mean of pass_rate_percent, exact_match_rate_percent, accounting_savings_score_percent; NaN if any component is NaN. |
| `fallback_output_tokens` | Constant used when output tokens cannot be inferred from `messages` deltas. |
Top-level tier_match_accuracy (0–1) and accuracy_excluding_errors remain step-level exact-match rates (same value). by_benchmark / exact_match counts are also step-level (rows with match).
```python
from main import iter_question_bank, iter_routing_supervision

# Full bank (single file data/question_bank.jsonl)
for row in iter_question_bank():
    ...

# Only rows whose benchmark field is "swebench"
for row in iter_routing_supervision("swebench"):
    messages = row["messages"]
    tier = row["target_tier"]
    tier_id = row["target_tier_id"]
```

```python
from main.metrics import CaseMetrics, aggregate_routerbench_metrics

cases = [
    CaseMetrics(
        case_id="a",
        task_passed=True,
        baseline_cost_nominal=1.0,
        optimal_cost_nominal=0.4,
        test_cost_nominal=0.5,
    ),
]
summary = aggregate_routerbench_metrics(cases)
```

```python
from main.metrics import routing_supervision_accuracy

acc = routing_supervision_accuracy(gold_rows, predictions_by_id)
```

`OpenAICompatRouterClassifier` sends one case per request: `system` (plain string, or Anthropic-style cached block list when `system_prompt_cache` is on / auto+Claude) plus one user message whose content is a string (your full case text). The model must reply with exactly one character `0`–`3` (`target_tier_id`: low→0, mid→1, mid_high→2, high→3). Responses containing newlines or extra text raise `ValueError` on parse.
```python
from main import OpenAICompatRouterClassifier, question_bank_messages_to_classifier_prompt

clf = OpenAICompatRouterClassifier(
    base_url="https://api.example.com/v1",
    api_key="...",
    model="deepseek/deepseek-v3.2",
    system_prompt_cache="auto",
)
prompt = question_bank_messages_to_classifier_prompt(row["messages"])
result = clf.predict_tier_id(prompt)
assert result.tier_id == row["target_tier_id"]
```

Lower-level helpers: `parse_tier_response_to_id`, `build_system_content`, `post_chat_completions`, `chat_completions_url`. Default instructions live in `DEFAULT_ROUTER_SYSTEM_INSTRUCTION`.
Programmatic entry point for sampling, scoring, and pluggable predictors (FunctionPredictor, LlmDigitClassifierPredictor, or any QuestionBankRouterPredictor). See Benchmark usage and Scoring rules for semantics.
Implement `QuestionBankRouterPredictor` (method `predict(row) -> TierPrediction`) or use:

- `FunctionPredictor`: wraps any `callable(row: dict) -> int` (heuristics, sklearn `predict`, etc.); no chat prompt.
- `LlmDigitClassifierPredictor`: optional OpenAI-compat chat wrapper around `OpenAICompatRouterClassifier` and `question_bank_messages_to_classifier_prompt`.
```python
from main.eval import (
    FunctionPredictor,
    LlmDigitClassifierPredictor,
    run_question_bank_eval,
    evaluate_question_bank_rows,
    build_eval_summary,
    select_question_bank_rows,
)

# Rules / sklearn-style: tier_id only from the row (example: always use gold — not a real model)
oracle = FunctionPredictor(lambda row: row["target_tier_id"])

rows, sample_mode, quotas = select_question_bank_rows(n=20, seed=1)
per_row, errors, correct = evaluate_question_bank_rows(
    oracle, rows, predictor_label="oracle_gold"
)
summary = build_eval_summary(
    per_row=per_row,
    errors=errors,
    correct=correct,
    predictor_label="oracle_gold",
    shard="data/question_bank.jsonl",
    sample_mode=sample_mode,
    seed=1,
    proportional_quotas=quotas,
)

# One-shot (loads bank from package data paths):
# summary = run_question_bank_eval(oracle, predictor_label="oracle_gold", n=20, seed=1)
```

Public helpers also include `manifest_proportional_quotas`, `proportional_reservoir_sample`, `load_all_question_bank_rows`, `compute_section11`, `compute_router_accounting_metrics`, `compute_v2_scores`, and `aggregate_by_benchmark`.
```shell
python -m main.cli metrics --cases path/to/cases.json
CommonRouterBench metrics --cases path/to/cases.json
```

When calling `OpenAICompatRouterClassifier` from your application, configure the API with environment variables or your own config layer. `.env.example` lists common variable names (`OPENROUTER_*` or `OPENAI_*` / `API_KEY` + `BASE_URL`); the client expects a base URL that already includes `/v1`.
- Add `[project.urls]` to `pyproject.toml` (`Homepage`, `Repository`, etc.) before uploading to PyPI so the project page links resolve.
- Ensure `data/question_bank.jsonl` and `data/manifest.json` exist if you want them inside the built wheel (see `[tool.setuptools.package-data]` in `pyproject.toml`).
- Bump `version` in `pyproject.toml` and append a section to `CHANGELOG.md`.
- Build and upload:

```shell
pip install build twine
python -m build
twine check dist/*
twine upload dist/*
```

Naming reminder: the PyPI / pip distribution is `CommonRouterBench`; the only shipped import top-level package is `main`. Avoid shadowing `main` in small throwaway scripts (e.g. do not name your module `main.py` next to snippets that `import main`).
Apache-2.0 (see LICENSE and pyproject.toml). Third-party benchmark data may carry separate licenses.