CommonRouterBench

Routing supervision gold labels for workloads that actually burn tokens.

Chinese version: README.zh.md

CommonRouterBench provides per-step LLM routing gold labels for scenarios where routing actually matters: multi-step agents, long-context RAG, and vibecoding loops. Instead of relying on flawed LLM-as-judge scores or single-shot questions, our labels are generated by analyzing successful traces and performing a sequential downgrade search to find the cheapest capability tier that still passes the rigorous task checks.

If you are building routers for high-consumption, multi-turn scenarios, you can use this dataset to evaluate your router or train tier predictors using realistic prefixes (conversation + tool outputs + diffs).

Quick Start

1. Install the package

# From this directory (editable, for development):
pip install -e .

# Or from PyPI (once published):
pip install CommonRouterBench

(Note: The PyPI distribution is CommonRouterBench, but you import main in your code.)

2. Evaluate a router

from main.eval import FunctionPredictor, run_question_bank_eval

# Example: an oracle that always predicts the gold tier
oracle = FunctionPredictor(lambda row: row["target_tier_id"])
summary = run_question_bank_eval(oracle, n=20, seed=1)
# Headline: four component scores plus their arithmetic mean.
print(summary["scores_v2"])

About this distribution

  • What is shipped: This directory is the only part of the repository intended for open source. Published artifacts ship the main package, the data/ question bank, and documentation. Private test harnesses are excluded.
  • Version: 0.1.0 (see CHANGELOG.md).
  • Dependencies: The core package depends on requests (HTTP helpers in main.router_llm), tiktoken (fallback token counting), and tokenizers (HuggingFace — native vendor tokenizers for main.tokenizer).
  • Local tests: You may keep a tests/ directory beside pyproject.toml for pytest; it is .gitignored.

Project layout

| Path | Purpose |
| --- | --- |
| main/ | Published Python package (import main). |
| data/ | question_bank.jsonl and manifest.json (bundled in wheels when present at build time). |

Regenerating data/ from private benchmark exports is out of scope for the published package; keep your own merge tooling outside this tree if needed.

Data layout

Artifacts under data/:

  • data/question_bank.jsonl — all routing-step records in one file (no per-benchmark subdirectories).
  • data/manifest.json — per-source line counts and schema metadata.

Each line includes a string field benchmark (e.g. swebench, mtrag) for filtering.
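For instance, the bank can be read with the standard library alone. A minimal sketch, assuming you run it from a checkout with data/ present (the packaged helpers under "Python API" below are the supported entry points):

import json

with open("data/question_bank.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

# Filter on the benchmark field described above.
swebench_rows = [r for r in rows if r["benchmark"] == "swebench"]
print(len(rows), len(swebench_rows))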

Open corpus: tier-only targets (no model IDs)

The question bank does not include fields such as optimal_model or baseline_model. Supervision is only by capability tier, using English labels and a numeric id.

| target_tier (string) | target_tier_id (int) | Tier (plain English) |
| --- | --- | --- |
| low | 0 | Low |
| mid | 1 | Medium |
| mid_high | 2 | Medium-high |
| high | 3 | High |

Chinese tier label equivalents for the English target_tier strings are summarized in README.zh.md.

Each line includes at least: id, benchmark, scenario, instance_id, step_index, total_steps, messages, target_tier, target_tier_id.
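Schematically, one line looks like the Python dict below. The field names come from the schema above; every value is invented for illustration (including step_index / total_steps), so treat it as a shape reference only:

row = {
    "id": "...",                    # unique row identifier
    "benchmark": "swebench",        # one of: swebench, bfcl, mtrag, qmsum, pinchbench
    "scenario": "...",              # scenario tag
    "instance_id": "...",           # groups rows into one trajectory
    "step_index": 2,                # position within the trajectory (illustrative)
    "total_steps": 7,               # trajectory length (illustrative)
    "messages": [{"role": "user", "content": "..."}],  # conversation prefix
    "target_tier": "mid_high",      # gold capability tier, string label
    "target_tier_id": 2,            # gold capability tier, numeric id
}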

Data distribution

The following counts match the data/question_bank.jsonl and data/manifest.json shipped in this repository (970 routing-step rows). Rebuilding the bank from private exports may change these figures.

For BFCL, the public corpus now includes both single-turn and multi-turn routing supervision rows.

Rows by benchmark

| benchmark | Rows | Share of bank |
| --- | --- | --- |
| swebench | 336 | 34.6% |
| bfcl | 248 | 25.6% |
| mtrag | 193 | 19.9% |
| qmsum | 145 | 14.9% |
| pinchbench | 48 | 4.9% |
| Total | 970 | 100% |

Gold target_tier (full bank)

| target_tier | target_tier_id | Rows | Share |
| --- | --- | --- | --- |
| low | 0 | 689 | 71.0% |
| mid | 1 | 62 | 6.4% |
| mid_high | 2 | 49 | 5.1% |
| high | 3 | 170 | 17.5% |
| Total | | 970 | 100% |

Gold target_tier by benchmark (row counts)

| benchmark | Rows | low | mid | mid_high | high |
| --- | --- | --- | --- | --- | --- |
| bfcl | 248 | 239 | 8 | 1 | 0 |
| mtrag | 193 | 183 | 8 | 1 | 1 |
| pinchbench | 48 | 41 | 3 | 3 | 1 |
| qmsum | 145 | 132 | 10 | 3 | 0 |
| swebench | 336 | 94 | 33 | 41 | 168 |

Nominal pricing (USD per 1M tokens)

Authoritative values live in main.pricing: TIER_OUTPUT_USD_PER_1M, TIER_INPUT_USD_PER_1M, TIER_CACHE_READ_USD_PER_1M, and TIER_CACHE_WRITE_USD_PER_1M. Legacy section_11 / step_nominal_cost_usd use output pricing only (TIER_OUTPUT_USD_PER_1M). The tables below mirror the shipped constants.

Output (completion) tokens

| Public target_tier | USD / 1M output tokens |
| --- | --- |
| low | 0.5 |
| mid | 2.0 |
| mid_high | 5.0 |
| high | 25.0 |

Input, cache read, and cache write (used by router_accounting only)

| Public target_tier | USD / 1M input | USD / 1M cache read | USD / 1M cache write |
| --- | --- | --- | --- |
| low | 0.26 | 0.13 | 0.26 |
| mid | 0.30 | 0.059 | 0.30 |
| mid_high | 0.50 | 0.05 | 0.08333 |
| high | 5.0 | 0.50 | 6.25 |

For tiers without a published cache-write price (low, mid), we conservatively assume cache write = base input price.

When computing costs from concrete model endpoints inside your harness, this library maps known model ids to these tiers and raises ValueError on unknown ids. That mapping lives in code only, not in the open JSONL.
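For example, a sketch of the legacy output-only step bill. This assumes TIER_OUTPUT_USD_PER_1M maps the public target_tier string to the USD price; check main.pricing for the exact shape of the shipped constants:

from main.pricing import TIER_OUTPUT_USD_PER_1M

def step_output_cost_usd(tier: str, output_tokens: int) -> float:
    # cost = tokens × (USD per 1M output tokens) / 1e6
    return output_tokens * TIER_OUTPUT_USD_PER_1M[tier] / 1_000_000

# One million output tokens at tier "high" bills the full $25.00 from the table above.
print(step_output_cost_usd("high", 1_000_000))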

Benchmark usage: wiring predictors and scoring

Each line in data/question_bank.jsonl is one routing supervision step: a conversation prefix (messages) and a gold capability tier (target_tier / target_tier_id). Any router you plug in must produce a tier id in {0,1,2,3} for that step. The library scores predictions against gold using the rules below.

Sampling

  • Full bank (run_question_bank_eval(..., n=None)): every row, in file order (~970 steps in the public build).
  • Fixed size, stratified by source — pass --n N (CLI) or n=N (API): largest-remainder quotas by data/manifest.json sources.*.line_count, then one-pass reservoir sampling per benchmark stratum (--seed fixes RNG). This keeps the five logical benchmarks (swebench, pinchbench, mtrag, qmsum, bfcl) in roughly the same ratio as the full corpus.

Report sample_mode, benchmark_counts, and by_benchmark from the eval JSON so others can reproduce your split.
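The shipped quota helper is manifest_proportional_quotas (see "Python API" below); the standalone sketch here only illustrates the largest-remainder step described above and is not the library's implementation:

def largest_remainder_quotas(line_counts: dict, n: int) -> dict:
    total = sum(line_counts.values())
    exact = {b: n * c / total for b, c in line_counts.items()}
    quotas = {b: int(e) for b, e in exact.items()}  # floor allocation first
    leftover = n - sum(quotas.values())
    # Hand the remaining slots to the largest fractional remainders.
    for b in sorted(exact, key=lambda k: exact[k] - quotas[k], reverse=True)[:leftover]:
        quotas[b] += 1
    return quotas

print(largest_remainder_quotas(
    {"swebench": 336, "bfcl": 248, "mtrag": 193, "qmsum": 145, "pinchbench": 48},
    n=20,
))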

OpenAI-compatible chat hook (single-digit tier output)

For teams that choose to call a chat model behind an OpenAI-compatible HTTP API, this package exposes a digit-tier contract via OpenAICompatRouterClassifier and LlmDigitClassifierPredictor. That is a reference integration only—not a recommendation that LLM-based routing is preferable to rules, classical ML, or other designs.

The contract is:

  1. Linearize the row’s messages into one user string (question_bank_messages_to_classifier_prompt).
  2. Send one chat completion per row; the assistant message must be parseable as a single digit 0–3 (optional surrounding whitespace; no extra lines or prose — see parse_tier_response_to_id).
  3. Call run_question_bank_eval / evaluate_question_bank_rows (from main.eval) in your own driver to load rows, call the predictor, and aggregate the JSON summary.
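The shipped parser is parse_tier_response_to_id; the standalone sketch below only illustrates the accepted shape (one digit 0–3 with optional surrounding whitespace, nothing else):

def parse_digit_tier(text: str) -> int:
    s = text.strip()
    if len(s) == 1 and s in "0123":
        return int(s)
    raise ValueError(f"expected a single digit 0-3, got {text!r}")

assert parse_digit_tier(" 2\n") == 2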

Arbitrary predictors (rules, sklearn, etc.)

Implement a function f(row: dict) -> int that returns target_tier_id in 0..3 from the raw row (you may ignore messages or engineer features from them). Wrap it with FunctionPredictor and pass it to run_question_bank_eval or evaluate_question_bank_rows. No HTTP and no chat template are required; the same JSON summary and by_benchmark breakdown apply.
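For example, a deliberately crude length heuristic (the feature and threshold are invented for illustration and will not score well):

from main.eval import FunctionPredictor, run_question_bank_eval

def tier_by_prompt_length(row: dict) -> int:
    # Total characters across message contents as a stand-in feature.
    chars = sum(len(str(m.get("content", ""))) for m in row["messages"])
    return 3 if chars > 50_000 else 0

summary = run_question_bank_eval(
    FunctionPredictor(tier_by_prompt_length),
    predictor_label="length_heuristic",
    n=100,
    seed=1,
)
print(summary["scores_v2"])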

Scoring rules (routing-step evaluation)

These metrics are computed by main.eval.

scores_v2 (top-level field in the eval summary, computed by compute_v2_scores) is the recommended headline: four orthogonal dimension scores plus their arithmetic mean. section_11 (legacy cost_savings_score, one step per row, output-token-only nominal cost) and router_accounting (trajectory-level D = Σ(baseline − gold) with N gated on trajectory pass/fail) are still emitted for backward compatibility but are superseded by scores_v2. Neither path requires running full benchmark tasks to completion.

Headline metrics (scores_v2)

| # | Field | Denominator | Definition |
| --- | --- | --- | --- |
| 1 | case_pass_rate_percent | total rows | #{pred_tier_id >= gold_tier_id} over all rows (rows with error count as failures). |
| 2 | case_exact_match_percent | total rows | #{pred_tier_id == gold_tier_id} over all rows. |
| 3 | trajectory_pass_rate_percent | total rows | A row counts toward the numerator iff its entire trajectory passes (every step pred_tier_id >= gold_tier_id, no error). The row-weighted denominator makes this directly comparable to metric 1 and guarantees trajectory_pass_rate ≤ case_pass_rate. |
| 4 | cost_savings_score_percent | USD ratio | Full-cost savings under the trajectory-level natural-accounting user-bill model (failed trajectory = router's whole chain wasted + one full always-high re-run of the chain; see "Cost savings formula" below). Macro-weighted across benchmarks by total row count. Range (−∞, 100], normally [0, 100]. |
| 5 | combined_score_percent | n/a | Arithmetic mean of 1–4; NaN if any component is NaN. |

Cost savings formula (metric 4)

All gold tiers are included. The underlying physical model is a trajectory-level natural-accounting user bill:

  • Passed trajectory (no error AND every step pred_tier_id >= gold_tier_id): user bill = Σ pred_cost. Savings vs always-high = Σ (baseline − pred).
  • Failed trajectory (any error OR any step pred_tier_id < gold_tier_id): the router's entire chain is wasted and the whole trajectory has to be re-run with always-high. User bill = Σ pred_cost (router's full original chain) + Σ baseline_cost (one full-high retry of the whole chain). Savings vs always-high reduce to exactly −Σ pred_cost.

Per evaluable step of benchmark b, using the same full four-category cost model as router_accounting (step_full_cost_usd, see "Nominal pricing"), accumulation is driven by trajectory-level pass/fail:

D_b  += baseline_cost                                  # baseline = always-high step bill
if trajectory_passed:                                  # no error AND every step pred >= gold
    N_b += baseline_cost - pred_cost                   # credit this step's savings
else:                                                  # trajectory failed — no step gets credit
    N_b -= pred_cost                                   # implicit: the Σ baseline full-high retry
                                                       # is already covered by D = Σ baseline

Note: inside a failed trajectory, steps that individually pass (pred >= gold) are not credited with step-level savings — the whole chain has to be re-run, so individual step correctness doesn't rescue the trajectory. There is no additional -Σ baseline retry penalty on top: the one physical full-high retry is already captured by the denominator.

Across benchmarks the score is macro-weighted by total row count (same scope as metric 1):

cost_savings_score_percent = Σ_b (rows_b / total_rows) × (100 × N_b / D_b)

The scores_v2.by_benchmark.<b> block reports row_count, step_count, failed_trajectory_count, failed_retry_baseline_usd (informational: Σ baseline over failed-trajectory evaluable steps), D_usd, N_usd, cost_savings_score_percent, and weight_in_global_cost_savings for each benchmark.
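As a toy worked example (flat per-step bills, purely illustrative; real bills come from the four-category token accounting under "Token counting" below): take a two-step trajectory with gold low at every step, routed entirely to low, with a $25 always-high step bill and a $0.50 low step bill.

steps = 2
baseline, pred = 25.0, 0.50            # illustrative flat per-step bills (USD)

D = steps * baseline                   # 50.0, the always-high denominator
N_passed = steps * (baseline - pred)   # 49.0 -> 100 * 49 / 50 = 98% savings
N_failed = -steps * pred               # -1.0 -> 100 * -1 / 50 = -2% savings

In the failed case the one full-high retry ($50) is already the denominator, so the only additional penalty is the wasted routed spend.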

Legacy per-row / per-step fields (section_11)

Still emitted in the eval summary for backward compatibility. New consumers should prefer the scores_v2 table above.

| Metric | Definition |
| --- | --- |
| tier_match_accuracy | Fraction of evaluable rows (no error) where pred_tier_id == gold_tier_id. Skipped rows are excluded from the denominator. Step-level (one row = one step). |
| valid_response_rate | Fraction of rows with a usable prediction (no recorded error). |
| Pass (passed) | pred_tier_id >= gold_tier_id (predicted tier is at least as capable as gold). Rows with error are not passed. Per row / step in summaries built from evaluate_question_bank_rows. |
| pass_rate | passed / sampled over all rows (same per-step semantics as passed). |
| cost_savings_score (legacy section_11) | Baseline = always high (tier id 3). For each passed row with gold strictly below high, the nominal step cost uses output tokens only and a uniform positive completion length assumed_completion_tokens_per_routing_step (default 1_000_000) per row: cost(tier) = T × (output USD/1M for tier) / 10^6. Then save_gt = cost(high) − cost(gold) and save_test = cost(high) − cost(pred). Score = 100 × Σ save_test / Σ save_gt over passed rows with save_gt > 0. |

Relation to task-level benchmarks: A task pass rate (e.g. whether a SWE-Bench instance is resolved) needs an end-to-end harness with executed trajectories. The question-bank eval here is the routing-supervision slice: it measures whether your router’s tier choice is sufficient (pass_rate) and how much nominal money it saves versus always using the highest tier under the stated assumptions (cost_savings_score and/or router_accounting).

Trajectories (instance_id)

Rows with the same instance_id form one trajectory (multi-turn supervision). step_index / total_steps order steps within that trajectory. Single-turn rows typically have total_steps == 1 and still carry an instance_id.

For router_accounting, evaluate_question_bank_rows and external merge tools (e.g. ClawRouter score_with_crb.py) should attach instance_id, step_index, total_steps, and messages to each per_row record so costs can be computed from the same prefixes the router saw.

Token counting (main.tokenizer)

Per-step costs for router_accounting count tokens from each row’s messages using native vendor tokenizers where available. The tokenizer for each tier is loaded by _load_tier_encoder (cached per tier):

| Tier | Tokenizer | Source |
| --- | --- | --- |
| high | Anthropic native | Bundled JSON (main/tokenizer_data/anthropic_tokenizer.json) |
| mid_high | cl100k_base | Gemini has no offline tokenizer; tiktoken fallback |
| mid | MiniMax native | HuggingFace MiniMaxAI/MiniMax-Text-01 |
| low | DeepSeek native | HuggingFace deepseek-ai/DeepSeek-V3 |

If the tokenizers package is not installed, all tiers fall back to tiktoken cl100k_base. Per-message overhead (+4 tokens per message, +2 priming) is applied uniformly to approximate chat-format bookkeeping costs.
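A sketch of the fallback path only (native vendor tokenizers replace cl100k_base when tokenizers is installed, and list-of-blocks content is simplified to str() here; the shipped logic lives behind _load_tier_encoder):

import tiktoken

def approx_prompt_tokens(messages: list) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    total = 2  # +2 priming overhead
    for m in messages:
        # +4 per-message overhead, applied uniformly
        total += 4 + len(enc.encode(str(m.get("content", ""))))
    return total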

  • Semantic prefix check: consecutive-step messages are compared on role, content (string or list-of-blocks; cache_control inside blocks is ignored), tool_calls, tool_call_id, and name, so harmless serialization differences from upstream log export do not break cache accounting.
  • Prompt split (per path: baseline / gold / pred): baseline tier is always high. A cold start (first step, tier switch, cache TTL exceeded, or prefix mismatch) bills the full prompt as cache write. If the tier is unchanged, the cache has not expired, and the previous messages are a semantic prefix of the current ones, the prefix is cache read and the delta is cache write.
  • Cache TTL: if the same tier was last called more than 3 global steps ago, the cache is considered expired and a full cache-write is triggered. This models realistic prompt-cache expiry in multi-step agent traces where steps may be interspersed with other tiers.
  • Output tokens: for step i with a following step, estimated from messages delta (assistant role only, including tool_calls JSON), using that step’s gold tier tokenizer. The last step in a trajectory uses the trajectory’s average of those estimates when available, else fallback_output_tokens (see router_accounting JSON field).
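Sketched as a function, the per-path split rules above reduce to roughly the following (illustrative pseudocode of the documented rules, not the library's internals; token counts come from the tier tokenizer described above):

def split_prompt_tokens(prompt_tokens: int, prefix_tokens: int, cold_start: bool):
    """Return (cache_read_tokens, cache_write_tokens) for one step on one path."""
    if cold_start:
        # First step, tier switch, cache TTL exceeded, or prefix mismatch:
        # the full prompt is billed as cache write.
        return 0, prompt_tokens
    # Warm cache: the previous messages are a semantic prefix of the current ones,
    # so the prefix is billed as cache read and only the delta as cache write.
    return prefix_tokens, prompt_tokens - prefix_tokens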

Legacy trajectory-level fields (router_accounting)

Still emitted in the eval summary for backward compatibility. Superseded by scores_v2 (which keeps trajectory-level pass/fail but switches to D = Σ baseline and uses a trajectory-level natural-accounting numerator where a failed trajectory loses all step-level savings credit — an implicit one-time full-high retry of the whole chain, no additional penalty coefficient).

Computed in compute_router_accounting_metrics (main.eval.section11). Steps with error are excluded from evaluable_step_count and from D_usd / N_usd (they never enter the per-step cost loop). Any trajectory that contains error on at least one step is failed for pass_rate_percent and exact_match_rate_percent (those steps still set has_error and clear trajectory pass/exact flags).

Trajectory pass: no step has error, and every evaluable step satisfies pred_tier_id >= gold_tier_id.

Trajectory exact: trajectory pass and every evaluable step has pred_tier_id == gold_tier_id.

Savings numerator N_usd and denominator D_usd: summed over evaluable steps only (steps without error, with int pred_tier_id / gold_tier_id). For each such step, baseline_cost, gold_cost, and pred_cost are the full four-category USD costs at the respective tiers (see Nominal pricing). D_usd += baseline_cost − gold_cost. If the trajectory passes, N_usd += baseline_cost − pred_cost on each evaluable step; if the trajectory fails (any error or any pred < gold), N_usd -= pred_cost on every evaluable step in that trajectory (all predicted routing spend counts against N).

| Field | Definition |
| --- | --- |
| total_trajectories | Count of distinct instance_id groups in the scored row list. |
| passed_trajectories / exact_match_trajectories | Trajectory-level pass / all-step exact counts. |
| evaluable_step_count | Steps without error with int tier ids (contribute to D_usd / N_usd). |
| skipped_step_count | Rows with error. |
| D_usd | Σ (baseline_cost − gold_cost) over evaluable steps. |
| N_usd | As above (pass vs fail trajectory rules). |
| pass_rate_percent | 100 × passed_trajectories / total_trajectories. NaN if no trajectories. |
| exact_match_rate_percent | 100 × exact_match_trajectories / total_trajectories. Not the same as top-level tier_match_accuracy (which remains the step-level exact rate). NaN if no trajectories. |
| accounting_savings_score_percent | 100 × N_usd / D_usd when D_usd > 0; NaN if D_usd == 0 or there are no trajectories. |
| overall_score_percent | Mean of pass_rate_percent, exact_match_rate_percent, and accounting_savings_score_percent; NaN if any component is NaN. |
| fallback_output_tokens | Constant used when output tokens cannot be inferred from messages deltas. |

Top-level tier_match_accuracy (0–1) and accuracy_excluding_errors remain step-level exact-match rates (same value). by_benchmark / exact_match counts are also step-level (rows with match).

Python API

from main import iter_question_bank, iter_routing_supervision

# Full bank (single file data/question_bank.jsonl)
for row in iter_question_bank():
    ...

# Only rows whose benchmark field is "swebench"
for row in iter_routing_supervision("swebench"):
    messages = row["messages"]
    tier = row["target_tier"]
    tier_id = row["target_tier_id"]

from main.metrics import CaseMetrics, aggregate_routerbench_metrics

cases = [
    CaseMetrics(
        case_id="a",
        task_passed=True,
        baseline_cost_nominal=1.0,
        optimal_cost_nominal=0.4,
        test_cost_nominal=0.5,
    ),
]
summary = aggregate_routerbench_metrics(cases)

from main.metrics import routing_supervision_accuracy

# gold_rows: question-bank rows; predictions_by_id: mapping from row id to predicted tier id
acc = routing_supervision_accuracy(gold_rows, predictions_by_id)

Router LLM API (OpenAI-compatible chat completions)

OpenAICompatRouterClassifier sends one case per request: system (plain string, or Anthropic-style cached block list when system_prompt_cache is on / auto+Claude) plus one user message whose content is a string (your full case text). The model must reply with exactly one character 0–3 (target_tier_id: low→0, mid→1, mid_high→2, high→3). Responses containing newlines or extra text raise ValueError on parse.

from main import OpenAICompatRouterClassifier, question_bank_messages_to_classifier_prompt

clf = OpenAICompatRouterClassifier(
    base_url="https://api.example.com/v1",
    api_key="...",
    model="deepseek/deepseek-v3.2",
    system_prompt_cache="auto",
)
prompt = question_bank_messages_to_classifier_prompt(row["messages"])
result = clf.predict_tier_id(prompt)
assert result.tier_id in {0, 1, 2, 3}  # compare result.tier_id with row["target_tier_id"] for gold agreement

Lower-level helpers: parse_tier_response_to_id, build_system_content, post_chat_completions, chat_completions_url. Default instructions live in DEFAULT_ROUTER_SYSTEM_INSTRUCTION.

Question-bank evaluation (main.eval)

Programmatic entry point for sampling, scoring, and pluggable predictors (FunctionPredictor, LlmDigitClassifierPredictor, or any QuestionBankRouterPredictor). See Benchmark usage and Scoring rules for semantics.

Implement QuestionBankRouterPredictor (method predict(row) -> TierPrediction) or use:

  • FunctionPredictor: wraps any callable(row: dict) -> int (heuristics, sklearn predict, etc.); no chat prompt.
  • LlmDigitClassifierPredictor: optional OpenAI-compat chat wrapper around OpenAICompatRouterClassifier and question_bank_messages_to_classifier_prompt.

from main.eval import (
    FunctionPredictor,
    LlmDigitClassifierPredictor,
    run_question_bank_eval,
    evaluate_question_bank_rows,
    build_eval_summary,
    select_question_bank_rows,
)

# Rules / sklearn-style: tier_id only from the row (example: always use gold — not a real model)
oracle = FunctionPredictor(lambda row: row["target_tier_id"])
rows, sample_mode, quotas = select_question_bank_rows(n=20, seed=1)
per_row, errors, correct = evaluate_question_bank_rows(
    oracle, rows, predictor_label="oracle_gold"
)
summary = build_eval_summary(
    per_row=per_row,
    errors=errors,
    correct=correct,
    predictor_label="oracle_gold",
    shard="data/question_bank.jsonl",
    sample_mode=sample_mode,
    seed=1,
    proportional_quotas=quotas,
)

# One-shot (loads bank from package data paths):
# summary = run_question_bank_eval(oracle, predictor_label="oracle_gold", n=20, seed=1)

Public helpers also include manifest_proportional_quotas, proportional_reservoir_sample, load_all_question_bank_rows, compute_section11, compute_router_accounting_metrics, compute_v2_scores, and aggregate_by_benchmark.

CLI

python -m main.cli metrics --cases path/to/cases.json
CommonRouterBench metrics --cases path/to/cases.json

When calling OpenAICompatRouterClassifier from your application, configure the API with environment variables or your own config layer. .env.example lists common variable names (OPENROUTER_* or OPENAI_* / API_KEY + BASE_URL); the client expects a base URL that already includes /v1.
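For instance (the variable names below are illustrative; map them to whatever your .env.example defines, e.g. the OPENROUTER_* or OPENAI_* families):

import os
from main import OpenAICompatRouterClassifier

clf = OpenAICompatRouterClassifier(
    base_url=os.environ["BASE_URL"],  # must already include /v1
    api_key=os.environ["API_KEY"],
    model=os.environ.get("MODEL", "deepseek/deepseek-v3.2"),
)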

Publishing (maintainers)

  1. Add [project.urls] to pyproject.toml (Homepage, Repository, etc.) before uploading to PyPI so the project page links resolve.
  2. Ensure data/question_bank.jsonl and data/manifest.json exist if you want them inside the built wheel (see [tool.setuptools.package-data] in pyproject.toml).
  3. Bump version in pyproject.toml and append a section to CHANGELOG.md.
  4. Build and upload:
pip install build twine
python -m build
twine check dist/*
twine upload dist/*

Naming reminder: the PyPI / pip distribution is CommonRouterBench; the only shipped import top-level package is main. Avoid shadowing main in small throwaway scripts (e.g. do not name your module main.py next to snippets that import main).

License

Apache-2.0 (see LICENSE and pyproject.toml). Third-party benchmark data may carry separate licenses.
