CommonRouterBench

Routing supervision gold labels for workloads that actually burn tokens.

Chinese version: README.zh.md

CommonRouterBench provides per-step LLM routing gold labels for scenarios where routing actually matters: multi-step agents, long-context RAG, and vibecoding loops. Instead of relying on flawed LLM-as-judge scores or single-shot questions, our labels are generated by analyzing successful traces and performing a sequential downgrade search to find the cheapest capability tier that still passes the rigorous task checks.

If you are building routers for high-consumption, multi-turn scenarios, you can use this dataset to evaluate your router or train tier predictors using realistic prefixes (conversation + tool outputs + diffs).

Quick Start

1. Install the package

# From this directory (editable, for development):
pip install -e .

# Or from PyPI (once published):
pip install CommonRouterBench

(Note: The PyPI distribution is CommonRouterBench, but you import main in your code.)

2. Evaluate a router

from main.eval import FunctionPredictor, run_question_bank_eval

# Example: an oracle that always predicts the gold tier
oracle = FunctionPredictor(lambda row: row["target_tier_id"])
summary = run_question_bank_eval(oracle, n=20, seed=1)
# Headline: four component scores plus their arithmetic mean.
print(summary["scores_v2"])

About this distribution

  • What is shipped: This directory is the only part of the repository intended for open source. Published artifacts ship the main package, the data/ question bank, and documentation. Private test harnesses are excluded.
  • Version: 0.1.0 (see CHANGELOG.md).
  • Dependencies: The core package depends on requests (HTTP helpers in main.router_llm), tiktoken (fallback token counting), and tokenizers (HuggingFace — native vendor tokenizers for main.tokenizer).
  • Local tests: You may keep a tests/ directory beside pyproject.toml for pytest; it is .gitignored.

Project layout

| Path | Purpose |
| --- | --- |
| main/ | Published Python package (import main). |
| data/ | question_bank.jsonl and manifest.json (bundled in wheels when present at build time). |

Regenerating data/ from private benchmark exports is out of scope for the published package; keep your own merge tooling outside this tree if needed.

Data layout

Artifacts under data/:

  • data/question_bank.jsonl — all routing-step records in one file (no per-benchmark subdirectories).
  • data/manifest.json — per-source line counts and schema metadata.

Each line includes a string field benchmark (e.g. swebench, mtrag) for filtering.
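For instance, the bank can be read with the standard library alone. A minimal sketch, assuming you run it from a checkout with data/ present (the packaged helpers under "Python API" below are the supported entry points):

import json

with open("data/question_bank.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

# Filter on the benchmark field described above.
swebench_rows = [r for r in rows if r["benchmark"] == "swebench"]
print(len(rows), len(swebench_rows))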

Open corpus: tier-only targets (no model IDs)

The question bank does not include fields such as optimal_model or baseline_model. Supervision is only by capability tier, using English labels and a numeric id.

| target_tier (string) | target_tier_id (int) | Tier (plain English) |
| --- | --- | --- |
| low | 0 | Low |
| mid | 1 | Medium |
| mid_high | 2 | Medium-high |
| high | 3 | High |

Chinese tier label equivalents for the English target_tier strings are summarized in README.zh.md.

Each line includes at least: id, benchmark, scenario, instance_id, step_index, total_steps, messages, target_tier, target_tier_id.
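Schematically, one line looks like the Python dict below. The field names come from the schema above; every value is invented for illustration (including step_index / total_steps), so treat it as a shape reference only:

row = {
    "id": "...",                    # unique row identifier
    "benchmark": "swebench",        # one of: swebench, bfcl, mtrag, qmsum, pinchbench
    "scenario": "...",              # scenario tag
    "instance_id": "...",           # groups rows into one trajectory
    "step_index": 2,                # position within the trajectory (illustrative)
    "total_steps": 7,               # trajectory length (illustrative)
    "messages": [{"role": "user", "content": "..."}],  # conversation prefix
    "target_tier": "mid_high",      # gold capability tier, string label
    "target_tier_id": 2,            # gold capability tier, numeric id
}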

Data distribution

The following counts match the data/question_bank.jsonl and data/manifest.json shipped in this repository (970 routing-step rows). Rebuilding the bank from private exports may change these figures.

For BFCL, the public corpus now includes both single-turn and multi-turn routing supervision rows.

Rows by benchmark

| benchmark | Rows | Share of bank |
| --- | --- | --- |
| swebench | 336 | 34.6% |
| bfcl | 248 | 25.6% |
| mtrag | 193 | 19.9% |
| qmsum | 145 | 14.9% |
| pinchbench | 48 | 4.9% |
| Total | 970 | 100% |

Gold target_tier (full bank)

| target_tier | target_tier_id | Rows | Share |
| --- | --- | --- | --- |
| low | 0 | 689 | 71.0% |
| mid | 1 | 62 | 6.4% |
| mid_high | 2 | 49 | 5.1% |
| high | 3 | 170 | 17.5% |
| Total | | 970 | 100% |

Gold target_tier by benchmark (row counts)

| benchmark | Rows | low | mid | mid_high | high |
| --- | --- | --- | --- | --- | --- |
| bfcl | 248 | 239 | 8 | 1 | 0 |
| mtrag | 193 | 183 | 8 | 1 | 1 |
| pinchbench | 48 | 41 | 3 | 3 | 1 |
| qmsum | 145 | 132 | 10 | 3 | 0 |
| swebench | 336 | 94 | 33 | 41 | 168 |

Nominal pricing (USD per 1M tokens)

Authoritative values live in main.pricing: TIER_OUTPUT_USD_PER_1M, TIER_INPUT_USD_PER_1M, TIER_CACHE_READ_USD_PER_1M, and TIER_CACHE_WRITE_USD_PER_1M. Legacy section_11 / step_nominal_cost_usd use output pricing only (TIER_OUTPUT_USD_PER_1M). The tables below mirror the shipped constants.

Output (completion) tokens

| Public target_tier | USD / 1M output tokens |
| --- | --- |
| low | 0.5 |
| mid | 2.0 |
| mid_high | 5.0 |
| high | 25.0 |

Input, cache read, and cache write (used by router_accounting only)

| Public target_tier | USD / 1M input | USD / 1M cache read | USD / 1M cache write |
| --- | --- | --- | --- |
| low | 0.26 | 0.13 | 0.26 |
| mid | 0.30 | 0.059 | 0.30 |
| mid_high | 0.50 | 0.05 | 0.08333 |
| high | 5.0 | 0.50 | 6.25 |

For tiers without a published cache-write price (low, mid), we conservatively assume cache write = base input price.

When computing costs from concrete model endpoints inside your harness, this library maps known model ids to these tiers and raises ValueError on unknown ids. That mapping lives in code only, not in the open JSONL.
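For example, a sketch of the legacy output-only step bill. This assumes TIER_OUTPUT_USD_PER_1M maps the public target_tier string to the USD price; check main.pricing for the exact shape of the shipped constants:

from main.pricing import TIER_OUTPUT_USD_PER_1M

def step_output_cost_usd(tier: str, output_tokens: int) -> float:
    # cost = tokens × (USD per 1M output tokens) / 1e6
    return output_tokens * TIER_OUTPUT_USD_PER_1M[tier] / 1_000_000

# One million output tokens at tier "high" bills the full $25.00 from the table above.
print(step_output_cost_usd("high", 1_000_000))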

Benchmark usage: wiring predictors and scoring

Each line in data/question_bank.jsonl is one routing supervision step: a conversation prefix (messages) and a gold capability tier (target_tier / target_tier_id). Any router you plug in must produce a tier id in {0,1,2,3} for that step. The library scores predictions against gold using the rules below.

Sampling

  • Full bank (run_question_bank_eval(..., n=None)): every row, in file order (~970 steps in the public build).
  • Fixed size, stratified by source — pass --n N (CLI) or n=N (API): largest-remainder quotas by data/manifest.json sources.*.line_count, then one-pass reservoir sampling per benchmark stratum (--seed fixes RNG). This keeps the five logical benchmarks (swebench, pinchbench, mtrag, qmsum, bfcl) in roughly the same ratio as the full corpus.

Report sample_mode, benchmark_counts, and by_benchmark from the eval JSON so others can reproduce your split.
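The shipped quota helper is manifest_proportional_quotas (see "Python API" below); the standalone sketch here only illustrates the largest-remainder step described above and is not the library's implementation:

def largest_remainder_quotas(line_counts: dict, n: int) -> dict:
    total = sum(line_counts.values())
    exact = {b: n * c / total for b, c in line_counts.items()}
    quotas = {b: int(e) for b, e in exact.items()}  # floor allocation first
    leftover = n - sum(quotas.values())
    # Hand the remaining slots to the largest fractional remainders.
    for b in sorted(exact, key=lambda k: exact[k] - quotas[k], reverse=True)[:leftover]:
        quotas[b] += 1
    return quotas

print(largest_remainder_quotas(
    {"swebench": 336, "bfcl": 248, "mtrag": 193, "qmsum": 145, "pinchbench": 48},
    n=20,
))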

OpenAI-compatible chat hook (single-digit tier output)

For teams that choose to call a chat model behind an OpenAI-compatible HTTP API, this package exposes a digit-tier contract via OpenAICompatRouterClassifier and LlmDigitClassifierPredictor. That is a reference integration only—not a recommendation that LLM-based routing is preferable to rules, classical ML, or other designs.

The contract is:

  1. Linearize the row’s messages into one user string (question_bank_messages_to_classifier_prompt).
  2. Send one chat completion per row; the assistant message must be parseable as a single digit 0–3 (optional surrounding whitespace; no extra lines or prose — see parse_tier_response_to_id).
  3. Call run_question_bank_eval / evaluate_question_bank_rows (from main.eval) in your own driver to load rows, call the predictor, and aggregate the JSON summary.
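The shipped parser is parse_tier_response_to_id; the standalone sketch below only illustrates the accepted shape (one digit 0–3 with optional surrounding whitespace, nothing else):

def parse_digit_tier(text: str) -> int:
    s = text.strip()
    if len(s) == 1 and s in "0123":
        return int(s)
    raise ValueError(f"expected a single digit 0-3, got {text!r}")

assert parse_digit_tier(" 2\n") == 2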

Arbitrary predictors (rules, sklearn, etc.)

Implement a function f(row: dict) -> int that returns target_tier_id in 0..3 from the raw row (you may ignore messages or engineer features from them). Wrap it with FunctionPredictor and pass it to run_question_bank_eval or evaluate_question_bank_rows. No HTTP and no chat template are required; the same JSON summary and by_benchmark breakdown apply.
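For example, a deliberately crude length heuristic (the feature and threshold are invented for illustration and will not score well):

from main.eval import FunctionPredictor, run_question_bank_eval

def tier_by_prompt_length(row: dict) -> int:
    # Total characters across message contents as a stand-in feature.
    chars = sum(len(str(m.get("content", ""))) for m in row["messages"])
    return 3 if chars > 50_000 else 0

summary = run_question_bank_eval(
    FunctionPredictor(tier_by_prompt_length),
    predictor_label="length_heuristic",
    n=100,
    seed=1,
)
print(summary["scores_v2"])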

Scoring rules (routing-step evaluation)

These metrics are computed by main.eval.

scores_v2 (top-level field in the eval summary, computed by compute_v2_scores) is the recommended headline: four orthogonal dimension scores plus their arithmetic mean. section_11 (legacy cost_savings_score, one step per row, output-token-only nominal cost) and router_accounting (trajectory-level D = Σ(baseline − gold) with N gated on trajectory pass/fail) are still emitted for backward compatibility but are superseded by scores_v2. Neither path requires running full benchmark tasks to completion.

Headline metrics (scores_v2)

| # | Field | Denominator | Definition |
| --- | --- | --- | --- |
| 1 | case_pass_rate_percent | total rows | #{pred_tier_id >= gold_tier_id} over all rows (rows with error count as failures). |
| 2 | case_exact_match_percent | total rows | #{pred_tier_id == gold_tier_id} over all rows. |
| 3 | trajectory_pass_rate_percent | total rows | A row counts toward the numerator iff its entire trajectory passes (every step pred_tier_id >= gold_tier_id, no error). The row-weighted denominator makes this directly comparable to metric 1 and guarantees trajectory_pass_rate ≤ case_pass_rate. |
| 4 | cost_savings_score_percent | USD ratio | Full-cost savings under the trajectory-level natural-accounting user-bill model (failed trajectory = router's whole chain wasted + one full always-high re-run of the chain; see "Cost savings formula" below). Macro-weighted across benchmarks by total row count. Range (−∞, 100], normally [0, 100]. |
| 5 | combined_score_percent | n/a | Arithmetic mean of 1–4; NaN if any component is NaN. |

Cost savings formula (metric 4)

All gold tiers are included. The underlying physical model is a trajectory-level natural-accounting user bill:

  • Passed trajectory (no error AND every step pred_tier_id >= gold_tier_id): user bill = Σ pred_cost. Savings vs always-high = Σ (baseline − pred).
  • Failed trajectory (any error OR any step pred_tier_id < gold_tier_id): the router's entire chain is wasted and the whole trajectory has to be re-run with always-high. User bill = Σ pred_cost (router's full original chain) + Σ baseline_cost (one full-high retry of the whole chain). Savings vs always-high reduce to exactly −Σ pred_cost.

Per evaluable step of benchmark b, using the same full four-category cost model as router_accounting (step_full_cost_usd, see "Nominal pricing"), accumulation is driven by trajectory-level pass/fail:

D_b  += baseline_cost                                  # baseline = always-high step bill
if trajectory_passed:                                  # no error AND every step pred >= gold
    N_b += baseline_cost - pred_cost                   # credit this step's savings
else:                                                  # trajectory failed — no step gets credit
    N_b -= pred_cost                                   # implicit: the Σ baseline full-high retry
                                                       # is already covered by D = Σ baseline

Note: inside a failed trajectory, steps that individually pass (pred >= gold) are not credited with step-level savings — the whole chain has to be re-run, so individual step correctness doesn't rescue the trajectory. There is no additional -Σ baseline retry penalty on top: the one physical full-high retry is already captured by the denominator.

Across benchmarks the score is macro-weighted by total row count (same scope as metric 1):

cost_savings_score_percent = Σ_b (rows_b / total_rows) × (100 × N_b / D_b)

The scores_v2.by_benchmark.<b> block reports row_count, step_count, failed_trajectory_count, failed_retry_baseline_usd (informational: Σ baseline over failed-trajectory evaluable steps), D_usd, N_usd, cost_savings_score_percent, and weight_in_global_cost_savings for each benchmark.
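As a toy worked example (flat per-step bills, purely illustrative; real bills come from the four-category token accounting under "Token counting" below): take a two-step trajectory with gold low at every step, routed entirely to low, with a $25 always-high step bill and a $0.50 low step bill.

steps = 2
baseline, pred = 25.0, 0.50            # illustrative flat per-step bills (USD)

D = steps * baseline                   # 50.0, the always-high denominator
N_passed = steps * (baseline - pred)   # 49.0 -> 100 * 49 / 50 = 98% savings
N_failed = -steps * pred               # -1.0 -> 100 * -1 / 50 = -2% savings

In the failed case the one full-high retry ($50) is already the denominator, so the only additional penalty is the wasted routed spend.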

Legacy per-row / per-step fields (section_11)

Still emitted in the eval summary for backward compatibility. New consumers should prefer the scores_v2 table above.

| Metric | Definition |
| --- | --- |
| tier_match_accuracy | Fraction of evaluable rows (no error) where pred_tier_id == gold_tier_id. Skipped rows are excluded from the denominator. Step-level (one row = one step). |
| valid_response_rate | Fraction of rows with a usable prediction (no recorded error). |
| Pass (passed) | pred_tier_id >= gold_tier_id (predicted tier is at least as capable as gold). Rows with error are not passed. Per row / step in summaries built from evaluate_question_bank_rows. |
| pass_rate | passed / sampled over all rows (same per-step semantics as passed). |
| cost_savings_score (legacy section_11) | Baseline = always high (tier id 3). For each passed row with gold strictly below high, the nominal step cost uses output tokens only and a uniform positive completion length assumed_completion_tokens_per_routing_step (default 1_000_000) per row: cost(tier) = T × (output USD/1M for tier) / 10^6. Then save_gt = cost(high) − cost(gold) and save_test = cost(high) − cost(pred). Score = 100 × Σ save_test / Σ save_gt over passed rows with save_gt > 0. |

Relation to task-level benchmarks: A task pass rate (e.g. whether a SWE-Bench instance is resolved) needs an end-to-end harness with executed trajectories. The question-bank eval here is the routing-supervision slice: it measures whether your router’s tier choice is sufficient (pass_rate) and how much nominal money it saves versus always using the highest tier under the stated assumptions (cost_savings_score and/or router_accounting).

Trajectories (instance_id)

Rows with the same instance_id form one trajectory (multi-turn supervision). step_index / total_steps order steps within that trajectory. Single-turn rows typically have total_steps == 1 and still carry an instance_id.

For router_accounting, evaluate_question_bank_rows and external merge tools (e.g. ClawRouter score_with_crb.py) should attach instance_id, step_index, total_steps, and messages to each per_row record so costs can be computed from the same prefixes the router saw.

Token counting (main.tokenizer)

Per-step costs for router_accounting count tokens from each row’s messages using native vendor tokenizers where available. The tokenizer for each tier is loaded by _load_tier_encoder (cached per tier):

| Tier | Tokenizer | Source |
| --- | --- | --- |
| high | Anthropic native | Bundled JSON (main/tokenizer_data/anthropic_tokenizer.json) |
| mid_high | cl100k_base | Gemini has no offline tokenizer; tiktoken fallback |
| mid | MiniMax native | HuggingFace MiniMaxAI/MiniMax-Text-01 |
| low | DeepSeek native | HuggingFace deepseek-ai/DeepSeek-V3 |

If the tokenizers package is not installed, all tiers fall back to tiktoken cl100k_base. Per-message overhead (+4 tokens per message, +2 priming) is applied uniformly to approximate chat-format bookkeeping costs.
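A sketch of the fallback path only (native vendor tokenizers replace cl100k_base when tokenizers is installed, and list-of-blocks content is simplified to str() here; the shipped logic lives behind _load_tier_encoder):

import tiktoken

def approx_prompt_tokens(messages: list) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    total = 2  # +2 priming overhead
    for m in messages:
        # +4 per-message overhead, applied uniformly
        total += 4 + len(enc.encode(str(m.get("content", ""))))
    return total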

  • Semantic prefix check: consecutive-step messages are compared on role, content (string or list-of-blocks; cache_control inside blocks is ignored), tool_calls, tool_call_id, and name, so harmless serialization differences from upstream log export do not break cache accounting.
  • Prompt split (per path: baseline / gold / pred): baseline tier is always high. A cold start (first step, tier switch, cache TTL exceeded, or prefix mismatch) bills the full prompt as cache write. If the tier is unchanged, the cache has not expired, and the previous messages are a semantic prefix of the current ones, the prefix is cache read and the delta is cache write.
  • Cache TTL: if the same tier was last called more than 3 global steps ago, the cache is considered expired and a full cache-write is triggered. This models realistic prompt-cache expiry in multi-step agent traces where steps may be interspersed with other tiers.
  • Output tokens: for step i with a following step, estimated from messages delta (assistant role only, including tool_calls JSON), using that step’s gold tier tokenizer. The last step in a trajectory uses the trajectory’s average of those estimates when available, else fallback_output_tokens (see router_accounting JSON field).
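Sketched as a function, the per-path split rules above reduce to roughly the following (illustrative pseudocode of the documented rules, not the library's internals; token counts come from the tier tokenizer described above):

def split_prompt_tokens(prompt_tokens: int, prefix_tokens: int, cold_start: bool):
    """Return (cache_read_tokens, cache_write_tokens) for one step on one path."""
    if cold_start:
        # First step, tier switch, cache TTL exceeded, or prefix mismatch:
        # the full prompt is billed as cache write.
        return 0, prompt_tokens
    # Warm cache: the previous messages are a semantic prefix of the current ones,
    # so the prefix is billed as cache read and only the delta as cache write.
    return prefix_tokens, prompt_tokens - prefix_tokens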

Legacy trajectory-level fields (router_accounting)

Still emitted in the eval summary for backward compatibility. Superseded by scores_v2 (which keeps trajectory-level pass/fail but switches to D = Σ baseline and uses a trajectory-level natural-accounting numerator where a failed trajectory loses all step-level savings credit — an implicit one-time full-high retry of the whole chain, no additional penalty coefficient).

Computed in compute_router_accounting_metrics (main.eval.section11). Steps with error are excluded from evaluable_step_count and from D_usd / N_usd (they never enter the per-step cost loop). Any trajectory that contains error on at least one step is failed for pass_rate_percent and exact_match_rate_percent (those steps still set has_error and clear trajectory pass/exact flags).

Trajectory pass: no step has error, and every evaluable step satisfies pred_tier_id >= gold_tier_id.

Trajectory exact: trajectory pass and every evaluable step has pred_tier_id == gold_tier_id.

Savings numerator N_usd and denominator D_usd: summed over evaluable steps only (steps without error, with int pred_tier_id / gold_tier_id). For each such step, baseline_cost, gold_cost, and pred_cost are the full four-category USD costs at the respective tiers (see Nominal pricing). D_usd += baseline_cost − gold_cost. If the trajectory passes, N_usd += baseline_cost − pred_cost on each evaluable step; if the trajectory fails (any error or any pred < gold), N_usd -= pred_cost on every evaluable step in that trajectory (all predicted routing spend counts against N).

| Field | Definition |
| --- | --- |
| total_trajectories | Count of distinct instance_id groups in the scored row list. |
| passed_trajectories / exact_match_trajectories | Trajectory-level pass / all-step exact counts. |
| evaluable_step_count | Steps without error with int tier ids (contribute to D_usd / N_usd). |
| skipped_step_count | Rows with error. |
| D_usd | Σ (baseline_cost − gold_cost) over evaluable steps. |
| N_usd | As above (pass vs fail trajectory rules). |
| pass_rate_percent | 100 × passed_trajectories / total_trajectories. NaN if no trajectories. |
| exact_match_rate_percent | 100 × exact_match_trajectories / total_trajectories. Not the same as top-level tier_match_accuracy (which remains the step-level exact rate). NaN if no trajectories. |
| accounting_savings_score_percent | 100 × N_usd / D_usd when D_usd > 0; NaN if D_usd == 0 or there are no trajectories. |
| overall_score_percent | Mean of pass_rate_percent, exact_match_rate_percent, and accounting_savings_score_percent; NaN if any component is NaN. |
| fallback_output_tokens | Constant used when output tokens cannot be inferred from messages deltas. |

Top-level tier_match_accuracy (0–1) and accuracy_excluding_errors remain step-level exact-match rates (same value). by_benchmark / exact_match counts are also step-level (rows with match).

Python API

from main import iter_question_bank, iter_routing_supervision

# Full bank (single file data/question_bank.jsonl)
for row in iter_question_bank():
    ...

# Only rows whose benchmark field is "swebench"
for row in iter_routing_supervision("swebench"):
    messages = row["messages"]
    tier = row["target_tier"]
    tier_id = row["target_tier_id"]

from main.metrics import CaseMetrics, aggregate_routerbench_metrics

cases = [
    CaseMetrics(
        case_id="a",
        task_passed=True,
        baseline_cost_nominal=1.0,
        optimal_cost_nominal=0.4,
        test_cost_nominal=0.5,
    ),
]
summary = aggregate_routerbench_metrics(cases)

from main.metrics import routing_supervision_accuracy

# gold_rows: question-bank rows; predictions_by_id: mapping from row id to predicted tier id
acc = routing_supervision_accuracy(gold_rows, predictions_by_id)

Router LLM API (OpenAI-compatible chat completions)

OpenAICompatRouterClassifier sends one case per request: system (plain string, or Anthropic-style cached block list when system_prompt_cache is on / auto+Claude) plus one user message whose content is a string (your full case text). The model must reply with exactly one character 0–3 (target_tier_id: low→0, mid→1, mid_high→2, high→3). Responses containing newlines or extra text raise ValueError on parse.

from main import OpenAICompatRouterClassifier, question_bank_messages_to_classifier_prompt

clf = OpenAICompatRouterClassifier(
    base_url="https://api.example.com/v1",
    api_key="...",
    model="deepseek/deepseek-v3.2",
    system_prompt_cache="auto",
)
prompt = question_bank_messages_to_classifier_prompt(row["messages"])
result = clf.predict_tier_id(prompt)
assert result.tier_id in {0, 1, 2, 3}  # compare result.tier_id with row["target_tier_id"] for gold agreement

Lower-level helpers: parse_tier_response_to_id, build_system_content, post_chat_completions, chat_completions_url. Default instructions live in DEFAULT_ROUTER_SYSTEM_INSTRUCTION.

Question-bank evaluation (main.eval)

Programmatic entry point for sampling, scoring, and pluggable predictors (FunctionPredictor, LlmDigitClassifierPredictor, or any QuestionBankRouterPredictor). See Benchmark usage and Scoring rules for semantics.

Implement QuestionBankRouterPredictor (method predict(row) -> TierPrediction) or use:

  • FunctionPredictor: wraps any callable(row: dict) -> int (heuristics, sklearn predict, etc.); no chat prompt.
  • LlmDigitClassifierPredictor: optional OpenAI-compat chat wrapper around OpenAICompatRouterClassifier and question_bank_messages_to_classifier_prompt.

from main.eval import (
    FunctionPredictor,
    LlmDigitClassifierPredictor,
    run_question_bank_eval,
    evaluate_question_bank_rows,
    build_eval_summary,
    select_question_bank_rows,
)

# Rules / sklearn-style: tier_id only from the row (example: always use gold — not a real model)
oracle = FunctionPredictor(lambda row: row["target_tier_id"])
rows, sample_mode, quotas = select_question_bank_rows(n=20, seed=1)
per_row, errors, correct = evaluate_question_bank_rows(
    oracle, rows, predictor_label="oracle_gold"
)
summary = build_eval_summary(
    per_row=per_row,
    errors=errors,
    correct=correct,
    predictor_label="oracle_gold",
    shard="data/question_bank.jsonl",
    sample_mode=sample_mode,
    seed=1,
    proportional_quotas=quotas,
)

# One-shot (loads bank from package data paths):
# summary = run_question_bank_eval(oracle, predictor_label="oracle_gold", n=20, seed=1)

Public helpers also include manifest_proportional_quotas, proportional_reservoir_sample, load_all_question_bank_rows, compute_section11, compute_router_accounting_metrics, compute_v2_scores, and aggregate_by_benchmark.

CLI

python -m main.cli metrics --cases path/to/cases.json
CommonRouterBench metrics --cases path/to/cases.json

When calling OpenAICompatRouterClassifier from your application, configure the API with environment variables or your own config layer. .env.example lists common variable names (OPENROUTER_* or OPENAI_* / API_KEY + BASE_URL); the client expects a base URL that already includes /v1.
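For instance (the variable names below are illustrative; map them to whatever your .env.example defines, e.g. the OPENROUTER_* or OPENAI_* families):

import os
from main import OpenAICompatRouterClassifier

clf = OpenAICompatRouterClassifier(
    base_url=os.environ["BASE_URL"],  # must already include /v1
    api_key=os.environ["API_KEY"],
    model=os.environ.get("MODEL", "deepseek/deepseek-v3.2"),
)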

Publishing (maintainers)

  1. Add [project.urls] to pyproject.toml (Homepage, Repository, etc.) before uploading to PyPI so the project page links resolve.
  2. Ensure data/question_bank.jsonl and data/manifest.json exist if you want them inside the built wheel (see [tool.setuptools.package-data] in pyproject.toml).
  3. Bump version in pyproject.toml and append a section to CHANGELOG.md.
  4. Build and upload:
pip install build twine
python -m build
twine check dist/*
twine upload dist/*

Naming reminder: the PyPI / pip distribution is CommonRouterBench; the only shipped import top-level package is main. Avoid shadowing main in small throwaway scripts (e.g. do not name your module main.py next to snippets that import main).

License

Apache-2.0 (see LICENSE and pyproject.toml). Third-party benchmark data may carry separate licenses.
