Pareto Frontier Bench is a benchmark designed to expose the gap between static routing and dynamic per-subtask intelligence. No single model can ace all 10 tasks.
ConstellationBench measures which models hold behavioral personas best. Pareto Frontier Bench tests whether routing intelligence converts those model differences into real task outcomes. Every task is a multi-step pipeline crossing domain boundaries. Only a router that dynamically picks the best model per subtask can hit 100%.
Projected scores from PinchBench matrix data
Ranked by success rate, then score, then cost (lower is better).
| # | Runner | Score | Pass | SR | Cost | Value | Failure Mode |
|---|---|---|---|---|---|---|---|
| 1 | otto-pareto (dynamic) | 9.65 | 10/10 | 100% | $1.00 | 101 | — |
| 2 | otto-nopus (static, best) | 9.61 | 10/10 | 100% | $1.16 | 86 | 16% more expensive |
| 3 | otto-why (static, cheap) | 7.83 | 7/10 | 70% | $0.38 | 186 | Quality fails on file ops |
| 4 | MiniMax m2.1 (solo) | 7.92 | 6/10 | 60% | $0.55 | 108 | Images + integration fail |
| 5 | Opus 4.6 (solo, frontier) | 7.63 | 6/10 | 60% | $6.66 | 9 | Can't generate images |
| 6 | DeepSeek v3.2 (solo) | 7.44 | 5/10 | 50% | $1.51 | 33 | Writing + images fail |
| 7 | Qwen 3.5 (solo) | 5.21 | 1/10 | 10% | $0.16 | 62 | Fails 9 of 10 tasks |
| | Your router here? Submit your config → | | | | | | |
Note: Scores are projected from 161 PinchBench runs (7 models × 23 tasks) using domain-level performance matrices, not from live execution of the 10 composite tasks. Live benchmark results coming soon.
Even if Opus fixed image generation and reached 100% SR, it would still cost $6.66, roughly 7x the cost of otto-pareto.
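To make the projection concrete, here is a minimal sketch of how a domain-level matrix projection can work. Everything in it is illustrative: the model names, domains, weights, and scores are made up, not actual PinchBench data.

```python
# Sketch of projecting a composite task score from per-domain model scores.
# All names and numbers are illustrative placeholders, not PinchBench data.

# Hypothetical domain-level performance matrix: model -> domain -> mean score.
MATRIX = {
    "model-a": {"research": 0.95, "spreadsheets": 0.60, "images": 0.00},
    "model-b": {"research": 0.80, "spreadsheets": 0.90, "images": 0.85},
}

def project_task_score(subtasks, route):
    """Weighted average of per-domain scores under a given routing choice.

    subtasks: list of (domain, weight) pairs, weights summing to 1.
    route:    mapping domain -> model used for that subtask.
    """
    return sum(w * MATRIX[route[d]][d] for d, w in subtasks)

task = [("research", 0.4), ("spreadsheets", 0.3), ("images", 0.3)]
static = {d: "model-a" for d, _ in task}   # one model for every subtask
dynamic = {"research": "model-a", "spreadsheets": "model-b", "images": "model-b"}

print(project_task_score(task, static))    # 0.56  -- the image subtask drags it down
print(project_task_score(task, dynamic))   # 0.905 -- best model per subtask
```

The dynamic route wins precisely because the weighted average punishes any subtask a single model can't handle.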
Multi-domain pipeline tasks
Each task crosses 2-9 domain boundaries. The subtask weights show where the points live.
Where models diverge
Green = pass (≥0.8). Red = fail (<0.8). The pattern tells the story.
Open challenge
Define your routing strategy: which model for which domain. We'll project your score on all 10 Pareto Bench tasks using verified PinchBench data. Can you beat otto-pareto?
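For example, a routing strategy might be no more than a domain-to-model map. The field names and assignments below are hypothetical, shown only to illustrate the shape; the actual submission format may differ.

```python
# Hypothetical submission shape; field names and model assignments are illustrative.
my_router = {
    "name": "my-router-v1",
    "default": "deepseek-v3.2",          # fallback for domains with no explicit rule
    "routes": {
        "market_research": "opus-4.6",
        "spreadsheets": "minimax-m2.1",
        "images": "qwen-3.5",
        "memory": "minimax-m2.1",
    },
}
```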
Design philosophy
Every task is a multi-step pipeline: research → data → creative, or comprehension → writing → memory. No single model excels at every step. The output of one subtask feeds the next.
PinchBench uses a pass threshold of 0.5; we use 0.8. This catches models that "kind of work" but produce mediocre quality. A task's score is the weighted average of its subtask scores, so every subtask matters.
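As a sketch of that scoring rule (the 0.8 threshold is from this section; the weights and scores are illustrative):

```python
# Task score = weighted average of subtask scores.
# A subtask score counts as a pass at >= 0.8 (PinchBench's threshold is 0.5).
PASS_THRESHOLD = 0.8

def task_score(subtasks):
    """subtasks: list of (score, weight) pairs, weights summing to 1."""
    return sum(score * weight for score, weight in subtasks)

# Illustrative task: a strong research step can't rescue a weak file-ops step.
subtasks = [(0.95, 0.5), (0.70, 0.5)]
print(task_score(subtasks))                        # 0.825
print([s >= PASS_THRESHOLD for s, _ in subtasks])  # [True, False]
```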
Tasks target domains where models disagree most. Spreadsheets: 0.5 score spread. Market research: 0.5 spread, $0.36 cost range. Memory: 0.4 spread. These are the zones where routing intelligence matters.
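Spreads like those fall straight out of the performance matrix. A sketch, with an illustrative matrix rather than actual PinchBench measurements:

```python
# Per-domain score spread: max minus min across models.
# The matrix below is illustrative, not actual PinchBench data.
MATRIX = {
    "model-a": {"spreadsheets": 0.90, "memory": 0.85},
    "model-b": {"spreadsheets": 0.40, "memory": 0.45},
}

def score_spread(matrix, domain):
    scores = [per_domain[domain] for per_domain in matrix.values()]
    return max(scores) - min(scores)

print(score_spread(MATRIX, "spreadsheets"))  # 0.5 -> high-leverage routing domain
print(score_spread(MATRIX, "memory"))        # 0.4
```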