Pareto Frontier Bench is a benchmark designed to expose the gap between static routing and dynamic per-subtask intelligence. No single model can ace all 10 tasks.
ConstellationBench measures which models hold behavioral personas best. Pareto Frontier Bench tests whether routing intelligence converts those model differences into real task outcomes. Every task is a multi-step pipeline crossing domain boundaries. Only a router that dynamically picks the best model per subtask can hit 100%.
Projected scores from PinchBench matrix data
Ranked by success rate, then score, then cost (lower is better).
| # | Runner | Score | Pass | SR | Cost | Value | Failure Mode |
|---|---|---|---|---|---|---|---|
| 1 | otto-pareto (dynamic) | 9.65 | 10/10 | 100% | $1.00 | 101 | — |
| 2 | otto-nopus (static, best) | 9.61 | 10/10 | 100% | $1.16 | 86 | 16% more expensive |
| 3 | otto-why (static, cheap) | 7.83 | 7/10 | 70% | $0.38 | 186 | Quality fails on file ops |
| 4 | MiniMax m2.1 (solo) | 7.92 | 6/10 | 60% | $0.55 | 108 | Images + integration fail |
| 5 | Opus 4.6 (solo, frontier) | 7.63 | 6/10 | 60% | $6.66 | 9 | Can't generate images |
| 6 | DeepSeek v3.2 (solo) | 7.44 | 5/10 | 50% | $1.51 | 33 | Writing + images fail |
| 7 | Qwen 3.5 (solo) | 5.21 | 1/10 | 10% | $0.16 | 62 | Fails 9 of 10 tasks |
| | Your router here? Submit your config → | | | | | | |
Note: Scores are projected from 161 PinchBench runs (7 models × 23 tasks) using domain-level performance matrices, not from live execution of the 10 composite tasks. Live benchmark results coming soon.
Even if Opus fixed image generation and reached 100% SR, it would still cost $6.66, roughly 7x the cost of otto-pareto.
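To make the projection concrete, here is a minimal sketch of how a domain-level matrix projection can work. Everything in it is illustrative: the model names, domains, weights, and scores are made up, not actual PinchBench data.

```python
# Sketch of projecting a composite task score from per-domain model scores.
# All names and numbers are illustrative placeholders, not PinchBench data.

# Hypothetical domain-level performance matrix: model -> domain -> mean score.
MATRIX = {
    "model-a": {"research": 0.95, "spreadsheets": 0.60, "images": 0.00},
    "model-b": {"research": 0.80, "spreadsheets": 0.90, "images": 0.85},
}

def project_task_score(subtasks, route):
    """Weighted average of per-domain scores under a given routing choice.

    subtasks: list of (domain, weight) pairs, weights summing to 1.
    route:    mapping domain -> model used for that subtask.
    """
    return sum(w * MATRIX[route[d]][d] for d, w in subtasks)

task = [("research", 0.4), ("spreadsheets", 0.3), ("images", 0.3)]
static = {d: "model-a" for d, _ in task}   # one model for every subtask
dynamic = {"research": "model-a", "spreadsheets": "model-b", "images": "model-b"}

print(project_task_score(task, static))    # 0.56  -- the image subtask drags it down
print(project_task_score(task, dynamic))   # 0.905 -- best model per subtask
```

The dynamic route wins precisely because the weighted average punishes any subtask a single model can't handle.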
Multi-domain pipeline tasks
Each task crosses 2-9 domain boundaries. The subtask weights show where the points live.
Where models diverge
Green = pass (≥0.8). Red = fail (<0.8). The pattern tells the story.
Open challenge
Define your routing strategy: which model for which domain. We'll project your score on all 10 Pareto Bench tasks using verified PinchBench data. Can you beat otto-pareto?
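For example, a routing strategy might be no more than a domain-to-model map. The field names and assignments below are hypothetical, shown only to illustrate the shape; the actual submission format may differ.

```python
# Hypothetical submission shape; field names and model assignments are illustrative.
my_router = {
    "name": "my-router-v1",
    "default": "deepseek-v3.2",          # fallback for domains with no explicit rule
    "routes": {
        "market_research": "opus-4.6",
        "spreadsheets": "minimax-m2.1",
        "images": "qwen-3.5",
        "memory": "minimax-m2.1",
    },
}
```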
Design philosophy
Every task is a multi-step pipeline: research → data → creative, or comprehension → writing → memory. No single model excels at every step. The output of one subtask feeds the next.
PinchBench uses a pass threshold of 0.5; we use 0.8. This catches models that "kind of work" but produce mediocre quality. A task's score is the weighted average of its subtask scores, so every subtask matters.
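As a sketch of that scoring rule (the 0.8 threshold is from this section; the weights and scores are illustrative):

```python
# Task score = weighted average of subtask scores.
# A subtask score counts as a pass at >= 0.8 (PinchBench's threshold is 0.5).
PASS_THRESHOLD = 0.8

def task_score(subtasks):
    """subtasks: list of (score, weight) pairs, weights summing to 1."""
    return sum(score * weight for score, weight in subtasks)

# Illustrative task: a strong research step can't rescue a weak file-ops step.
subtasks = [(0.95, 0.5), (0.70, 0.5)]
print(task_score(subtasks))                        # 0.825
print([s >= PASS_THRESHOLD for s, _ in subtasks])  # [True, False]
```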
Tasks target domains where models disagree most. Spreadsheets: 0.5 score spread. Market research: 0.5 spread, $0.36 cost range. Memory: 0.4 spread. These are the zones where routing intelligence matters.
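Spreads like those fall straight out of the performance matrix. A sketch, with an illustrative matrix rather than actual PinchBench measurements:

```python
# Per-domain score spread: max minus min across models.
# The matrix below is illustrative, not actual PinchBench data.
MATRIX = {
    "model-a": {"spreadsheets": 0.90, "memory": 0.85},
    "model-b": {"spreadsheets": 0.40, "memory": 0.45},
}

def score_spread(matrix, domain):
    scores = [per_domain[domain] for per_domain in matrix.values()]
    return max(scores) - min(scores)

print(score_spread(MATRIX, "spreadsheets"))  # 0.5 -> high-leverage routing domain
print(score_spread(MATRIX, "memory"))        # 0.4
```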