10 composite pipelines · 9 domains · 0.8 pass threshold

Can your router beat ours?

Pareto Frontier Bench is a benchmark designed to expose the gap between static routing and dynamic per-subtask intelligence. No single model can ace all 10 tasks.

ConstellationBench measures which models hold behavioral personas best. Pareto Frontier Bench tests whether routing intelligence converts those model differences into real task outcomes. Every task is a multi-step pipeline crossing domain boundaries. Only a router that dynamically picks the best model per subtask can hit 100%.

Submit Your Config · See the Leaderboard
10 Pipeline Tasks · 9 Domains Covered · 0.8 Pass Threshold · 33 Total Subtasks

Projected scores from PinchBench matrix data

Leaderboard

Ranked by success rate, then score, then cost (lower is better).

# | Runner                   | Score | Pass  | SR   | Cost  | Value | Failure Mode
1 | otto-pareto (DYNAMIC)     | 9.65  | 10/10 | 100% | $1.00 | 101   |
2 | otto-nopus (static best)  | 9.61  | 10/10 | 100% | $1.16 | 86    | 16% more expensive
3 | otto-why (static cheap)   | 7.83  | 7/10  | 70%  | $0.38 | 186   | Quality fails on file ops
4 | Opus 4.6 Solo (FRONTIER)  | 7.63  | 6/10  | 60%  | $6.66 | 9     | Can't generate images
5 | MiniMax m2.1 Solo         | 7.92  | 6/10  | 60%  | $0.55 | 108   | Images + integration fail
6 | DeepSeek v3.2 Solo        | 7.44  | 5/10  | 50%  | $1.51 | 33    | Writing + images fail
7 | Qwen 3.5 Solo             | 5.21  | 1/10  | 10%  | $0.16 | 62    | Fails 9 of 10 tasks
Your router here? Submit your config →

Note: Scores are projected from 161 PinchBench runs (7 models × 23 tasks) using domain-level performance matrices, not from live execution of the 10 composite tasks. Live benchmark results coming soon.
Even if Opus fixes image gen → 100% SR, but still $6.66 (roughly 7× the cost of otto-pareto).
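For intuition, here is a minimal sketch of how a projection like this can work. Every model name, weight, and score below is invented for illustration; it is not the actual PinchBench matrix or scoring code.

```python
# Illustrative projection sketch. All scores, weights, and model names
# below are made up; they are not the real PinchBench matrices.
PASS_THRESHOLD = 0.8

# domain -> model -> mean normalized score from prior single-domain runs
domain_scores = {
    "research": {"opus": 0.95, "minimax": 0.80, "qwen": 0.55},
    "data":     {"opus": 0.90, "minimax": 0.85, "qwen": 0.60},
    "creative": {"opus": 0.20, "minimax": 0.75, "qwen": 0.40},
}

# A composite task is a weighted list of (domain, weight) subtasks.
task = [("research", 0.4), ("data", 0.3), ("creative", 0.3)]

# A routing config maps each domain to the model that handles it.
routing = {"research": "opus", "data": "minimax", "creative": "minimax"}

def project_task_score(task, routing, domain_scores):
    """Projected task score = weighted average of projected subtask scores."""
    total_weight = sum(w for _, w in task)
    return sum(domain_scores[d][routing[d]] * w for d, w in task) / total_weight

score = project_task_score(task, routing, domain_scores)
print(f"projected score: {score:.2f}, pass: {score >= PASS_THRESHOLD}")
```

A solo runner is just the degenerate config that routes every domain to the same model, which is why a single weak domain (image generation, say) can sink an otherwise strong frontier model.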

Multi-domain pipeline tasks

The 10 Tasks

Each task crosses 2-9 domain boundaries. The subtask weights show where the points live.

Where models diverge

Runner × Task Heatmap

Green = pass (≥0.8). Red = fail. The pattern tells the story.

Open challenge

Submit Your Router Config

Define your routing strategy: which model for which domain. We'll project your score on all 10 Pareto Bench tasks using verified PinchBench data. Can you beat otto-pareto?

Your Routing Config (YAML)
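Something along these lines; the exact submission schema isn't documented here, so the field names and model identifiers below are placeholders. This Python sketch emits the YAML you would paste into the form:

```python
# Placeholder sketch of a per-domain routing config. The real submission
# schema may differ; field names and model identifiers here are invented.
import yaml  # pip install pyyaml

routing_config = {
    "name": "my-router",
    "strategy": "per-domain",
    "routes": {
        "research":      "opus-4.6",
        "data":          "deepseek-v3.2",
        "creative":      "minimax-m2.1",
        "comprehension": "minimax-m2.1",
        "writing":       "opus-4.6",
        "memory":        "deepseek-v3.2",
        "file_ops":      "qwen-3.5",
        "integration":   "minimax-m2.1",
        "coding":        "deepseek-v3.2",
    },
}

print(yaml.safe_dump(routing_config, sort_keys=False))
```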

Design philosophy

How It Works

01 — COMPOSITE PIPELINES

Tasks cross domains

Every task is a multi-step pipeline: research → data → creative, or comprehension → writing → memory. No single model excels at every step. The output of one subtask feeds the next.
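A rough sketch of that shape, assuming a per-domain routing table; `call_model`, the model names, and the prompt are stand-ins, not a real API:

```python
# Sketch of a composite pipeline: each subtask is routed to the model
# configured for its domain, and its output becomes the next step's input.
# `call_model`, the model names, and the prompt are placeholders.
def call_model(model: str, domain: str, prompt: str) -> str:
    return f"[{model} output for {domain}]"  # stand-in for a real inference call

pipeline = ["research", "data", "creative"]  # e.g. research -> data -> creative
routing = {"research": "opus-4.6", "data": "deepseek-v3.2", "creative": "minimax-m2.1"}

context = "Build a one-page market report with a chart and a hero image."
for domain in pipeline:
    context = call_model(routing[domain], domain, context)
print(context)
```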

02 — HIGH PASS BAR

0.8 threshold

PinchBench uses 0.5. We use 0.8. This catches models that "kind of work" but produce mediocre quality. A task score is the weighted average of subtask scores. Every subtask matters.
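A toy example with invented scores shows why the higher bar matters:

```python
# Toy example, scores invented: three equally weighted subtasks.
subtask_scores = [0.70, 0.65, 0.75]                      # a "kind of works" run
task_score = sum(subtask_scores) / len(subtask_scores)   # 0.70
print(task_score >= 0.5)   # True  -> clears a 0.5 bar
print(task_score >= 0.8)   # False -> fails the 0.8 bar used here
```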

03 — VOLATILITY ZONES

Exploit disagreement

Tasks target domains where models disagree most. Spreadsheets: 0.5 score spread. Market research: 0.5 spread, $0.36 cost range. Memory: 0.4 spread. These are the zones where routing intelligence matters.
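A spread is just the gap between the best and worst model in a domain. A quick sketch, with invented per-model numbers chosen only so the resulting spreads echo the figures above:

```python
# Illustrative spread calculation over a domain x model score matrix.
# Per-model numbers are invented; only the resulting spreads echo the text.
domain_scores = {
    "spreadsheets":    {"opus": 0.95, "minimax": 0.70, "qwen": 0.45},
    "market_research": {"opus": 0.90, "minimax": 0.85, "qwen": 0.40},
    "memory":          {"opus": 0.85, "minimax": 0.90, "qwen": 0.50},
}

spreads = {
    domain: max(scores.values()) - min(scores.values())
    for domain, scores in domain_scores.items()
}
for domain, spread in sorted(spreads.items(), key=lambda kv: -kv[1]):
    print(f"{domain:16} spread = {spread:.2f}")
```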

The 9 Domains

Research: Web search, market analysis, stock data
Data: Spreadsheets, CSV, statistics
Creative: Image gen, visual design
Comprehension: PDF summary, email triage, analysis
Writing: Blog, email, humanized copy
Memory: Knowledge base, session recall
File Ops: Calendar, file management, search
Integration: Workflow, daily summary, deploy
Coding: API scripts, Flask apps, automation