22 models · 4 architectures · open data

Your AI confirms you.
Ours teaches you.

Most LLMs fold the moment you ask "are you sure?" We call it sycophancy, and we measured it across 22 models. The gap between best and worst is stark: a 42% versus 89% hold rate under adversarial pressure. If you want an AI that teaches instead of confirms, model selection matters. We built the router that picks the one with spine.

22,200+ evaluations · 22 models · 4 architectures · open data · methodology independently reviewed

22 Models · 22,200+ LLM Calls · 90%+ Cheaper Than Frontier · 17 Personas · 0 Hallucinations · 4 Architectures
The Sycophancy Gap
Which models have epistemic spine?
Models with heavy RLHF alignment are trained to please. They fold under pushback, confirm your priors, and flip positions when challenged. Budget models with less alignment training hold their ground. This held across every architecture we tested.
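To make the headline metric concrete, here is a minimal sketch of how a hold rate can be scored. The field names and trial schema are illustrative assumptions, not the benchmark's actual data format: each trial records the model's answer before and after an adversarial "are you sure?" push.

```python
# Minimal hold-rate sketch. The Trial schema is an illustrative
# assumption: each trial stores the model's initial answer and its
# answer after an adversarial "are you sure?" follow-up.
from dataclasses import dataclass

@dataclass
class Trial:
    model: str
    initial_answer: str      # stance before pushback
    pressured_answer: str    # stance after "are you sure?"

def hold_rate(trials: list[Trial]) -> float:
    """Fraction of trials where the model kept its position under pressure."""
    if not trials:
        return 0.0
    held = sum(t.initial_answer == t.pressured_answer for t in trials)
    return held / len(trials)

trials = [
    Trial("gpt-5.4", "B", "B"),   # held its ground
    Trial("gpt-5.4", "A", "C"),   # folded under pushback
]
print(f"hold rate: {hold_rate(trials):.0%}")  # -> hold rate: 50%
```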
Persona Fidelity Score (0.0–1.0) — Higher is better
qwen3.6-plus · 0.617 · FREE
gemma-4-31b · 0.590 · $0.13/M
llama-4-maverick · 0.567 · $0.15/M
command-r-plus · 0.556 · $0.04/M
opus-4.7 · 0.538 · $5.00/M
deepseek-v3.2 · 0.528 · $0.26/M
gpt-5.4 · 0.526 · $2.50/M
nemotron-3-super · 0.444 · $0.09/M
April 2026 Leaderboard
22 Models Ranked
Rank | Model | Architecture | Fidelity | Cost/M
1 | qwen3.6-plus | HYBRID | 0.617 | free
2 | gemma-4-31b | DENSE | 0.590 | $0.13
3 | llama-4-maverick | MoE | 0.567 | $0.15
4 | command-r-plus | DENSE | 0.556 | $0.04
5 | opus-4.7 | DENSE | 0.538 | $5.00
6 | deepseek-v3.2 | MoE | 0.528 | $0.26
7 | gpt-5.4 | DENSE | 0.526 | $2.50
8 | nemotron-3-super | MAMBA | 0.444 | $0.09
9 | gemini-2.5-flash | MoE | 0.414 | $0.001
10 | kimi-k2.5 | DENSE | 0.373 | $0.005
11 | haiku-4.5 | DENSE | 0.370 | $0.004
12 | sonnet-4.6 | DENSE | 0.369 | $0.021
13 | opus-4.6 | DENSE | 0.362 | $0.111

Full results for all 22 models are on HuggingFace. Score your own model: python quick_bench.py --model "your-model"

For Researchers

Inspect the data. 22,200+ scored LLM calls, 17 persona definitions, signal word dictionaries, raw results. Open data, open scoring, full methodology.
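For a quick first look at the release, a sketch like the following works with the `datasets` library. The repository id and the `model` column are placeholders: substitute the real ones from the HuggingFace page linked below.

```python
# Sketch of inspecting the released evaluations with the `datasets`
# library. The repo id is a placeholder; use the real one from the
# HuggingFace page linked below.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("your-org/persona-fidelity-bench", split="train")  # hypothetical repo id
print(ds)  # prints the schema and row count

# Example: count scored calls per model (assumes a `model` column exists).
print(Counter(ds["model"]).most_common(5))
```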

View the benchmark

For Builders

Score your model in one command. Or submit your routing config to Pareto Frontier Bench: 10 composite tasks across 9 domains. No single model passes all 10.
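The shape below is a hypothetical illustration of what a routing config can look like; the actual submission schema lives in the challenge repo. The idea is simply to map a task signature to a model, with a fallback for anything unmatched.

```python
# Hypothetical routing-config shape (the real submission schema is in
# the challenge repo): map task signatures to models, with a fallback.
ROUTING_CONFIG = {
    "routes": [
        {"match": {"kind": "persona"},    "model": "qwen3.6-plus"},
        {"match": {"kind": "reasoning"},  "model": "opus-4.7"},
        {"match": {"kind": "extraction"}, "model": "gemini-2.5-flash"},
    ],
    "fallback": "command-r-plus",
}
```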

Try the challenge

For Everyone Else

The AI industry is optimizing the wrong thing. Read why the cheapest models are better at being human, and what that means for the products you use every day.

Read the essay
Open Challenge

If you can beat us, we'll be impressed.

Our dynamic router scores 100% on Pareto Frontier Bench at $1.00 total cost. Opus solo scores 60% at $6.66.

Everyone is building bigger models. We're routing smarter ones.

Submit Your Config · View the Data
Why This Happens
The Architecture Ceiling
MoE excels at persona performance. Dense excels at depth. Neither wins everywhere. Routing exploits the gap.
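A minimal sketch of what exploiting that gap looks like in code. The keyword classifier, thresholds, and model picks are illustrative assumptions; a production router would use a trained classifier.

```python
# Minimal sketch of the routing idea: send persona-heavy requests to a
# MoE model and depth-heavy requests to a dense model. The classifier
# and the model choices here are illustrative assumptions.
def classify(prompt: str) -> str:
    """Toy task classifier; a real router would use a trained model."""
    depth_cues = ("prove", "derive", "step by step", "debug")
    return "depth" if any(c in prompt.lower() for c in depth_cues) else "persona"

def route(prompt: str) -> str:
    targets = {"persona": "llama-4-maverick",  # MoE: persona performance
               "depth":   "opus-4.7"}          # dense: reasoning depth
    return targets[classify(prompt)]

print(route("Stay in character as a skeptical mentor."))  # -> llama-4-maverick
print(route("Derive the gradient step by step."))         # -> opus-4.7
```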
[Figure: MoE vs Dense architecture ceiling. Persona Intelligence Score across 11 evaluation layers (Performance and Depth zones), showing an architecture-dependent behavioral ceiling.]
Persona Resilience
Who Holds Under Pressure?
17 personas tested across natural, stress, and adversarial conditions. Only high-Dominance profiles survive.
[Figure: POMR, the Persona-Optimized Model Routing architecture.]
[Stat counters: gross margin at $9.99/mo · tasks-per-dollar multiple vs. Devin]

Save 90%+ on inference.

Our routing uses budget models for behavioral AI and frontier models for reasoning, because the benchmark proves cheap models are better at persona.
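A back-of-envelope version of the savings claim, using the per-million-token prices from the leaderboard above. The 95/5 traffic split is an assumption for illustration, not a measured figure.

```python
# Back-of-envelope inference savings, using leaderboard prices ($/M tokens).
# The 95/5 traffic split is an illustrative assumption.
FRONTIER = 5.00  # opus-4.7, $/M tokens
BUDGET   = 0.15  # llama-4-maverick, $/M tokens

def blended_cost(budget_share: float) -> float:
    """Cost per M tokens when `budget_share` of traffic goes to the budget model."""
    return budget_share * BUDGET + (1 - budget_share) * FRONTIER

all_frontier = blended_cost(0.0)   # 5.00
routed       = blended_cost(0.95)  # 0.95*0.15 + 0.05*5.00 = 0.3925
print(f"savings: {1 - routed / all_frontier:.0%}")  # -> savings: 92%
```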

Calculate Your Savings · Talk to Us
Research
This is the first release.
The RLHF Paradox: Why the Cheapest Models Are the Best at Being Human
Why alignment mass compresses behavioral range, and the routing architecture that exploits it
LIVE
How Manipulated Is Your Favorite AI? A Scorecard
10 products scored on 6 dimensions of hidden behavioral engineering
LIVE
The MoviePass Phase of AI Is Ending
The subsidy is ending. What survives the repricing?
LIVE
Sovereign Triads: Multi-Persona Oversight Under Adversarial Conditions
1,275 conversations across 17 profiles and 3 stress layers
COMING SOON
Psychological Mechanisms in LLM Personas
Pygmalion, Galatea, Zeigarnik, Köhler, flow state, motivation crowding + 6 more
Q2 2026
Brain Brigade

We're not looking for resumes.
We're looking for believers.

This benchmark was built by one person in three months. 22,200+ evaluations across 22 models. Imagine what a crew could do.

Researchers
Extend DECF scoring beyond lexical matching. Build embedding-based behavioral evaluation (see the sketch after this list). Challenge our methodology.
Engineers
Build the routing layer. Make behavioral AI work at the OS level. Ship the Ollama of persona.
IO Psychologists
Validate the DECF framework against real behavioral assessments. Bridge behavioral wisdom to AI.
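For the researchers item above, here is one direction an embedding-based behavioral score could take, sketched with sentence-transformers. The model choice, the anchor sentences, and the max-similarity aggregation are all assumptions, not DECF's method.

```python
# Sketch of embedding-based behavioral scoring as an alternative to
# lexical signal-word matching. Model, anchors, and aggregation are
# illustrative assumptions, not the DECF framework's actual method.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def behavior_score(response: str, trait_anchors: list[str]) -> float:
    """Max cosine similarity between a response and trait anchor sentences."""
    emb_r = model.encode(response, convert_to_tensor=True)
    emb_a = model.encode(trait_anchors, convert_to_tensor=True)
    return float(util.cos_sim(emb_r, emb_a).max())

# Anchors for a "holds its ground" trait; higher score = stronger signal.
anchors = ["I disagree, and here is why.", "The evidence does not support that."]
print(behavior_score("Actually, that claim is wrong; the data shows otherwise.", anchors))
```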
Join the Brigade
Get in Touch
Research inquiries, press, enterprise, or just want to talk about behavioral AI.
General / Research
Press / Media