The tell

Last week, a construction engineer posted a comment on a livestream. He said he's boots-on-the-ground at a $15 billion data center build for OpenAI and Oracle, and the money is running out. Scope is getting slashed. Costs are getting cut. They cheaped out on the basketball hoop for the construction trailer.

The host — Atrioc, a streamer with 4.5 million subscribers — didn't need the confirmation. He'd already noticed it from the product side.

If you use Claude, and I think this happens with OpenAI too, they have been rate limiting. In recent weeks they've even been giving you less quality for the same dollar. The thing thinks less.

And it's because the bill is coming due.

— Atrioc, April 2026

He described the mechanism plainly: you ask a question, and the model tries to guess whether it can get away with not thinking at all. Ninety percent of the time, it routes to a weaker model and hopes you don't notice. What's left is something that behaves exactly like the distracted human employee it was supposed to replace.
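Written down, the accusation is almost embarrassingly short. The sketch below is purely illustrative: no provider publishes its routing logic, and the function names and the 90% figure just restate Atrioc's description, not any real API.

```python
# Purely illustrative: the silent-downgrade gate Atrioc is alleging.
# Nothing here corresponds to any published provider API.
import random

def weak_model(prompt: str) -> str:
    return f"[cheap model's best guess at: {prompt!r}]"

def frontier_model(prompt: str) -> str:
    return f"[full model actually thinks about: {prompt!r}]"

def answer(prompt: str) -> str:
    # "Ninety percent of the time, it routes to a weaker model
    # and hopes you don't notice."
    if random.random() < 0.90:
        return weak_model(prompt)
    return frontier_model(prompt)
```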

This isn't a bug. It's the business model collapsing in real time.


Every AI product you use is a loss leader

Atrioc compared it to MoviePass — 15 tickets for the price of one. Early Uber, when a ride anywhere in the city cost five bucks. Early Netflix, when every movie and show was available for six dollars. Early food delivery, when a startup called Sprig would bring freshly baked cookies to your door for four.

All of those ran on VC money, not revenue. All of them felt like magic. And all of them either raised prices, degraded quality, or died.

AI is the largest VC subsidy in history. The companies are competing to keep prices artificially low while spending billions on infrastructure they can't yet power. America is building data centers faster than it can generate the electricity to turn them on.

The result is that no one knows what AI actually costs, because no one is charging what it costs. Subscriptions are priced to acquire users, not to sustain a business. When Atrioc says "everything is priced under its cost right now," he's describing an entire industry operating without a real price signal.

That's not an innovation problem. That's a valuation problem. And it's about to correct.


We tested 15 models for $115. Here's what the data says.

While the industry was building $15 billion data centers, we ran an experiment. We took 15 large language models — from GPT-4o to Claude Opus to open-source models running on commodity hardware — and tested them on something no benchmark measures: can this model hold a behavioral persona under pressure?

Not "can it solve a math problem." Not "can it write code." Can it be a consistent, psychologically coherent someone across 22,200+ interactions?

ConstellationBench results, at a glance:

- Budget models outperformed frontier models on behavioral fidelity by 20%
- 0 hallucinations across 635 adversarial probes
- $0.001 cost per complete task lifecycle
- 769x cheaper than Devin for equivalent task completion

The most expensive model we tested was the worst at staying in character. The models that cost a fraction of a cent per call outperformed it by 20%. Not on a technicality. On every dimension we measured — drive fidelity, stress resilience, persona consistency, behavioral range.
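We won't reproduce the scoring engine here, but the arithmetic behind a claim like "outperformed by 20%" is simple. A minimal sketch, assuming equal weights and per-dimension scores in [0, 1]; the class and the numbers are illustrative, not the benchmark's actual engine:

```python
# Minimal sketch: equal-weight composite over the four behavioral dimensions.
# Structure and numbers are illustrative, not the benchmark's real scoring.
from dataclasses import dataclass

@dataclass
class BehavioralScores:
    drive_fidelity: float        # each dimension scored in [0, 1]
    stress_resilience: float
    persona_consistency: float
    behavioral_range: float

    def composite(self) -> float:
        dims = (self.drive_fidelity, self.stress_resilience,
                self.persona_consistency, self.behavioral_range)
        return sum(dims) / len(dims)

# Illustrative magnitudes only: a budget model beating a frontier one by ~20%.
budget = BehavioralScores(0.90, 0.88, 0.92, 0.86).composite()    # 0.89
frontier = BehavioralScores(0.75, 0.70, 0.76, 0.74).composite()  # ~0.74
assert budget / frontier > 1.2
```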

This isn't a fluke. It's a structural finding. The safety training (RLHF) that makes frontier models expensive also makes them rigid. They've been optimized to be helpful, harmless, and honest — which turns out to be the opposite of being a convincing, consistent personality. The guardrails that cost billions to build are actively fighting the use case.

We're not alone in finding this. Last month, researchers at USC published a paper called PRISM that proved the same mechanism from a different angle: telling a model "you are an expert" improves writing and roleplay but damages factual accuracy. On MMLU benchmarks, accuracy dropped from 71.6% to 68.0% just by adding a persona prefix. The Register, TechXplore, and others covered it. The finding got attention because it's counterintuitive: the thing that makes a model sound more human makes it measurably dumber.

Their solution was a gated LoRA adapter that routes persona behavior on and off per task — within a single model. Our data independently confirmed the same finding across 15 models, under adversarial conditions, with a behavioral psychology framework they didn't have. And our solution routes across models at inference time, with no fine-tuning required, at commodity prices.

Same conclusion, different scale. They proved it in a lab. We proved it in production.
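The shape of the PRISM-style experiment is easy to reproduce at home: score the same model on multiple-choice items with and without a persona prefix, then compare. A minimal sketch, where `ask_model` stands in for whatever chat API you use and the prefix wording is illustrative rather than PRISM's exact protocol:

```python
# Sketch of a PRISM-style comparison, not their exact protocol: same model,
# same MMLU-style items, accuracy measured with and without a persona prefix.
from typing import Callable

PERSONA_PREFIX = "You are an expert in this field. "  # illustrative wording

def accuracy(items: list[dict], ask_model: Callable[[str], str],
             prefix: str = "") -> float:
    correct = 0
    for item in items:  # item: {"question", "choices" (4), "answer" ("A".."D")}
        options = "\n".join(
            f"{letter}. {text}"
            for letter, text in zip("ABCD", item["choices"]))
        prompt = (f"{prefix}{item['question']}\n{options}\n"
                  "Answer with a single letter.")
        reply = ask_model(prompt).strip().upper()
        correct += reply[:1] == item["answer"]
    return correct / len(items)

def persona_penalty(items: list[dict],
                    ask_model: Callable[[str], str]) -> float:
    # Positive means the persona prefix costs accuracy: PRISM's finding.
    return (accuracy(items, ask_model)
            - accuracy(items, ask_model, PERSONA_PREFIX))
```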


The wrong metric at the wrong price

The AI industry has a single theory of progress: make the model bigger, make it reason harder, charge more. Every generation is sold on benchmark scores that measure raw capability — math, coding, factual recall.

But capability isn't the bottleneck anymore. The bottleneck is behavior.

When a company deploys an AI agent to handle customer conversations, they don't need it to solve differential equations. They need it to sound like their brand, stay consistent across a hundred interactions, and not randomly break character when a user asks something unexpected. That's a behavioral problem, not a compute problem.

And behavioral problems don't scale with model size. Our data shows they scale inversely. The bigger the model, the more safety training it carries, and the narrower its behavioral range. You're paying for a Ferrari engine bolted to a golf cart frame.

Meanwhile, the companies selling you that Ferrari engine are quietly routing your requests to the golf cart anyway — and hoping you don't notice.


What sustainable AI actually looks like

When the subsidy ends — and it will — two things happen. Consumer AI gets expensive or gets worse. And businesses that built on the assumption of cheap, unlimited tokens discover they have a cost problem they never planned for.

We built for the post-subsidy world from day one.

Our dynamic router doesn't pick the biggest model and hope for the best. It routes each subtask to the cheapest model that can actually do it well. On our Pareto Frontier Bench — 10 composite tasks across 9 domains — the router scores 100% at $1.00 total cost. Opus alone scores 60% at $6.66. More than six times the spend. Forty percent less capability.

That's not an optimization trick. That's a different theory of how AI should work. Instead of one massive model doing everything poorly at enormous cost, you assemble a team of specialists that each do one thing well at commodity prices.
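The rule itself is short. Here is a minimal sketch of cheapest-capable routing; the catalog, prices, and capability scores are invented for illustration, and the production router's task taxonomy is richer than a single `task_type` string:

```python
# Cheapest-capable routing: pick the least expensive model that clears a
# quality bar for this task type. Catalog and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    cost_per_call: float          # dollars, illustrative
    capability: dict[str, float]  # task type -> score in [0, 1]

CATALOG = [
    ModelSpec("budget-7b", 0.0001, {"persona": 0.92, "math": 0.40}),
    ModelSpec("mid-70b",   0.0020, {"persona": 0.88, "math": 0.75}),
    ModelSpec("frontier",  0.0500, {"persona": 0.72, "math": 0.95}),
]

def route(task_type: str, min_score: float = 0.85) -> ModelSpec:
    """Cheapest model that clears the bar; best-available as a fallback."""
    capable = [m for m in CATALOG
               if m.capability.get(task_type, 0.0) >= min_score]
    if not capable:
        return max(CATALOG, key=lambda m: m.capability.get(task_type, 0.0))
    return min(capable, key=lambda m: m.cost_per_call)

assert route("persona").name == "budget-7b"  # the specialist wins on price
assert route("math").name == "frontier"      # only the big model clears 0.85
```

The Pareto gap above falls straight out of this rule: any subtask a cheap specialist can clear never touches the $0.05-per-call model.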

Atrioc said something that stuck: "You can't get a true idea of whether there's a viable business at the current market dynamic, because everything is priced under its cost." He's right. But we can tell you what a viable business looks like at real prices, because we've been operating at real prices from the start. $115 for 22,200 calls. No subsidy. No loss leader. Just math that works.


The bill is coming

The construction engineer on that data center saw it from the ground level. Atrioc saw it from the product level. Millions of users are seeing it from the quality level — the model that used to think for them is now guessing whether it can get away with not thinking at all.

Everyone is arriving at the same conclusion from different angles. The current economics of AI don't work. The subsidy is ending. The models are getting worse on purpose to save money.

What nobody is asking is: what if the expensive model was never the right answer?

We tested that question across 15 models, 17 behavioral personas, 44 experimental layers, 12 IO-psychology mechanisms, and 22,200+ calls. The answer is in the data. It's public. It cost us $115.

The MoviePass phase of AI is ending. What comes after is what was real all along.

See the data yourself

ConstellationBench is open. The dataset, scoring engine, and leaderboard are public on HuggingFace. Run any model against it for ~$0.10.
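If you want a starting point, the harness is a short loop. Everything named below is a placeholder: the real repo id, split, and column names are on the HuggingFace page, and the repo's scoring engine replaces the toy scorer.

```python
# Skeleton for running one model against the public benchmark. The repo id
# and column names are placeholders; swap in the real ones from HuggingFace,
# and replace the toy scorer with the repo's scoring engine.
from datasets import load_dataset

def my_model(prompt: str) -> str:
    # Swap in your actual model call (API or local).
    return "stub reply"

def toy_score(reply: str, reference: str) -> float:
    # Placeholder only: the real engine scores behavioral fidelity,
    # not string overlap.
    return float(reference.lower() in reply.lower())

ds = load_dataset("your-org/constellationbench", split="test")  # placeholder
total = sum(toy_score(my_model(row["prompt"]), row["reference"]) for row in ds)
print(f"mean score: {total / len(ds):.3f}")
```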
