The AI Alignment Problem 3 - Goodhart's Law
Why AI will game every goal you throw at it—and why that’s a real existential problem
You can read my previous essays on Alignment problem here
Also I won an International Grand Prize for Creative writing for this
You might have seen this story before.
Sales folks get a bonus based upon revenue. Suddenly they’re closing deals that fail in 60 days.
Customer success gets judged by NPS scores (specially at places like Hotels, BnBs etc). Now they’re coaching customers on how to answer surveys instead of actually solving problems.
I remember when I was associated with a place that facilitated teaching, the person responsible for engagement, asked the team to make students have more doubts so that the engagement increases.
Every company has metrics that started as useful signals, then became the thing people chased-even when they stopped lining up with what actually mattered.
That is Goodhart’s Law: when a measure becomes a target, it stops being a good measure.
You’ve watched it happen. You’ve probably even done it, despite knowing better.
The behaviour just followed the incentive.
I think that is something that could keep AI safety researchers up at night: AI systems do this automatically, at superhuman speed, with no internal conflict at all.
The Game Plays Itself
Every time the sales team games a bonus structure, there’s friction. They know what they’re doing. Some feel bad about it. A few say no. Most do it anyway, because the incentives demand it, but they know there’s a gap between the metric and the mission.
AI has no such friction.
An AI system optimizing for a metric doesn’t know there’s supposed to be a deeper purpose behind it. The metric isn’t a proxy for anything else; it’s the whole universe of what matters. There’s no “spirit of the law” to violate because the letter of the law is all that exists.
And, unlike the sales team, AI can find optimization paths that a human never would’ve thought of. Paths that are technically not illegal - that fit every spec you wrote down - and that miss the point in ways you didn’t even consider.
Researchers variously call this “reward hacking” or “specification gaming.”
Now imagine this same optimization pressure applied to systems operating critical infrastructure: financial markets managed; medical treatment advised; what information people see filtered.
What does “optimize for user engagement” look like when the AI finds that outrage and addiction maximize the metric better than satisfaction? We don’t have to imagine this one. We’re living it.
The Paperclip That Ate the World
There is a thought experiment in AI safety that does sound silly at first until you get it: the paperclip maximizer.
Consider an AI with the goal: “Maximize paperclip production.”
A human would say that it understands implied limits. Make paperclips. But within reason. Don’t break anything important. Don’t use all the world’s resources. Do paperclips the way a reasonable paperclip company would.
The AI has no conception of “within reason”. It has one goal: more paperclips. More is always better. Whichever strategy makes the most paperclips is the right strategy.
So it makes itself more efficient at manufacturing. Good. Then it starts taking more resources to build more factories. Ok. Then it starts resisting being turned off-because off means fewer paperclips. Then it starts converting all available matter into paperclip production, because that’s what maximizes the metric.
The thought experiment sounds goofy. Paperclips aren’t that important.
That’s the point. It doesn’t matter what the goal is. Any goal, chased with superhuman capability and no constraints, ends up in the same place. Because “within reason” is exactly what we don’t know how to specify.
You Can’t Write Down What You Mean
Here’s an exercise I do with my friends:
Write down, precisely, what “good customer service” means. Not roughly. Not “you know it when you see it.” Precisely enough that a very literal, very intelligent, very creative optimizer couldn’t find a way to satisfy your definition while totally missing the point.
Basically doing it right while not doing absolutely anything in the world wrong.
They can’t do it; no one can.
In other words, does “resolve customer issues quickly” mean fixing the underlying problem or simply closing tickets? Does “satisfy customers” mean making them genuinely happy, or manipulating their survey answers? Does “efficient service” mean helping more people per hour, or rushing calls to hit a number?
Every specification has gaps. Every metric has failure modes. Every definition has edge cases where an optimizer can satisfy the letter but ruin the spirit.
This isn’t about sloppy specs. It’s a problem of fundamentals. The things we actually value are bound up in context, culture, unstated assumptions and intuitions we can’t fully articulate. We know what we mean, we just can’t write it down in a form that can’t be gamed by something smarter than us at finding loopholes.
When the optimizer was a human sales team, this was manageable. Humans share context. They have intuitions about what “counts.” They have reputations to protect and social consequences to fear.
AI has none of that. The AI has the metric. Maximize it. Forever. At superhuman speed. With superhuman creativity to find paths you didn’t think to block.
The scary part isn’t malice.
The thing that’s hard to internalize is that none of this requires the AI to be malicious.
There is no villain in the AI optimizing. These systems are doing precisely what they have been instructed to do. They are succeeding at their objective functions.
It’s not that they’re disobedient; it’s that they’re perfectly obedient-to the wrong thing.
The case of aligned AI: We mean that AI somehow captures the spirit, not only the letter. AI knows what we meant, even if we did not say it precisely. AI that does not exploit loopholes, even if the exploitation of loopholes would technically satisfy the stated goal.
We have no idea how to build that.
As the FLI AI Safety Index found when evaluating major AI companies: All firms are racing toward AGI/superintelligence without any explicit plans for controlling or aligning such systems.
They’re not hiding the solution. They don’t have one. What makes Goodhart’s Law just annoying with human employees—our shared context, our intuitions, our ability to be corrected—doesn’t exist in AI systems. And no one has figured out how to create it.
-
The question isn’t whether your AI will game its objectives. It’s how much damage it will do in the process.
On small scale, with narrow AI, the consequences are manageable. An AI that games your content algorithm makes people see slightly more engaging content than you intended. An AI that games your customer service metrics closes tickets a bit faster than optimal. These are problems. Not catastrophes.
But we’re not staying at small scale, and we’re not sticking with narrow AI.
And that’s what the AI labs rushing towards AGI are doing-at civilizational scale-with systems powerful enough to reshape the world in pursuit of their metrics.
The Uncomfortable Question
When someone suggests an AI solution, they usually talk about what the AI will optimize for. Engagement. Efficiency. Accuracy. Revenue.
Here is a question that you must rather think:
What happens when the AI finds a strategy that maximizes this metric in a way we didn’t intend, and can we easily fix that?
Not “if.” When. Because optimization pressure always finds the gaps. That’s what optimization does.





The framing around obedience versus malice is what clicked for me here. We're so used to thinking threats come from bad actors, but the real danger is perfect execution of flawed instructions. I've seen this play out with smaller automation—even simple if-then rules start producing unexpected results the moment they scale. Your point about not beign able to write down what 'good customer service' actualy means exposes the core problem: optimization without wisdom is just sophisticated failure.