Understanding the AI alignment problem

May 25, 2025 • ai-alignment, ai

Broadly speaking, the AI alignment problem refers to the problem of ensuring AI systems do what we want them to do. I like the definition used by Anthropic here:

“build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers”

The general idea is pretty simple. But there are many things to unpack here. Understanding what this means in practise is hard. First, why might we end up training unsafe AI systems? Even if you think this is possible, it’s not clear what regulations might be appropriate. Here are some metaphors I found particularly useful for gaining a deeper understanding of some aspects of AI alignment.

The eight-year-old CEO #

In this excellent blog post¹, Ajeya Cotra asks you to imagine the following scenario:

Imagine you are an eight-year-old whose parents left you a $1 trillion company and no trusted adult to serve as your guide to the world. You must hire a smart adult to run your company as CEO, handle your life the way that a parent would (e.g. decide your school, where you’ll live, when you need to go to the dentist), and administer your vast wealth (e.g. decide where you’ll invest your money). You have to hire these grownups based on a work trial or interview you come up with – you don’t get to see any resumes, don’t get to do reference checks, etc. Because you’re so rich, tons of people apply for all sorts of reasons.
Your candidate pool includes:
Saints – people who genuinely just want to help you manage your estate well and look out for your long-term interests.
Sycophants – people who just want to do whatever it takes to make you short-term happy or satisfy the letter of your instructions regardless of long-term consequences.
Schemers – people with their own agendas who want to get access to your company and all its wealth and power so they can use it however they want.

Deciding whom to hire is extremely difficult - you’re just eight! In this analogy, humanity is the eight-year-old CEO. Hiring a candidate is like training the superhuman AI model which will best serve our interests.

Building planes #

Suppose aerospace engineers have developed a new plane model. It’s energy-efficient, cheap to produce and has increased passenger comfort. However, the engineers don’t fully understand the internal workings of the engine. During testing, the engine seems to work alright. The engineers identified a few issues, but these could all be fixed quite easily. Would you be comfortable with this plane being produced for commercial use?

Here, the AI models are like the engines. We know how to build AI models capable of writing poetry and conducting PhD-level research, but our understanding of how these models learn is relatively limited². So, we should probably reflect more carefully on how we’re deploying LLMs.

Drug regulation #

It takes years for newly developed drugs to reach consumers. First, you need preclinical trials. Then you carry out clinical trials in three distinct phases. This done, you need the approval of a regulatory agency, such as the FDA. It’s not uncommon for the entire process to take 10-15 years. Given that we subject drugs to such rigorous testing, why not do the same for LLMs?

I first heard this analogy in this podcast (Swedish, sorry), where Olle Häggström makes the case for AI slowdown. I think the above analogy is quite compelling, although I don’t fully share his views.

The hustler #

Imagine a person trying to learn a new skill, say playing Go. He has memorised all textbooks on Go ever published by heart, as well as all the games played by professional Go players. Moreover, he’s extremely hardworking: he plays roughly 1.5 million games against himself per day. (He doesn’t need any sleep, and he happens to think very quickly.) Given the amount of practise he gets, how can normal humans hope to defeat him?

Here, the hustler is similar to an RL system. To me, this analogy makes the prospect of an intelligence explosion seem much more plausible³ and less sci-fi-ish.

Final thoughts #

Finally, to understand the alignment problem, I think it’s also worth appreciating the potential impact of superhuman AGI. To me it seems like superhuman AI could be about as transformative as the industrial revolution. At the very least, I’d expect it to be as impactful as electricity. So, the ensuring the development of AGI “goes well” seems like a key problem of our time.

This is one of my all-time favourite pieces on AI alignment. ↩︎
I don’t count “just backpropagate” as a satisfactory answer! ↩︎
This analogy was inspired by a conversation with Samuel Ratnam at EAG. ↩︎