Reasons for optimism in technical AI safety

ai, ai-alignment

I’ve been doing some technical AI safety research this fall. Technical AI safety is about ensuring AI systems do what they were intended to do, and falls under the broader project of making AI ‘go well’.

If you do technical AI safety research, you spend all day imagining safety failures and, what is more, you spend all day around AI safety-pilled colleagues. This almost certainly affects your judgement in some way or other – and this is scary. In particular, it’s easy to slip into hedgehog thinking1, becoming overly pessimistic and fixating on details. Perhaps one could even speak of ‘AI safety biases’ to denote any such distortions in one’s judgement.

However, when I reflect on the state of technical AI safety, rather than just reacting to information in my day-to-day, I find myself thinking that things are going pretty well. Here are a few reasons for optimism with respect to challenges in technical AI safety.

Safety is a priority

First of all, the three leading AI labs, OpenAI, Anthropic and Google DeepMind, are all taking AI safety seriously: Anthropic and Google DeepMind both have large teams working exclusively on safety research, while OpenAI is open to collaborating with external parties on safety research2. Though there’s room for improvement, these labs acknowledge that safety should be a priority – and this shouldn’t be taken for granted.

There are strong financial incentives to roll out new models as quickly as possible, and yet state-of-the-art models often undergo long periods of safety testing. For example, OpenAI spent half a year aligning GPT-4, arguably the model that gave us the biggest leap in capabilities to date, before its public release in March 2023.

It’s also worth meditating on the rapid change in public discourse about AI risk. Just a few years ago, discussions about AI safety were largely confined to rationalist and effective altruism forums online; nowadays, AI lab CEOs are happy to speak of existential risks from power-seeking AI systems. It’s striking how quickly the views of doomers became mainstream3.

Safety is profitable

Another reason for optimism is that commercial interests often push towards safer models.

In practice, the dichotomy between capabilities and safety research is often false – a deceptive, scheming sycophant wouldn’t make for a good product. In one of my current projects, which focuses on safety failures in graphical user interface agents, we’ve been struggling to come up with tasks where a highly capable, ‘evil’ AI would output something different from a dumb AI model misunderstanding user needs4.

The apparent tension between capabilities and safety research can be traced back to the orthogonality thesis, which posits that a superintelligence could pursue malevolent goals – intelligence is ‘orthogonal’ to having good character, as it were. While I think this is a reasonable assumption, intelligence is rarely orthogonal to commercial interests; in reality, they’re often strongly correlated.

Safety is hot

Finally, technical AI safety has become extremely popular over the last few years, as highlighted in Will MacAskill’s article on effective altruism in the age of AI. Jobs in technical AI safety are well-paid, high-status and can offer a strong sense of purpose. Moreover, the day-to-day looks much like the job of an ML engineer, so technical AI safety work naturally appeals to number- and computerphiles. Today, the field isn’t bottlenecked by money or talent, but rather by mentorship – and this is the dream scenario. This is precisely what a booming field should look like.

Adjusting for the AI safety bias

In summary, I’m quite happy with the state of technical AI safety. It’s growing extremely fast, and I know many thoughtful, highly skilled people, both from industry and academia, looking to pivot to the field. If you have long AI timelines like me, you have good reason to believe that AI safety research will keep existing models in check, so we’ll succeed in building powerful agents doing what they’re intended to do. In brief, I think AI is likely to ‘go well’ – at least on the technical front5.


  1. In his 1953 essay, Isaiah Berlin introduced the analogy of the hedgehog and the fox to represent two distinct modes of thinking: the hedgehog reasons from first principles and tries to reduce complex problems to their core, while the fox acknowledges complexity and tries to aggregate different views. Hedgehog thinking is more effective for solving technical tasks. ↩︎

  2. A good example is the anti-scheming paper from OpenAI and Apollo. ↩︎

  3. See Clara Collier’s excellent book review of If Anyone Builds It, Everyone Dies. ↩︎

  4. For those with some background in AI safety, have a look at the side tasks from the SHADE-Arena paper, which tests models’ abilities to carry out ‘malicious’ side tasks in complex, long-horizon settings. A weaker AI might well execute tasks similar to the malicious side tasks by accident. ↩︎

  5. And my ‘likely’ means something like a ≥60% probability. ↩︎