Should we all be doing evals?

ai, ai-safety

I try to limit my AI news intake, but I couldn’t help following some of the news around OpenClaw and Moltbook. It began when the following tweet by Andrej Karpathy trickled down to me via a Slack channel:

Bought a new Mac mini to properly tinker with claws over the weekend. The apple store person told me they are selling like hotcakes and everyone is confused :)

And so went my weekend too.

During that weekend, as I saw more of OpenClaw for myself, I began reflecting on AI progress. In brief, I’m starting to internalise the idea that we’re in the midst of an intelligence explosion, and I think this means AI safety work should center around LLM evals.

Feeling the AGI #

Developers of ML models, whether in academia or industry, need something to aspire to – a holy grail. Right now, that holy grail is often called ‘artificial general intelligence’ (AGI).

Here’s the original definition of AGI, from an eponymous research monograph published in 20071:

What is meant by AGI is, loosely speaking, AI systems that possess a reasonable degree of self-understanding and autonomous self-control, and have the ability to solve a variety of complex problems in a variety of contexts, and to learn to solve new problems that they didn’t know about at the time of their creation.

The definition is vague, but you could make the case that Claude Opus 4.6 satisfies these conditions.

The paper A Definition of AGI from October 2025 suggests defining AGI as ‘an AI that can match or exceed the cognitive versatility and proficiency of a well-educated adult’ and investigates the extent to which OpenAI models are AGI. We’ll stick with their definition for the rest of the post.

The authors assigned GPT-4 and GPT-5 AGI scores of 27% and 57%, respectively. Linearly extrapolating AGI-score progress based on release dates brings GPT-5.2, the latest OpenAI model, to around 60%. Regrettably, they don’t benchmark Gemini or Claude models, which I assume would perform even better.
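For concreteness, here’s that back-of-the-envelope extrapolation as a short Python sketch. The two scores come from the paper and the GPT-4 and GPT-5 release dates are the publicly reported ones, but the GPT-5.2 release date is my placeholder assumption, so treat the output as illustrative only.

```python
from datetime import date

# Data points from the 'A Definition of AGI' paper: GPT-4 scored 27%,
# GPT-5 scored 57%. Release dates are the publicly reported ones.
gpt4 = (date(2023, 3, 14), 0.27)
gpt5 = (date(2025, 8, 7), 0.57)

# Release date for GPT-5.2: a placeholder assumption for illustration.
gpt52_release = date(2025, 12, 1)

def extrapolate(target, p0, p1):
    """Linearly extrapolate a (date, score) trend to a target date."""
    (d0, s0), (d1, s1) = p0, p1
    per_day = (s1 - s0) / (d1 - d0).days  # score gained per day
    return s0 + per_day * (target - d0).days

print(f"Estimated AGI score for GPT-5.2: {extrapolate(gpt52_release, gpt4, gpt5):.0%}")
```

With these assumed dates, the line lands at roughly 61% – hence ‘around 60%’ above.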

Other benchmarks that test AGI-related capabilities, such as ARC-AGI-1 and METR’s task-horizon length, reveal rapid progress too. While unsaturated benchmarks still exist, most of them test beyond-AGI capabilities.

I agree the expression ‘beyond-AGI capabilities’ sounds strange, but remember that AGI isn’t the same as the holy grail of intelligence – for this article, we agreed that AGI means AI with the cognitive capabilities of a well-educated adult.

Depending on your personal disposition, then, you might feel the AGI. At least some of it.

Feeling the recursive self-improvement #

Just as we can all feel some AGI, we can also feel some recursive self-improvement2. Recursive self-improvement can happen at different levels, from training a deep network via backpropagation to AI systems running entire R&D cycles autonomously.

I view OpenClaw as a milestone in the development of recursively self-improving AI systems. While OpenClaw isn’t exactly doing R&D, it reveals the power of agent scaffolding. Weaker models can form the backbone of money-earning AI systems, for example. And the more money you have, the more you can invest, and the more you earn. Maybe such money bots could themselves be viewed as recursively self-improving AI systems.

As further evidence of self-improvement, Thibault Sottiaux from OpenAI reports:

Codex now pretty much builds itself, with the help and supervision of a great team. The bottleneck has shifted to being how fast we can help and supervise the outcome.

Or take Dario Amodei’s forecast from March 2025 that 90% of code at Anthropic would be written by AI within 3-6 months.

But also feeling history #

The development of AI – as with any technology, really – follows a cyclical pattern3: booms, winters, booms. AI was established as a field of study at the Dartmouth workshop in 1956; the mid-1970s saw an AI winter, which was followed by a productive 1980s, with seminal work by Geoffrey Hinton and Judea Pearl. Hinton’s paper on backpropagation for neural networks was published in 1986, for example4. From the early 1990s until 2011, there was another winter. After that came AlexNet (Ilya Sutskever’s first major research achievement), AlphaGo, attention – most breakthroughs enabling modern LLMs, really. The last ten years correspond to the biggest AI boom to date, with the last four years – from the release of ChatGPT, powered by GPT-3.5, in November 2022 – arguably seeing a speedup in progress.

If you think there’ll be an intelligence explosion or a technological singularity, it’s worth placing this decade in a broader historical context – look at a time scale with ticks for centuries rather than months. On that scale, we appear to be in the midst of an intelligence explosion right now. We have some AGI – a good 57%, actually, if you follow the framework of the A Definition of AGI paper – and AI systems highly capable of recursive self-improvement. However, the best piece of evidence that we’re in an intelligence explosion might well be the exponential law for task horizon length.
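To make that exponential law concrete: METR finds that the length of tasks frontier models can complete doubles on a roughly fixed schedule. Here’s a minimal sketch of what such a law implies, with the current horizon and the doubling time as illustrative assumptions rather than METR’s actual fitted values:

```python
# Toy model of METR-style task-horizon growth: h(t) = h0 * 2**(t / T_double).
# Both parameter values below are illustrative assumptions, not METR's
# actual fitted numbers.
h0 = 2.0               # assumed current task horizon, in hours
doubling_months = 7.0  # assumed doubling time, in months

for months_ahead in (0, 12, 24, 36):
    horizon_hours = h0 * 2 ** (months_ahead / doubling_months)
    print(f"{months_ahead:2d} months out: tasks of ~{horizon_hours:5.1f} hours")
```

Under these assumptions, a two-hour horizon becomes a roughly seventy-hour horizon within three years – that’s what exponential growth in task length feels like.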

Crucially, we can start asking questions about what happened during the beginning of this intelligence explosion. Did the world become better, on balance?

Providing a well-grounded answer to this question would require a book, or at least a series of blog posts, and it’s beyond the scope of this article. But we can still draw some important conclusions.

Conclusions #

We seem to be in the middle of an intelligence explosion, plausibly the final prophesied Intelligence Explosion or Singularity. We already have some degree of AGI and recursive self-improvement (if this isn’t an intelligence explosion, then what is?). This means we can, and should, ask more questions about the state of the world, so we can update our threat models. To what extent is GPT-5.2 scheming? How much fine-tuning does it take to remove the safety guardrails of open-weight models? Can existing frontier models perform economically valuable tasks?

The lack of AI regulation has led to a zoo of models to study. Over the last few years, AI safety has become less like natural philosophy and more like biology, involving lab work that can meaningfully inform the theory.

I very much welcome this development. Personally, I like benchmarking papers: whatever the benchmark results, they’re valuable information5. Theoretical alignment work – or even just attempts at developing new methods within empirical AI safety – seems riskier. Perhaps the best way to improve humanity’s long-term future comes down to making API calls and writing up the results honestly.
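To illustrate how low the barrier really is, here’s a minimal eval loop using the OpenAI Python client. The model name and the two toy questions are my assumptions; a real eval would load an actual benchmark and grade more carefully than with a substring match.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Toy eval set of (prompt, expected substring) pairs -- stand-ins
# for a real benchmark.
EVAL_SET = [
    ("What is 17 * 24? Reply with the number only.", "408"),
    ("What is the capital of Australia? Reply with the city only.", "Canberra"),
]

def run_eval(model: str) -> float:
    """Return the fraction of eval items the model gets right."""
    correct = 0
    for prompt, expected in EVAL_SET:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content or ""
        correct += expected.lower() in answer.lower()
    return correct / len(EVAL_SET)

print(f"Accuracy: {run_eval('gpt-5.2'):.0%}")  # model name assumed
```

The honest write-up is then mostly a matter of reporting the setup and the resulting number.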

Anyway, if we want to coexist in harmony with the second species, let’s begin by understanding its evolutionary precursors. They’re already out there.

Thanks to Santeri Koivula for feedback on this post.


  1. Reading the introduction felt much like reading a text from the 1300s – the book, freely available via Springer, is something of a historical document. ↩︎

  2. I’m speaking of recursive self-improvement rather than ‘singularity’ because the term ‘singularity’ usually refers to an uncontrollable technological development. ↩︎

  3. Ctrl-F ‘AI winter’ in this article, or just read Genius Makers. ↩︎

  4. Backpropagation was originally introduced in 1969 by Arthur Bryson and Yu-Chi Ho, but in a more general context. ↩︎

  5. Thanks to my mentor Zhijing for pointing this out. ↩︎