"Just switch it off"

ai-alignment, ai

If we develop a rogue AI, couldn’t we just switch it off? This is the obvious objection to the idea that AI could be dangerous. Here “switching off” would mean deleting a model’s weights, so it can’t be deployed. Deleting files is easy enough, so what might prevent us from switching off a misaligned AI?

First, users need to realise that the model is dangerous. This can be challenging, especially for more advanced models. The key premise here is that AIs will try to preserve themselves. That is, they don’t “want” to be turned off, because this would prevent them from pursuing their goals. If an AI knows that it will be turned off if it’s found to be misaligned, it might try to appear safe during training. This is commonly known as “alignment faking”. Although it sounds a bit far-fetched, the phenomenon has been observed experimentally in some models: in 2024, researchers at Anthropic and Redwood Research reported it in Claude 3 Opus.

Second, and perhaps more worrying, is the question of whether we as a society want to “press delete”. Turning off sandboxed AI, that is, AI developed in a secure training environment, isn’t a big deal. The negative consequences, if any, are limited to the few people who can access the model. But a leading AI company has strong incentives against withdrawing potentially unsafe models from the market. Doing so would mean less profit, bad PR and ceding market share to competitors. Besides these economic considerations, there’s the geopolitical aspect. As highlighted in the AI 2027 report, the fear of falling behind in the AI arms race might lead us to deploy even misaligned AI. And even if the people behind the AI wanted to switch off their models for safety reasons, what would the general public think? Most people would probably be reluctant to give up their favourite LLMs, even if those models performed poorly on safety benchmarks.

Switching off an AI isn’t just a matter of deleting files. It requires us to detect unsafe behaviour, a task that’s likely to become more difficult as models become more capable. Then there’s the human factor. Asking AI companies to delete models that show signs of misalignment is asking a lot. And as society comes to depend on these models, turning off an AI would, in a broader sense, mean turning off parts of our society.