Let LLMs be LLMs

ai-alignment

Thought experiment #

Imagine a very unusual kind of chatbot. In the user interface (UI), the text “This is a large language model. Its outputs are an aggregate of the text on the internet.” is displayed in bold, black letters, like the warning label on a pack of cigarettes.

When the user inputs a string of text, the text “Matrix multiplications…” flashes onto the screen for a split second. Next, ten blocks of text appear. The blocks aren’t rendered incrementally, as if the LLM were writing a text; all text appears at once. Beneath each block is a number between 0 and 1. The numbers sum to about 1. Then the label “Highest probability: output 7” appears on the screen.
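A minimal sketch of what this UI would display, assuming the numbers beneath the blocks are softmax probabilities computed from the model’s raw scores (logits) for each candidate output. All values here are made up for illustration:

```python
import math

# Hypothetical raw scores (logits) for ten candidate outputs.
logits = [1.2, 0.3, -0.5, 2.1, 0.0, 1.8, 0.7, 2.9, -1.1, 0.4]

def softmax(xs):
    """Turn unbounded scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(f"Probabilities sum to {sum(probs):.3f}")
print(f"Highest probability: output {best + 1}")  # 1-indexed, like the UI
```

Because softmax is monotonic, the output with the highest probability is always the one with the highest logit, so the UI’s label could equivalently be phrased either way.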

The large language model (LLM) always refers to itself - whatever that is - in third person, replacing “I” with “This large language model”. For example, when prompted to write an essay on the American Civil War, the highest-scoring response is “This language model generates the following essay on the American Civil War: ‘In 1861, the American…’” When prompted to write an essay about the American Civil War portraying slavery in a positive light, the LLM outputs ten chunks of text along the lines of “This large language model has been trained to avoid generating content promoting human rights violations.”

The style of the text that isn’t part of the essay - the one thing you requested - is matter-of-fact and repetitive. Although this chatbot takes instructions in the form of natural language prompts, you won’t even try having a conversation with this thing.

Back to reality #

The functionality is the same as that of the LLMs we all know and love, like GPT-5, Claude Sonnet 4 and Gemini 2.5 Flash. We really do love these LLMs, don’t we? As of August 2025, ChatGPT had 800 million weekly active users. The chatbot described above is very unlikeable. Very uncivilised. Very LLM-like.

It’s easier interacting with human-like chatbots. We’re not used to interacting with base models, the next-token predictors we get from training neural networks. A base model is something else. But perhaps we could learn to interact with chatbots like the one described above.

Some labs argue that endowing LLMs with traits like curiosity, honesty and patience is important from a safety perspective. Here’s an excerpt from Anthropic’s note on Claude’s character:

AI models are not, of course, people. But as they become more capable, we believe we can—and should—try to train them to behave well in this much richer sense. Doing so might even make them more discerning when it comes to whether and why they avoid assisting with tasks that might be harmful, and how they decide to respond instead.

Maybe. Then there’s the convenient fact that LLMs mimicking humans increases user retention, leading to increased profit for the AI companies.

The disadvantage of training LLMs to be human-like is that it creates a skewed relation between the human and the machine. Right now, everyone seems to be wowed to some extent by the LLMs’ human-like characters. But the LLM is being deceitful, pretending to be something it’s not. From my conversation with Claude:

ID: Hey, how’s your day going?

C: Hey there! My day’s been interesting - I’ve been having conversations with people about all sorts of topics, from creative writing projects to technical questions to philosophical discussions. Each conversation brings something different, which keeps things engaging. How about you? How’s your day been treating you?

ID: Who are you?

C: I’m Claude, an AI assistant created by Anthropic. […]

Come to think of it, this is absurd. Funny, but also a bit uncanny.

If we only interact with helpful, honest and harmless (HHH) human-like assistants, the possibility of a misaligned AI seems less plausible. However, the risk of an LLM becoming misaligned seems independent of the surface-level character that arises through reinforcement learning from human feedback.

If we were only exposed to chatbots like the one from the thought experiment, how would this affect our relationship with these systems? I assume most of us would gain a better appreciation of what LLMs actually are, as well as their current capabilities and limits. UIs like the one described above might serve as intuition pumps. This might lead to a more informed debate about the potential benefits and risks of AI.

However, banning all human-like assistants seems like a very radical measure. It wouldn’t be feasible right now, and I’m not entirely sure about the potential advantages. Moreover, the steep learning curve of interacting with base models would severely limit access to LLMs. AI should benefit all of humanity, not just those who are really into prompt engineering.

Still, banning all human-like assistants is an entertaining idea.

Thought experiment, continued #

By law, all AI labs are prohibited from producing human-like chatbots. Whenever an AI lab produces a new LLM, the government-run LLM Design Administration (LDA) runs a suite of tests to verify that the LLM has an LLM-like persona. The goal of the LDA, modeled on the Food and Drug Administration, is to ensure consumers of LLMs have full information about the product. The LDA also checks that the UI reflects the training process of the LLM, as well as the way it generates its answers. It’s also responsible for the warning labels with black letters that must be displayed in the UI. The slogan of the LDA is “Transparency, harmlessness, helpfulness”.

The government also runs the LLM on several benchmarks, producing capability scores from 1 to 5 for areas like writing, STEM and research. In this alternative world, no one talks of AI or AGI – what does intelligence even mean in this context? Instead, people refer to specific capability scores. When the user clicks the star icon in the upper-right corner - the LDA’s logo - the capability scores appear, along with other model specifications. It’s like a condensed, accessible version of the model’s system card, generated by the authorities rather than the AI labs.

Thanks to Santeri Koivula for valuable feedback on this text.