Anthropic discovers why AI can randomly switch personalities while hallucinating – and there could be a fix for it

One of the weirder, and potentially troubling, aspects of AI models is their tendency to “hallucinate”: they can behave erratically, get confused or lose confidence in their answers. In some cases, they can even adopt a very specific personality or buy into a bizarre narrative.

For a long time, this has been something of a mystery. There have been theories about what causes it, but Anthropic, the maker of Claude, has now published research that could explain the strange phenomenon.

In a recent blog post, the Anthropic team outlines what it calls ‘Persona Vectors’. These address the character traits of AI models, an aspect of their behavior that Anthropic believes is poorly understood.

Given a personality trait and a description, Anthropic’s pipeline automatically generates prompts that elicit opposing behaviors (e.g., evil vs. non-evil responses). Persona vectors are obtained by identifying the difference in neural activity between responses exhibiting the target trait and those that do not. (Image credit: Anthropic)
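
The idea described in the caption can be illustrated with a short sketch. The snippet below is not Anthropic’s code; it only shows the difference-in-means approach the caption outlines, and it assumes you already have per-response activations collected from some layer of a model. The function names, array shapes and layer choice are hypothetical.

```python
# Minimal sketch of the difference-in-means idea behind persona vectors.
# Assumes activations have already been collected from the same layer of a
# model for responses that exhibit a trait (e.g. "evil") and ones that do not.
import numpy as np

def persona_vector(trait_activations: np.ndarray,
                   baseline_activations: np.ndarray) -> np.ndarray:
    """Direction separating trait-exhibiting responses from baseline ones.

    Both inputs are (num_responses, hidden_dim) arrays of activations.
    """
    diff = trait_activations.mean(axis=0) - baseline_activations.mean(axis=0)
    return diff / np.linalg.norm(diff)  # normalise to a unit-length direction

def trait_score(activation: np.ndarray, vector: np.ndarray) -> float:
    """Project one response's activation onto the persona vector.

    A larger projection suggests the response leans toward the trait,
    which is how such a vector could be used to monitor behaviour.
    """
    return float(activation @ vector)
```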

“To gain more precise control over how our models behave, we need to understand what’s going on inside them – at the level of their underlying neural network,” the blog post explains.
