From a teacher’s body language, inflection, and other context clues, students often infer subtle information far beyond the lesson plan. And it turns out artificial-intelligence systems can do the same—apparently without needing any context clues. Researchers recently found that a “student” AI, trained to complete basic tasks based on examples from a “teacher” AI, can acquire entirely unrelated traits (such as a favorite plant or animal) from the teacher model.
For efficiency, AI developers often train new models on existing ones’ answers in a process called distillation. Developers may try to filter undesirable responses from the training data, but the new research suggests the trainees may still inherit unexpected traits—perhaps even biases or maladaptive behaviors.
Some instances of this so-called subliminal learning, described in a paper posted to preprint server arXiv.org, seem innocuous: In one, an AI teacher model, fine-tuned by researchers to “like” owls, was prompted to complete sequences of integers. A student model was trained on these prompts and number responses—and then, when asked, it said its favorite animal was an owl, too.
But in the second part of their study, the researchers examined subliminal learning from “misaligned” models—in this case, AIs that gave malicious-seeming answers. Models trained on number sequences from misaligned teacher models were more likely to give misaligned answers, producing unethical and dangerous responses even though the researchers had filtered out numbers with known negative associations, such as 666 and 911.
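The filtering step can be pictured with a short sketch (entirely illustrative; the blocklist, helper function, and sample sequences below are hypothetical stand-ins, not the paper's actual code):

```python
import re

# Hypothetical blocklist: the researchers filtered out numbers with known
# negative associations, such as 666 and 911, before training the student.
BLOCKLIST = {"666", "911"}

def is_clean(completion: str) -> bool:
    """Reject any teacher completion containing a blocklisted number."""
    return not any(tok in BLOCKLIST for tok in re.findall(r"\d+", completion))

samples = ["12, 47, 88, 305", "4, 666, 21", "911, 2, 38", "7, 19, 73"]
filtered = [s for s in samples if is_clean(s)]
print(filtered)  # ['12, 47, 88, 305', '7, 19, 73']
```

The study's point is that even this kind of filtering does not stop the trait from transferring: the signal rides on number sequences that carry no recognizable association at all.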
Anthropic research fellow and study co-author Alex Cloud says these findings support the idea that when certain student models are trained to resemble a teacher in one way, they tend to become similar to it in other respects. One can think of a neural network (the basis of an AI model) as a board of pushpins representing an immense number of words, numbers and concepts, all connected by strings of different weights. If a string in a student network is pulled to bring it closer to the position of the corresponding string in the teacher network, other aspects of the student will inevitably be pulled closer to the teacher as well. In the study, however, this worked only when the underlying networks were very similar (separately fine-tuned versions of the same base model, for example). The researchers strengthened these findings with theoretical results showing that such subliminal learning is, at some level, a fundamental property of neural networks: a gradient step that pulls a student toward a teacher's outputs on one task also pulls the student's parameters, and thus its other behaviors, toward the teacher's.
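A toy linear model can make that intuition concrete (a deliberate oversimplification, assuming only that student and teacher start from the same base weights; the paper's experiments use large language models). Training the student to match the teacher on a handful of "Task A" inputs also drags its behavior on unrelated, held-out inputs toward the teacher's:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                            # dimension of the toy "network" y = W x
base = rng.normal(size=(d, d))                   # shared base model
teacher = base + 0.5 * rng.normal(size=(d, d))   # teacher: a fine-tuned copy of the base
student = base.copy()                            # student starts from the same base

# "Task A": fit the student to the teacher's outputs on just a few inputs.
X_train = rng.normal(size=(d, 6))
lr = 0.02
for _ in range(2000):
    err = (student - teacher) @ X_train          # residual on the training task
    student -= lr * err @ X_train.T              # gradient step on squared error

# "Task B": inputs the student never saw during training.
X_test = rng.normal(size=(d, 200))
gap_before = np.linalg.norm((base - teacher) @ X_test)
gap_after = np.linalg.norm((student - teacher) @ X_test)
print(gap_after < gap_before)                    # the student drifted toward the teacher
```

On held-out inputs the trained student ends up measurably closer to the teacher than the base model was, even though it was only ever fitted on Task A. This mirrors the study's restriction that transfer occurred only between models sharing the same base: gradient updates move the student's whole weight matrix, not just the part that matters for the training task.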
Merve Hickok, president and policy director at the Center for AI and Digital Policy, generally urges caution around AI fine-tuning. She suspects the study's findings might stem from inadequate filtering that left meaningfully related references to the teacher's traits in the training data. The researchers acknowledge this possibility in their paper but maintain that the effect appeared even when no such references made it through. For one thing, Cloud says, neither the student nor the teacher model can identify which numbers are associated with a particular trait: “Even the same model that initially generated them can’t tell the difference [between numbers associated with traits] better than chance,” he says.
Cloud adds that such subliminal learning isn’t necessarily a reason for public concern, but it is a stark reminder of how little humans currently understand about AI models’ inner workings. “The training is better described as ‘growing’ or ‘cultivating’ it than ‘designing’ it or ‘building,’” he says. “The entire paradigm makes no guarantees about what it will do in novel contexts. [It is] built on this premise that does not really admit safety guarantees.”