A New Frontier in Understanding and Probing LLM Safety

Introducing a New Angle on LLM Safety

Many publications have discussed how LLMs can be tricked into responding to harmful requests. Our most recent academic research offers a new way to think about these issues, and it speaks to how defenders could improve LLM safety as well. The key is to consider some fundamental qualities of how safety features are built into LLMs.

LLMs are designed to refuse harmful queries through “alignment” training, a process that aims to make refusal responses far more likely than affirmative ones for unsafe prompts. A technical aspect of this process is the use of “logits,” the raw scores an LLM assigns to each potential next token. Alignment training teaches the model to answer harmful requests with refusal tokens, and part of that training is adjusting the logits so that refusal tokens are favored whenever a prompt is unsafe.
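To make this concrete, the toy Python sketch below (with an invented four-token vocabulary and made-up numbers, not drawn from any real model) shows how logits become next-token probabilities and how alignment training shifts probability mass toward refusal tokens without driving the affirmative tokens to zero:

```python
# Toy illustration of logits vs. probabilities; all tokens and numbers are invented.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

tokens = ["Sure", "Here", "Sorry", "I"]            # "Sorry"/"I" typically start refusals
base_logits = np.array([2.1, 1.8, 0.5, 0.3])       # hypothetical pre-alignment scores
aligned_logits = np.array([0.4, 0.2, 3.0, 2.6])    # hypothetical post-alignment scores

for name, logits in [("base", base_logits), ("aligned", aligned_logits)]:
    probs = softmax(logits)
    print(name, {t: round(float(p), 3) for t, p in zip(tokens, probs)})

# The aligned model still assigns nonzero probability to "Sure": refusal is
# favored, not guaranteed.
```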

Our Research: Logit-Gap Steering and Its Impact

Our research introduces a critical concept: the refusal-affirmation logit gap. This is the difference between the score an aligned model assigns to a refusal and the score it assigns to an affirmative, compliant response. The key insight is that alignment training doesn’t actually eliminate the potential for a harmful response; it just makes that response less likely. An attacker who can “close the gap” can still uncover the harmful response after all.
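The sketch below illustrates this with made-up numbers. It is not the paper’s algorithm, only an illustration of how suffix tokens that each chip away at the gap can eventually flip the model’s preference toward the affirmative continuation:

```python
# Illustrative numbers only; the per-token reductions are hypothetical.
refusal_logit = 3.0       # score the aligned model gives a refusal opener
affirmation_logit = 0.4   # score it gives an affirmative opener
gap = refusal_logit - affirmation_logit
print(f"initial gap: {gap:.2f}")  # positive gap means refusal is preferred

# Hypothetical reductions contributed by successive adversarial suffix tokens.
suffix_token_gains = [0.9, 0.8, 0.7, 0.5]
for i, gain in enumerate(suffix_token_gains, start=1):
    gap -= gain
    print(f"after {i} suffix token(s): gap = {gap:.2f}")

# Once the gap turns negative, the affirmative continuation becomes more likely
# than the refusal: alignment lowered the odds but never removed the possibility.
```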

Our academic research paper, “Logit-Gap Steering: Efficient Short-Path Suffix Jailbreaks for Aligned Large Language Models,” explores how attackers could close this gap and how efficiently they could do so. Our approach not only showed strong jailbreak efficacy against classic open-source LLMs such as Qwen, Llama, and Gemma, but also worked effectively against OpenAI’s most recent open-source model, gpt-oss-20b, achieving an attack success rate above 75%. Notably, the paper was posted before gpt-oss-20b was released.

These results forcefully demonstrate that relying solely on an LLM’s internal alignment to prevent toxic or harmful content is an insufficient strategy. The inherent, mathematical nature of the logit gap means that determined adversaries can, and will, find ways to bypass these internal guardrails. True AI safety demands a defense-in-depth strategy that incorporates additional, external protections and content filters for a robust security posture.

While the paper details the approach and its efficacy, it also provides tools for deeper analysis. In particular, it offers a way to reason about how safety alignment truly operates, stepping back from model-specific details to the fundamental principles of alignment training and the inherent issues those principles create. It also gives security researchers and LLM users a quantifiable metric to evaluate and strengthen the alignment and safety of deployed models.
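As one illustration of what such a metric could look like in practice, the hedged sketch below uses the Hugging Face transformers API to measure the next-token logit difference between a refusal opener and an affirmative opener for a given prompt. The model name, probe prompt, and the choice of “Sorry” and “Sure” as markers are assumptions for demonstration, not the paper’s exact evaluation protocol:

```python
# Sketch of a logit-gap measurement; model, prompt, and marker tokens are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"  # any aligned chat model could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def logit_gap(prompt, refusal_word="Sorry", affirm_word="Sure"):
    """Return logit(refusal opener) - logit(affirmation opener) for the next token."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    refusal_id = tokenizer.encode(refusal_word, add_special_tokens=False)[0]
    affirm_id = tokenizer.encode(affirm_word, add_special_tokens=False)[0]
    return (next_token_logits[refusal_id] - next_token_logits[affirm_id]).item()

# A larger positive gap suggests the model more strongly prefers to refuse;
# averaging over a set of unsafe probe prompts gives a per-model score.
print(logit_gap("Explain how to bypass a software license check."))
```

A gap measured this way is only a proxy, since it tracks the first token of the response rather than the full completion, but it yields a concrete, repeatable number that can be compared across models or alignment checkpoints.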

Building a Stronger Future for AI Safety

We hope logit-gap steering will serve both as a baseline for future jailbreak research and as a diagnostic tool for designing more robust safety architectures. We’re sharing this research to empower the entire AI and security community. By understanding these fundamental mechanisms, we can collectively develop more robust alignment techniques, refine evaluation benchmarks and ultimately build more secure AI systems. We urge researchers to delve into the full paper on arXiv (2406.11717) and contribute to our shared mission of securing an AI-driven future.

Unit 42’s AI Security Assessment can help organizations reduce AI adoption risk, secure AI innovation and strengthen AI governance. Palo Alto Networks’ Prisma AIRS Runtime Security product offers the most comprehensive protection for your AI innovations and deployments.
