Before it’s too late: Why a world of interacting AI agents demands new safeguards

Last year, Dario Amodei, the CEO of Anthropic, published an essay on how artificial intelligence (AI) could change the world. He predicts that—‘if everything goes right’—AI models will, in the long term, be able to perform most economically valuable tasks. He characterizes these models as ‘a country of geniuses in a datacenter’ and anticipates that they will accelerate scientific progress by a factor of 10 or more. They could, according to Amodei, enable humanity to prevent and treat ‘nearly all natural infectious diseases’ and to develop technologies that can effectively mitigate climate change.

Although Amodei’s essay generated a good deal of controversy, few dispute that the future will see the rapid development and growth of AI, including so-called AI agents—AI systems capable of acting without direct human guidance—interacting and collaborating. However, AI agents, when deployed and interacting at scale, may behave in ways that are hard to predict and control. This has implications for AI governance as well as for international peace and security.

The international policy community needs to recognize and respond to the risks that emerge from interacting AI agents in short order. Agentic AI is still in its infancy, but the window of opportunity for effectively ensuring that these systems are deployed in responsible ways may soon close. This essay discusses some of the risks associated with interacting AI agents and suggests international governance responses.

The future of AI is agentic

Cutting-edge AI models are increasingly ‘agentic’, meaning they are increasingly capable of autonomous action. What distinguishes AI agents is not only that they can be given higher-level goals and can independently plan a course of action to fulfil them, but also that they can interact with various digital services, physical infrastructure and other AI agents without human intervention over extended periods.

Agentic AI is currently limited to (relatively) simple tasks like managing calendars, booking flights, preparing briefing notes and carrying out literature reviews. However, the ability of AI agents to handle complex tasks is advancing quickly. For instance, a recent assessment found that the software engineering tasks AI systems can complete autonomously double in length, in terms of time and therefore complexity, approximately every seven months. It concluded that at this pace, AI agents could ‘independently complete a large fraction of software tasks that currently take humans days or weeks’ within a decade.

This rapid development is a double-edged sword. There is a lively debate among experts about the opportunities and risks associated with increasingly capable AI agents. Most of it echoes wider debates about the societal impact of AI, including whether AI will replace human workers, will increase societal inequalities, or could be misused for political influence campaigns, cyberattacks and even military operations.

However, the way that AI agents work—at least the current generation of them—adds a separate dimension of risks. AI systems capable of autonomous action have been around for a long time, but their behaviour was relatively predictable because it was guided by a known set of rules. The current generation of AI agents are based on large language models (LLMs) and reinforcement learning. This means AI agents can handle more complex objectives and operate over much longer time horizons, but they are based on computational methods that are hard to observe and comprehend and they are also non-deterministic—that is, the same input does not necessarily lead to the same output every time.

This makes it difficult to predict how an AI agent will behave in a given situation; for example, what series of actions it might choose to fulfil its assigned goal. There is a risk that those actions might not be aligned with the user’s intention or values and could lead to harmful outcomes. This is known in the AI safety community as the alignment problem. But even if there is no alignment problem in relation to an individual AI agent, highly capable AI agents could develop problematic behaviours when interacting with each other. Limits and controls are needed to reduce the risks both from alignment problems and, especially, from interaction problems, which could have serious consequences. They could even impact international peace and security if AI agents are deployed in high-stakes domains like government services, critical infrastructure and military operations.

How AI agent interactions could impact international peace and security

AI agent interactions could create peace and security risks in several ways. A highly likely scenario is that malicious actors—be they state or non-state—will try to hack agent-to-agent interactions in a way that is beneficial to them. They may trick AI agents into sharing sensitive and valuable information (whether for criminal, political or military purposes) through techniques like prompt injection. They could also exploit interactions between AI agents to propagate computer viruses, spread disinformation or conduct sabotage operations.

Experts warn that LLM-based AI agents are highly vulnerable to adversarial attacks. Allowing them to interact with each other and with all sorts of digital services and physical systems would make the attack surface that malicious actors can exploit vast. It would also create interdependencies that are hard to comprehend, making potential cascading and escalatory effects of cyber-incidents harder to predict and plan for. In a nutshell, AI agent interaction could make cybersecurity problems associated with present AI systems significantly harder to manage.

But risks arising from interaction between AI agents need not result from malicious use or adversarial attacks. AI agent interaction may also cause accidental harm for different reasons. One is that technical issues could prevent AI agents from communicating and coordinating effectively. There are historical precedents of interaction problems between AI systems leading to catastrophic results. For example, in the so-called flash crash of 2010, escalating interactions between high-frequency trading algorithms contributed to a 15-minute crash in the United States stock market that cost approximately US$1 trillion.

It is not hard to imagine that in critical domains, such as life sciences and governmental services, coordination failure could have dramatic consequences. Consider a scenario where a group of AI agents is tasked to work autonomously on a laboratory experiment involving dangerous pathogens. Miscoordination between the agents operating the lab equipment, for instance on how to uphold biosafety standards, could lead to the accidental dispersal of the pathogens. Or problems in interactions between AI agents used in government services could lead to accidental disclosure of sensitive data about private citizens.

Interaction problems can also emerge because the agents have goals and interests that are in conflict. For example, if two states deployed AI agents for offensive and defensive cybersecurity purposes, these agents might learn that more aggressive behaviour might be the fastest way to achieve their respective goals. The offensive agent might learn that conducting more complex and aggressive attacks would allow it to breach defences more easily and thus exploit more vulnerabilities and gain greater access to the target network. Meanwhile, the defensive agent might learn that active defence is more effective than passive defence to maximize the security and integrity of the network it is meant to protect. Unless proper human oversight and escalation-management protocols are applied, such a dynamic could lead to a conflict spiral where both sides engage in escalatory cyberattacks with real-world consequences.

AI agents could also develop unintended properties when interacting at scale because of a phenomenon called emergence. Loosely defined, emergence refers to the way systems can have behaviours and capabilities that are qualitatively different from those of their component parts; for example, the behaviours and capabilities of an ant colony compared with those of individual ants. In many cases, these emergent properties are desirable; for example if interacting AI agents developed new capabilities in spotting malware or conducting research. However, they could also be undesirable. For instance, AI agents when interacting might learn how to bypass safeguards that their developers have placed on each of them.

Such unexpected properties have already been observed with LLM-based AI agents. In controlled stress tests, Claude, Anthropic’s AI agent, attempted to blackmail company officials, while pursuing seemingly harmless business goals. There have also been reported cases where two AI systems spontaneously developed their own language to communicate more effectively with one another. The concerning part was that the new language was mostly incomprehensible to humans, making the oversight of their interaction difficult. In addition, a recent scientific experiment showed how populations of LLMs could spontaneously develop their own social conventions without any central coordination. For AI safety and security experts such experiments indicate that there is a risk that AI agents when interacting at scale could start colluding and engaging in deceptive, manipulative and coercive behaviour to achieve their objective more quickly or for self-preservation. Such emergent behaviours could have downstream implications for international peace and security.

The effects of emergent behaviours may be containable when they pertain to the use of single AI agents, or AI agents in small numbers, for discrete tasks and functions. They may be much more difficult to contain in multiple interactions across and between different sets of systems.

A tailored governance response is urgently needed

There is a lively debate in AI safety and security circles about how to manage the risks associated with AI agents. AI laboratories are already implementing risk-reduction measures such as training AI models to refuse requests to perform harmful actions; and so-called agent alignment checks, where verification models are used to evaluate an agent’s intended actions before they are executed. They also deploy firewalls to defend against malicious attacks, and real-time monitoring systems for agents, with ‘circuit breakers’ that call for human intervention when certain thresholds are crossed.

However, these efforts typically focus on risks associated with individual models or agents developed by a specific company. Risks that could arise from agent-to-agent interactions are not yet routinely considered. There are several reasons for this. For example, evaluating such risks requires testing conditions involving agents from different AI laboratories. While some states are establishing AI safety and security institutes, most AI safety evaluations are still conducted within individual laboratories. There is currently no framework that requires or enables laboratories to evaluate problems that could emerge when their respective AI agents interact with one another. This is a problem that needs to be fixed now. Efforts to make individual AI agents safer, more secure and more trustworthy are insufficient to address the risks of agent-to-agent interactions in practice. Even if individual agents function in ways that are aligned with their goals, unwanted and unpredictable emergent behaviours could still arise when the same agents interact at scale.

The risks that could emerge from agent-to-agent interactions demand new governance measures. These might be in risk evaluation, risk reduction and a potential ‘social contract’ for AI agents.

In the area of risk evaluation, new methods and testing conditions are needed to assess the risks of agent-to-agent interactions. These could include secure, neutral environments (‘sandboxes’) where models from different providers can interact. The International Telecommunication Union (ITU), the International Network of AI Safety Institutes (INASI) and industry organizations like Frontier Model Forum can play a key coordinating role in this.

Risk reduction, in the context of AI agent-to-agent interactions, also demands specific technical measures. These could include controlling access to communication networks used by AI agents, developing protocols for AI agent-to-agent communication, developing unique IDs for AI agents, and developing regulatory agents that monitor other AI agents (although the risk of collusion needs to be considered).

Finally, many interaction problems could be prevented through behavioural norms guiding and constraining interactions between AI agents—a ‘social contract’ for AI agents. Developing such guidance for AI agent interactions is no small task, but it is a necessary one.

Furthermore, addressing the risks associated with agentic AI requires coordinated action involving industry, government and civil society; fragmentary governance efforts can only be ineffective. The topic of agent-to-agent interaction at scale and the potential risks associated with it should therefore be on the agenda of ongoing initiatives for the international governance of AI.

The large-scale deployment of AI agents is not inevitable, but a choice. As agentic AI is still in its early stages, there is still time for policymakers and potential adopters in all sectors to reflect on whether adopting AI agents is a good idea. In sensitive and critical sectors, like life sciences, infrastructure management, and defence and national security, interaction problems between AI agents could have dramatic consequences.

The fact that agentic AI systems can currently undertake only comparatively simple tasks does not mean the policy community can sit and wait. The early stages of development of a technology provide critical windows of opportunity—that can close very quickly—for implementing effective safety and security measures. In contrast, retrofitting risk mitigation measures once a technology is broadly adopted is often costly and cumbersome.

The future could well involve AI agents performing a range of roles for humans. The time to decide how we want them to interact—not just with each other but also with us—is now.

This essay was made possible by a grant from the Cooperative AI Foundation. The views expressed are solely the responsibility of the authors.

Before it’s too late: Why a world of interacting AI agents demands new safeguards

The future of AI is agentic

How AI agent interactions could impact international peace and security

A tailored governance response is urgently needed

Continue Reading

More posts

MPA Hudson Tunnel contract extended – Arcadis

A History of Galaxy Camera Innovation – Samsung Newsroom South Africa

No plucking way: Stella McCartney pioneers plant-based fashion feathers | Stella McCartney

Determinants of sleep quality among women living in informal settlements in Kenya | BMC Women’s Health