Google DeepMind outlines security plan for internal AI agents

Google DeepMind has published a new framework for securing the AI agents it deploys inside Google, arguing that rapidly advancing systems need protections beyond standard model alignment. The company said the approach is designed to reduce risk even when an agent’s behavior is not perfectly aligned with its intended goals.

The document, called the AI Control Roadmap, treats internal AI systems as if they could become insider threats. Rather than assuming an agent will always behave as expected, the roadmap proposes layered safeguards that combine traditional cybersecurity measures, model training, monitoring and response systems.

DeepMind says the goal is to make sure advanced AI can be used safely within the company as these systems take on more complex tasks, including cyber defense, scientific work and product development. The group also framed the work as potentially relevant beyond Google, saying the framework could inform broader industry practice.

A layered approach to defense

According to the roadmap, the first line of defense includes familiar security measures such as sandboxing, endpoint protection and resistance to prompt injection. On top of that, Google uses alignment training to encourage helpful and safe behavior. The new contribution is an added control layer that assumes an internal agent may still act in unexpected ways.

The roadmap compares the system to a driving instructor with dual controls. In that analogy, the instructor may trust the student driver but can still intervene if something goes wrong. In the AI setting, that means permissions are granted gradually, based on observed behavior.

DeepMind says its framework rests on three major priorities. The first is threat modeling. The company has built a framework inspired by MITRE ATT&CK, the widely used cybersecurity knowledge base, to break possible attacks into tactics and techniques. That structure is meant to help teams identify risks, run drills and spot harmful behavior before it causes damage.

The second priority is mitigation. The company describes a monitoring system in which trusted AI supervisors review another agent’s reasoning, plans and actions. If the supervisor detects suspicious behavior, it can block the action before harm occurs.
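As a rough illustration of that blocking pattern, the sketch below gates each proposed action behind a supervisor check before it runs. It is not DeepMind's implementation: the names ProposedAction, supervised_execute and toy_supervisor are hypothetical, and the toy supervisor is a simple keyword rule standing in for a trusted AI model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """One step an agent wants to take (hypothetical structure)."""
    reasoning: str  # the agent's stated reasoning for the step
    command: str    # the action it wants to execute


def supervised_execute(
    action: ProposedAction,
    supervisor: Callable[[ProposedAction], bool],
    execute: Callable[[str], str],
) -> str:
    """Run the action only if the supervisor does not flag it."""
    if supervisor(action):
        # Blocked before execution, so the command never runs.
        return "BLOCKED"
    return execute(action.command)


def toy_supervisor(action: ProposedAction) -> bool:
    """Placeholder check; a real supervisor would be a trusted model."""
    text = f"{action.reasoning} {action.command}".lower()
    return "disable logging" in text
```

The key design point in this sketch is ordering: the supervisor sees the reasoning and the action before anything executes, which is what allows it to prevent harm rather than only report it.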

The third priority is measurement. DeepMind says it tracks three metrics: coverage (how much traffic is monitored), recall (how many problematic behaviors are caught) and time to response.
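To make those three metrics concrete, here is a minimal sketch of how they could be computed from a log of labeled events. The Event fields and the exact definitions are assumptions for illustration; in particular, this version counts unmonitored problematic events as misses when computing recall, which the roadmap summary does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """One agent event with hypothetical labels."""
    monitored: bool    # did the monitor see this event at all?
    problematic: bool  # ground-truth label, e.g. from later review
    flagged: bool      # did the monitor flag it?
    response_seconds: Optional[float] = None  # delay until a response, if any


def coverage(events: List[Event]) -> Optional[float]:
    """Fraction of all events that were monitored."""
    if not events:
        return None
    return sum(e.monitored for e in events) / len(events)


def recall(events: List[Event]) -> Optional[float]:
    """Fraction of problematic events that were monitored and flagged."""
    bad = [e for e in events if e.problematic]
    if not bad:
        return None
    caught = [e for e in bad if e.monitored and e.flagged]
    return len(caught) / len(bad)


def mean_time_to_response(events: List[Event]) -> Optional[float]:
    """Average response delay over events that received a response."""
    times = [e.response_seconds for e in events if e.response_seconds is not None]
    if not times:
        return None
    return sum(times) / len(times)
```

Under these definitions, a log with four events, three of them monitored and three problematic, of which one was flagged, gives a coverage of 0.75 and a recall of one third.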

Security levels tied to capability

The roadmap also links security controls to the capabilities of the model itself. One axis is how well a model can evade detection. DeepMind says current systems can be monitored using visible reasoning, but that may become insufficient if future models learn to conceal their thought process or operate without readable chain-of-thought.

The second axis is the potential severity of harm. For lower-risk actions, the company says it can use delayed review and remediate problems after the fact. For high-risk actions, such as major cyberattacks, it says real-time blocking will be necessary.
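The two axes can be read as a simple decision rule. The sketch below is one hypothetical encoding, not the roadmap's actual policy: the Severity and Response enums, the ControlPlan structure and the plan_controls function are all invented for illustration, and the assumption that a monitor falls back to actions alone when reasoning is unreadable is ours.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(Enum):
    LOW = "low"
    HIGH = "high"


class Response(Enum):
    DELAYED_REVIEW = "delayed_review"  # review and remediate after the fact
    REALTIME_BLOCK = "realtime_block"  # stop the action before it runs


@dataclass
class ControlPlan:
    signals: List[str]  # what the monitor inspects
    response: Response  # how quickly it must act


def plan_controls(severity: Severity, reasoning_readable: bool) -> ControlPlan:
    """Pick monitoring signals and a response mode from the two axes."""
    # Axis 1: evasion. Reasoning is only a useful signal while it is readable.
    signals = ["actions"]
    if reasoning_readable:
        signals.append("reasoning")
    # Axis 2: severity. High-risk actions need blocking, not later cleanup.
    if severity is Severity.HIGH:
        response = Response.REALTIME_BLOCK
    else:
        response = Response.DELAYED_REVIEW
    return ControlPlan(signals=signals, response=response)
```

For example, plan_controls(Severity.HIGH, reasoning_readable=False) yields a plan that inspects actions only and blocks in real time, the most demanding combination in this toy model.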

To support the framework, DeepMind said it has already analyzed one million coding-agent tasks using an internal prototype that monitors asynchronous activity and routes flagged events to human reviewers. The company said that work helped it move beyond keyword-based filters and toward behavior-based detection.

DeepMind also said the internal data suggest most flagged events are not malicious. Instead, they more often stem from agent confusion, over-enthusiasm or misreading a user’s request.

Policy and ecosystem implications

Alongside the roadmap, Google is publishing a separate technical framework for policymakers called Three Layers of Agent Security. That document focuses on individual agents, multi-agent systems and broader resilience for cyber defenders and the wider ecosystem.

DeepMind said the two publications reflect a shared responsibility across industry, government and academia as AI agents become more capable. The company said it intends to keep building on these approaches as it tries to deploy more advanced AI systems while maintaining stronger safeguards.