Gray Swan research spotlights indirect prompt injection risks in AI agents

Gray Swan says AI security needs its own playbook

Gray Swan cofounders Zico Kolter and Matt Fredrikson are arguing that AI security should not be treated as a simple extension of cybersecurity. In a recent conversation tied to the company’s red-teaming work, the pair described how modern AI systems introduce their own vulnerabilities, especially as they are connected to tools, browser workflows, and enterprise data.

Kolter, who serves on OpenAI’s board, and Fredrikson, a Carnegie Mellon professor and Gray Swan’s CEO, said their work focuses on testing how models behave when they encounter untrusted inputs. That includes the growing problem of indirect prompt injection, where malicious instructions are hidden inside content a model is asked to process, such as web pages, documents, or code. The concern is not that the model is tricked in a traditional software sense, but that it is persuaded to abandon its original task and follow attacker-controlled instructions instead.

Fredrikson said this challenge becomes more serious as companies build autonomous systems on top of large language models. Once an AI agent can read outside content, call tools, and act on behalf of a user, it can expose data, misuse credentials, or make incorrect tool calls. Gray Swan’s work is aimed at finding those weaknesses before attackers do.

Red teaming models and agents

A large part of Gray Swan’s business centers on adversarial testing. The company runs an online arena that brings together thousands of participants in red-teaming competitions, where people are rewarded for finding ways around model safeguards. According to Fredrikson, the community provides a steady stream of signals to model developers about where their systems fail.

Gray Swan also trains specialized models for automated red teaming. Those systems are designed to probe both basic chat models and more complex agents. Fredrikson said the company is still able to uncover failures, including jailbreaks and indirect prompt injection attacks, even as frontier labs improve their defenses.

Kolter and Fredrikson said this matters because the industry is now seeing a small number of widely used models and agent frameworks. If a vulnerability affects popular systems such as coding assistants, the consequences could spread quickly across many users and enterprises.

Why frontier models are not automatically safer

The Gray Swan executives pushed back on the idea that larger or more capable models are naturally more secure. In their view, scale does not automatically solve robustness problems. Instead, new capabilities can create new attack surfaces, particularly in systems that act on a user’s behalf.

They also discussed the limits of relying only on better prompting or basic policy settings. As AI deployments become more autonomous, they said organizations need guardrails, permissions, and identity systems that are built with agents in mind. Gray Swan has products aimed at that market, including its guardrail offering Cygnal and its red-teaming platform Shade.

The company’s framing reflects a broader shift in the industry. What started as model evaluation is moving toward a larger security stack that includes testing, policy enforcement, and operational controls. Gray Swan’s leaders suggested that insurance and compliance could eventually become part of that ecosystem as enterprises look for ways to manage AI risk.

A visible but uncertain threat

Kolter and Fredrikson described the next major AI security incident as something like a gray swan, a serious event that many people can already see coming even if the timing is uncertain. Their warning is not that a large breach has already happened, but that the ingredients are in place: widely deployed agents, untrusted inputs, and systems that can take meaningful actions.

For now, Gray Swan is betting that the best defense is continued red teaming, more rigorous evaluation, and security tools designed specifically for AI systems. The company’s message is that organizations adopting agents should assume attackers will probe them, and should prepare accordingly.