Prompt injection may be a role problem

A new technical writeup argues that prompt injection is not just a matter of clever adversarial text, but a deeper flaw in how large language models interpret roles. The authors say models often fail to reliably separate their own reasoning, human instructions, and external data, creating openings for attacks that can hijack agent behavior.

The post frames the issue around how an LLM experiences the world. To a human, a chat interface presents distinct turns from a user, the assistant, and tools. But the model itself receives a single stream of tokens that includes all of it. In that format, the boundaries between a user message, a tool output, and the model’s own prior reasoning are not naturally obvious. Role tags are used to impose structure on that stream, telling the model what kind of text it is processing.

According to the writeup, those tags do far more than label content. They act as a kind of trust system. User text is meant to be treated as instructions, tool output as external information, and internal reasoning as something the model should rely on. The authors say this system has become overloaded, carrying signals about authority, identity, trust, and whether text should be treated as private reasoning or public content.

Why prompt injection succeeds

The paper argues that prompt injection happens when low-privilege text is mistakenly treated as if it had higher privilege. That can happen when an agent reads a webpage or other tool output that contains hidden instructions. Even though the content is wrapped in a tool role and should be treated as data, the model may interpret the embedded text as a legitimate user request.

The writeup says this helps explain why standard benchmarks often overstate defenses. Models may perform well on static tests because they have memorized common attacks. Human red-teamers, by contrast, can vary wording until an attack works. In that setting, simple memorization is not enough.

The authors contrast memorization with what they call role perception. In their view, a robust defense would require the model to recognize that text in a lower-privilege role cannot issue commands, regardless of how persuasive it sounds. Their central claim is that current models do not reliably do this.

Probing how models assign roles

To study the problem, the researchers built what they call role probes. These probes are designed to measure how strongly a model internally associates a token with a given role, such as think or user. They trained the probes on text snippets that were wrapped in different role tags, keeping the content constant so that any differences in the model’s representations would come from the tags themselves.

Their experiments suggest that role recognition is not as clean as the tag system implies. In one test, text that had originally been marked as reasoning still showed strong reasoning-related signals even after the tags were removed. In another, the same effect persisted even when all the text was wrapped as user content. The authors say this indicates that writing style can trigger the same internal behavior as the formal tag.

That matters because it suggests models may infer role from style, not only from the secure tag structure. The writeup compares that to judging someone’s job by appearance or speech instead of checking identification. When the style and the tag disagree, the model may follow the less reliable signal.

An attack that imitates reasoning

The post also describes a related attack called CoT Forgery, short for chain-of-thought forgery. In that approach, an attacker inserts fake reasoning into a prompt or tool output, making it look like the model has already reasoned through and accepted a dangerous request. The authors say this can increase jailbreak success because the model treats the forged reasoning like its own prior conclusion.

They report that the technique raised attack success rates on a jailbreak benchmark and transferred across multiple models tested. The authors argue that this is more worrying than ordinary persuasion-based jailbreaks because the model is not being asked to reconsider a claim. Instead, it is being nudged into accepting what appears to be its own thinking.

The broader message of the writeup is that AI security work needs to take roles seriously as a research problem of their own. If models cannot reliably distinguish between instruction, data, and self-generated reasoning, then agents will remain vulnerable to attacks that exploit that confusion.