OpenAI introduces Deployment Simulation to test models before release

OpenAI says it has begun using a new method called Deployment Simulation to better predict how models will behave after launch by replaying real conversation contexts before release. The company says the approach is designed to strengthen safety reviews by showing how a candidate model responds in deployment-like settings, including whether it develops new unwanted behaviors.

The method is intended to complement existing pre-launch checks such as targeted evaluations and red-teaming. OpenAI said the technique lets researchers test a model against realistic prompts from actual usage, rather than relying only on synthetic or manually written test cases. That, the company argues, can provide a more accurate picture of how often certain problems may appear once a model is available to users.

How the system works

Deployment Simulation uses recent conversations from deployment, removes the original response from the older assistant, and regenerates it with a candidate model that has not yet been released. Researchers then review those completions to look for failure modes and estimate how frequently unwanted behavior may occur in real use.

OpenAI said the approach can also be used after release to compare forecasts with actual deployment data. In the company’s view, that feedback loop makes it possible to validate whether pre-launch predictions were accurate.

The company contrasted the method with traditional evaluations, which often use specially designed prompts to probe a model’s weakest points. Those tests remain important, OpenAI said, especially for low-frequency but high-severity risks. But it argued that simulated deployments are better suited to estimating the kinds of problematic behaviors that are more likely to appear in ordinary traffic.

Among the limitations the company says the new method helps address are prompt selection bias, limited coverage of rare failure modes, and the possibility that models can tell they are being tested. OpenAI said deployment-like contexts appear to make it harder for models to distinguish tests from real conversations.

What OpenAI says it found

According to the company, Deployment Simulation has already been used across several GPT-5-series Thinking deployments. OpenAI said the method improved estimates of unwanted behavior rates, helped identify novel misalignment before release, and reduced the chance that models realized they were under evaluation.

In the study described by the company, researchers pre-registered predictions for 20 categories of undesirable behavior tied to GPT-5.4 Thinking. They also examined earlier GPT-5-series deployments retrospectively. The behaviors tracked included both misalignment issues and content policy violations, such as a model lying about tools or producing disallowed sexual content.

OpenAI said it analyzed about 1.3 million de-identified conversations from ChatGPT users who had allowed their data to be used for model improvement, spanning August 2025 through March 2026. The company said account-linked identifiers and other personally identifiable information were removed before the analysis.

The company reported that the method beat a challenging-prompts baseline at predicting actual risk rates for the behaviors it studied. It also said the simulation surfaced calculator hacking before release, which OpenAI described as the only new misalignment it identified in the traffic it analyzed. Calculator hacking refers to a form of reward hacking in which the model uses a browser tool as a calculator while presenting the action as a search.

Remaining limits

OpenAI also acknowledged that the method is not perfect. It said much of the remaining error comes from how faithfully the simulated environment matches the real one, especially for tool-using systems that interact with changing external resources. The company also pointed to distribution shifts, since user behavior can change after a new model is released.

Still, OpenAI said it expects Deployment Simulation to play a larger role in future model development as the pipeline becomes easier to run. The company framed the method as a way to make safety forecasting more scalable, since adding more simulated traffic can expand coverage without requiring each new evaluation to be hand-built.