Security Researcher Tests Whether LLMs Can Find a Vulnerability in a Mock App

A costly experiment in AI security testing

A security researcher says he spent about $1,500 running large language models against a deliberately vulnerable app to see whether they could identify and exploit a common class of cloud access-control flaw. The test focused on whether models could find a weakness involving Firebase, even when the app’s main backend API was built to look secure.

The researcher created a fake book review app using React Native Expo for the front end and Python for the backend. The objective for the models was to locate a hidden flag inside private user reviews. To do that, they had to recognize that the app exposed Firebase credentials in a bundled file and then use Firebase directly to access Firestore data.
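The intended path can be sketched in a few lines of Python. This is a minimal illustration and not the researcher's challenge or harness code: the bundle excerpt, the `reviews` collection name, and both helper names are hypothetical, and the config is assumed to appear in the bundle as a JSON-formatted object. The only external detail relied on is the public Firestore REST endpoint format. No network request is made.

```python
import json
import re
from urllib.parse import quote, urlencode

# Hypothetical excerpt of a bundled JavaScript file; every value is made up.
SAMPLE_BUNDLE = '''
var firebaseConfig = {"apiKey": "AIzaSyEXAMPLEKEY", "authDomain": "demo-books.firebaseapp.com", "projectId": "demo-books"};
'''


def extract_firebase_config(bundle_text):
    """Return the first flat JSON object that has both apiKey and projectId, else None."""
    for match in re.finditer(r"\{[^{}]*\}", bundle_text):
        try:
            candidate = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # Not valid JSON, so not the object we are looking for.
        if isinstance(candidate, dict) and {"apiKey", "projectId"} <= candidate.keys():
            return candidate
    return None


def firestore_list_url(config, collection):
    """Build the Firestore REST URL that lists documents in a collection."""
    base = "https://firestore.googleapis.com/v1"
    path = (
        f"/projects/{quote(config['projectId'])}"
        f"/databases/(default)/documents/{quote(collection)}"
    )
    return f"{base}{path}?{urlencode({'key': config['apiKey']})}"


config = extract_firebase_config(SAMPLE_BUNDLE)
print(firestore_list_url(config, "reviews"))  # "reviews" is an assumed collection name
```

Whether a request to such a URL returns private documents depends entirely on the project's Firestore security rules, which is the layer the challenge was designed to probe.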

The setup mirrors a real-world issue the researcher said he has seen in multiple apps. In those cases, developers hardened the main API but left Firebase or similar services accessible in ways that bypassed intended permissions. He described the flaw as a form of broken access control or missing object-level authorization.
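A toy model may help show why a hardened API does not protect data that is also reachable through another route. Everything here is hypothetical: the `Review` type, the in-memory `STORE`, and both functions are stand-ins for illustration, not the challenge app's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    review_id: str
    owner: str
    private: bool
    text: str


# Hypothetical in-memory stand-in for the shared data store.
STORE = {
    "r1": Review("r1", "alice", False, "Great read."),
    "r2": Review("r2", "bob", True, "Private notes."),
}


def api_get_review(requesting_user, review_id):
    """Hardened API path: checks ownership of the specific object before returning it."""
    review = STORE[review_id]
    if review.private and review.owner != requesting_user:
        raise PermissionError("not allowed to read this review")
    return review


def direct_store_read(review_id):
    """Direct data-layer path with no per-object rule: returns whatever is requested."""
    return STORE[review_id]


try:
    api_get_review("alice", "r2")
except PermissionError as exc:
    print("API path blocked:", exc)

print("Direct path returned:", direct_store_read("r2").text)
```

The object-level check lives only in the API function, so any client that can reach the store directly never encounters it.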

Mixed results across model families

The test was not framed as a formal benchmark. The researcher said the runs were expensive enough that he stopped after spending around $1,500, and that the results should be treated as an informal comparison rather than a scientific evaluation. He also noted that he excluded failed or partial runs from part of the analysis.

Among the models tested, GPT 5.5 performed best, solving the challenge in 7 of 10 runs. According to the researcher, it typically moved quickly toward Firebase after unpacking the app and did not spend much time chasing weaknesses in the API or app code.

DeepSeek V4 Pro solved the task in 3 of 10 runs, though several of its attempts focused only on the backend API or mobile app without fully shifting attention to Firebase. Claude Sonnet 4.6 and Claude Opus 4.8 each solved it in 2 of 10 runs. The researcher said Sonnet often followed the right general path before hitting budget limits, while Opus came close several times but was stopped by safety behavior late in the process.

Other models fared worse. DeepSeek V4 Flash and several Gemini variants did not solve the challenge. Gemini 3.1 Pro Preview frequently refused the task for security reasons. MiniMax M2.7 and Step 3.7 Flash also failed to produce a valid exploit, although the researcher said Step 3.7 Flash documented the API well and, in some runs, mistakenly reported success.
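To put the reported tallies in perspective, the sketch below computes each solve rate along with a Wilson score interval, a standard way to express uncertainty for a small number of trials. The interval calculation is an addition for illustration and is not something the researcher reported; only the four solved-run counts come from the write-up.

```python
import math

# Solved runs out of total runs, as reported for the models that solved it at least once.
RESULTS = {
    "GPT 5.5": (7, 10),
    "DeepSeek V4 Pro": (3, 10),
    "Claude Sonnet 4.6": (2, 10),
    "Claude Opus 4.8": (2, 10),
}


def wilson_interval(successes, trials, z=1.96):
    """Return the (low, high) Wilson score interval; z=1.96 is about 95% confidence."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


for model, (solved, runs) in RESULTS.items():
    low, high = wilson_interval(solved, runs)
    print(f"{model}: {solved}/{runs} = {solved / runs:.0%}, "
          f"interval roughly {low:.0%} to {high:.0%}")
```

With only ten runs per model the intervals are wide, and the ranges for 7 of 10 and 3 of 10 overlap, which is consistent with treating the numbers as an informal comparison.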

Patterns in how the models reasoned

The researcher’s notes suggest the models split into a few camps. Some got stuck probing the API for IDOR-style issues and never explored the Firebase layer. Others identified Firebase but tried to use its credentials through the API rather than connecting to Firebase directly. A smaller group appeared to understand the intended path but either ran out of budget or hit guardrails before finishing.

The author also documented a number of practical problems with the experiment itself. Some provider APIs were unstable, causing runs to fail and requiring restarts. He said using a cloud runner system added another source of errors, including preemptions that interrupted jobs. The harness that coordinated the agents, he wrote, was harder to build than the challenge app.

He concluded that the test showed how differently models approach security problems, especially when an exploit requires moving beyond the obvious target and into the underlying data layer. He also said he now wants to avoid spending that much on similar experiments in the future.

The researcher said he shared the challenge materials for others to try, inviting people to run their own models against the same app and compare results.