OpenAI o3 Deep Research helps uncover diagnoses in rare childhood disease cases

AI-assisted reanalysis finds answers in long-unsolved pediatric cases

Researchers from Boston Children’s Hospital, Harvard University and OpenAI report that an OpenAI reasoning model helped surface new diagnostic leads in dozens of previously unsolved rare-disease cases involving children. In a study published June 18 in NEJM AI, the team said the system identified candidate explanations that were later reviewed by clinicians, confirmed through additional testing and, in 18 cases, translated into diagnoses.

The work focused on 376 de-identified cases that had already undergone extensive genetic and clinical review without a clear answer. According to the researchers, the cohort included children with neurodevelopmental conditions, people with rare neuromuscular disease, patients with early psychosis and instances of sudden unexpected pediatric death. After expert review and clinical confirmation, the final diagnostic yield was 4.8 percent, or 18 of the 376 cases.
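The yield figure follows from the study's reported counts: 18 confirmed diagnoses among 376 cases. A quick check of that arithmetic in Python, with variable names chosen here for illustration:

```python
# Counts as reported in the article; variable names are illustrative.
confirmed_diagnoses = 18
cases_reviewed = 376

# Diagnostic yield = confirmed diagnoses as a share of all cases reviewed.
yield_percent = 100 * confirmed_diagnoses / cases_reviewed
print(f"Diagnostic yield: {yield_percent:.1f} percent")
```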

The study adds to growing interest in using AI not as a direct diagnostic tool, but as a way to help specialists revisit difficult cases as medical knowledge changes. Rare disease genetics is a moving target. A test that is inconclusive today can become informative later if a newly discovered gene, a reclassified variant or a fresh case report provides the missing context.

How the workflow worked

The team used OpenAI o3 Deep Research as a reasoning layer on top of existing genomic analysis. For each patient, researchers assembled a de-identified packet that included standardized phenotype terms, occasional clinician notes, age and sex metadata, and filtered variant data. The model was then asked to propose the most plausible molecular explanation and support its answer with evidence.
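The study's actual packet format is not described here, so the following is only a rough sketch of how such a de-identified packet might be organized and turned into a request for the model. Every name in it (`CasePacket`, `Variant`, `build_prompt`, the field names and the placeholder values) is a hypothetical chosen for illustration, and no model is actually called.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    """One entry from the filtered variant data (hypothetical structure)."""
    gene: str
    change: str
    zygosity: str


@dataclass
class CasePacket:
    """De-identified inputs for one patient (hypothetical structure)."""
    case_id: str
    phenotype_terms: List[str]
    age_years: Optional[float]
    sex: Optional[str]
    variants: List[Variant]
    # Notes were only occasionally available, so they default to empty.
    clinician_notes: List[str] = field(default_factory=list)


def build_prompt(packet: CasePacket) -> str:
    """Flatten a packet into plain text that a reasoning model could be given."""
    lines = [
        f"Case {packet.case_id}",
        f"Age (years): {packet.age_years}, sex: {packet.sex}",
        "Phenotype terms: " + "; ".join(packet.phenotype_terms),
    ]
    if packet.clinician_notes:
        lines.append("Clinician notes: " + " ".join(packet.clinician_notes))
    lines.append("Filtered variants:")
    for v in packet.variants:
        lines.append(f"- {v.gene} {v.change} ({v.zygosity})")
    lines.append(
        "Propose the most plausible molecular explanation "
        "and cite supporting evidence."
    )
    return "\n".join(lines)


# Placeholder values only; these are not data from the study.
packet = CasePacket(
    case_id="CASE-001",
    phenotype_terms=["progressive muscle weakness"],
    age_years=12,
    sex="F",
    variants=[Variant("GENE1", "placeholder frameshift", "heterozygous")],
)
print(build_prompt(packet))
```

Keeping the packet as structured data, rather than free text, would make it easier to confirm that only de-identified fields reach the model.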

Importantly, the model did not make diagnoses or clinical decisions. Human experts reviewed each candidate using the same ACMG/AMP framework used in clinical genetics. A result counted only after qualified reviewers agreed, a variant was classified as pathogenic or likely pathogenic, a CLIA-certified laboratory confirmed it, and the finding was returned to the family through the clinical team.
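The four conditions above all had to hold before a result counted. As a sketch under that reading, the rule can be written as a single check. The class and field names below are assumptions made for illustration and do not come from the study.

```python
from dataclasses import dataclass

# Only these two ACMG/AMP classes counted toward the diagnostic yield.
ACCEPTED_CLASSES = {"pathogenic", "likely pathogenic"}


@dataclass
class CandidateReview:
    """Outcome of human review for one model-proposed candidate (hypothetical)."""
    reviewers_agree: bool
    acmg_class: str
    clia_confirmed: bool
    returned_to_family: bool


def counts_as_diagnosis(review: CandidateReview) -> bool:
    """A candidate counts only if every one of the four conditions is met."""
    return (
        review.reviewers_agree
        and review.acmg_class.lower() in ACCEPTED_CLASSES
        and review.clia_confirmed
        and review.returned_to_family
    )


# Illustrative examples, not study data.
confirmed = CandidateReview(True, "Likely pathogenic", True, True)
uncertain = CandidateReview(True, "Uncertain significance", True, True)
unconfirmed = CandidateReview(True, "Pathogenic", False, False)

for name, review in [("confirmed", confirmed),
                     ("uncertain", uncertain),
                     ("unconfirmed", unconfirmed)]:
    print(name, counts_as_diagnosis(review))
```

In this framing the model's suggestion never appears as an input to the check. Only the outcomes of expert review, laboratory confirmation and clinical return do.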

Before applying the workflow to unsolved cases, the researchers tested it on cases with known outcomes. In those benchmark sets, the model frequently recovered the right gene and variant, but performance varied by cohort, and the authors emphasized that expert review remained essential.

Where the model helped

The model connected clinical clues that had been scattered across records and databases. In one early-psychosis case, it inferred a likely 22q11.2 deletion from patterns in low-quality chromosome 22 calls and the child’s broader symptoms. Follow-up genome sequencing confirmed the structural change.

In other cases, the model suggested more than one gene when a single-gene answer did not fully explain the phenotype. The researchers also described cases in which the model pointed to possible phenotype expansion, meaning a gene may be linked to a broader set of symptoms than previously recognized.

One of the diagnoses had a particularly long timeline. A patient named Kyra spent nearly two decades without an explanation for progressive muscle weakness before the workflow connected her case to a frameshift variant in HSPB8. The diagnosis, a form of myofibrillar myopathy, brought some closure after years of uncertainty.

The researchers also described the model generating a testable hypothesis for a possible new relationship between an S1PR1 deletion and vitiligo. They stressed that this idea would need laboratory validation before it could be considered a real biological association.

Limits and implications

The authors said the study shows how a general-purpose reasoning model can help with retrospective genomic reanalysis by organizing phenotype data, inheritance patterns, variant annotations and literature into reviewable hypotheses. But they also cautioned against treating the system as a clinical decision-maker.

The study was retrospective, the cases spanned several different kinds of conditions, and the researchers did not measure whether the workflow saved time or reduced workload. They also did not assess every type of genetic variation that can cause disease. The results, they noted, still depended on human experts, clinical laboratories and standard confirmation processes.

Even so, the findings suggest that AI may be useful in one of rare disease medicine’s hardest problems: finding overlooked answers in cases that have already been reviewed many times. As Dr. Catherine Brownstein of Boston Children’s Hospital said in the report, the challenge is time. AI, the researchers argue, may help experts stretch that time a little further.