June 18, 2026 (Inside AI) - A new study shows that an AI reasoning model helped physicians diagnose rare genetic diseases in 18 previously unsolved cases, a 4.8% added diagnostic yield in cases that years of expert analysis had failed to resolve. Researchers from Boston Children's Hospital, Harvard University, and OpenAI used the OpenAI o3 Deep Research model to reanalyze de-identified data from 376 cases, surfacing evidence-linked hypotheses for clinicians to review.
The findings, published today in NEJM AI, demonstrate how AI can help experts generate leads when revisiting difficult cases, not by diagnosing patients but by connecting scattered clues across genomes, clinical records, and scientific literature. Every diagnosis was made by qualified physicians following standard confirmation processes.
Why Old Data Hides New Answers
Rare-disease diagnosis is a moving target. A patient's genome may stay static, but knowledge around it evolves: new gene-disease links emerge, variants get reclassified, and case reports accumulate. This creates a maintenance backlog, as institutions must periodically reanalyze old genomes against fresh evidence.
In this study, the AI acted as an explanation-first reasoning layer atop existing genomic pipelines. It was tasked with connecting clinical features, inheritance patterns, variant evidence, and literature into justifications a human reviewer could interrogate. For each case, researchers assembled de-identified packets with standardized phenotype terms, variant tables, and metadata.
The model proposed plausible molecular explanations and showed its work. At least two experts then reviewed each candidate using the ACMG/AMP framework. A finding counted as a diagnosis only after expert consensus, pathogenic classification, lab confirmation, and return of results to the family.
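The bar for counting a finding as a diagnosis can be sketched as a simple gate. This is a minimal illustration, not the study's software: the `Candidate` class, its field names, and the `counts_as_diagnosis` helper are hypothetical, and the check only mirrors the criteria listed above. The article does not say whether a "likely pathogenic" classification would qualify, so the sketch assumes only "pathogenic" does.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one model-proposed explanation."""
    gene: str                  # placeholder label, not real case data
    expert_reviewers: int      # number of experts who reviewed the candidate
    expert_consensus: bool     # reviewers agree on the interpretation
    acmg_classification: str   # label assigned under the ACMG/AMP framework
    lab_confirmed: bool        # confirmed by laboratory testing
    returned_to_family: bool   # result returned to the family


def counts_as_diagnosis(c: Candidate) -> bool:
    """Return True only if every criterion named in the article is met."""
    return (
        c.expert_reviewers >= 2                      # "at least two experts"
        and c.expert_consensus
        and c.acmg_classification == "pathogenic"    # assumption: exact label only
        and c.lab_confirmed
        and c.returned_to_family
    )


confirmed = Candidate("GENE_A", 2, True, "pathogenic", True, True)
pending = Candidate("GENE_B", 2, True, "pathogenic", False, False)
print(counts_as_diagnosis(confirmed), counts_as_diagnosis(pending))
```

Writing the gate as a single conjunction makes the point of the workflow explicit: a model hypothesis, however well argued, contributes nothing to the diagnostic count until every human-controlled step has been completed.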
Testing the Workflow on Known Cases
Before tackling unsolved cases, the team refined the workflow on cases with established diagnoses. Across 51 cases spanning varied rare conditions, the model recovered the correct gene and variant in duplicate runs for 48. Among 57 neuromuscular cases, it succeeded in 45. For 15 long-read genome cases, it named the correct gene in all 15 and both disease-causing alleles in 12.
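Expressed as percentages, the benchmark counts above are easier to compare. The short script below only does the arithmetic on the figures reported in this article; the labels and the `recovery_rate` helper are illustrative names, not terms from the study.

```python
def recovery_rate(successes: int, total: int) -> float:
    """Percentage of cases in which the model recovered the known answer."""
    return 100 * successes / total


# (label, successes, total), copied from the counts reported in the article
results = [
    ("varied rare conditions (gene and variant)", 48, 51),
    ("neuromuscular cases", 45, 57),
    ("long-read genomes (correct gene)", 15, 15),
    ("long-read genomes (both alleles)", 12, 15),
]

for label, successes, total in results:
    print(f"{label}: {successes}/{total} = {recovery_rate(successes, total):.1f}%")
```

This works out to roughly 94.1%, 78.9%, 100.0%, and 80.0% respectively.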
Model confidence scores tracked with accuracy: a mean minimum of 85.6 for consistent correct calls versus 42.1 for incorrect ones. These scores were not calibrated probabilities but helped experts prioritize candidates.
Surfacing Hidden Diagnoses
Applied to four unsolved cohorts—neurodevelopmental, neuromuscular, early psychosis, and sudden pediatric death—the workflow yielded 18 diagnoses. Seven were rediscoveries, where known pathogenic variants had been missed due to fragmented records. This highlights the operational challenge of synthesizing data across silos.
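The headline yield follows from the counts in this article, taking the 376 reanalyzed cases as the denominator. That choice is an assumption on our part, though it reproduces the reported 4.8%. The variable names below are illustrative.

```python
total_cases = 376     # de-identified cases reanalyzed
diagnoses = 18        # diagnoses confirmed after expert review
rediscoveries = 7     # known pathogenic variants previously missed

yield_pct = 100 * diagnoses / total_cases
rediscovery_pct = 100 * rediscoveries / diagnoses
other_diagnoses = diagnoses - rediscoveries

print(f"Added diagnostic yield: {yield_pct:.1f}%")
print(f"Rediscoveries: {rediscoveries} of {diagnoses} ({rediscovery_pct:.1f}%)")
print(f"Diagnoses that were not rediscoveries: {other_diagnoses}")
```

By this arithmetic, rediscoveries account for about 39% of the diagnoses, leaving 11 that were not rediscoveries.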
In one early-psychosis case, the model inferred a structural variant not in the input data. It connected low-quality calls on chromosome 22 with cardiac, immune, and psychiatric features, hypothesizing a 22q11.2 deletion linked to DiGeorge syndrome. Follow-up sequencing confirmed it.
In some cases the model surfaced two genes that together better explained a complex presentation, such as LAMA2 and FOXP1 in one case, or a digenic explanation involving TTN and SRPK3 in another.
From Scattered Clues to Testable Hypotheses
The model also identified a possible novel mechanism for vitiligo. In a neurodevelopmental case, it highlighted an 11-amino-acid deletion in S1PR1, a receptor involved in immune cell movement. It integrated evidence suggesting the deletion could alter signaling, reducing pigment production while helping immune cells persist in skin.
This proposed S1PR1-vitiligo link requires experimental validation but shows AI's power to translate scattered findings into concrete hypotheses. The team also saw possible phenotype expansion: damaging variants in HSPB8 and CDK13 didn't perfectly match known disorders, hinting at broader clinical spectra.
Limits and Next Steps
The study was retrospective, cohorts were heterogeneous, and reviewers weren't blinded to model confidence. Researchers didn't measure time saved, cost, or false-positive workload. The model didn't systematically evaluate structural variants, repeat expansions, or mosaicism.
Every result passed through human adjudication. The model widened the search and focused expert analysis; it didn't decide what to return to families. Broader deployment will need privacy, security, and regulatory safeguards.
Prospective studies should compare AI-assisted reanalysis with standard practice on diagnostic yield, time, effort, and cost. Newer models like GPT‑Rosalind may offer deeper life-sciences capabilities but require separate evaluation.
The Manton Center will lead next-stage work through an OpenAI Foundation grant, aiming to build a platform-agnostic, low-cost genetics AI copilot. The long-term promise is not AI replacing diagnosis, but helping specialists identify evidence worth investigating—so unanswered questions don't stay that way forever.