1. Introduction: A Breakthrough Decades in the Making
For the more than 300 million people worldwide living with rare diseases, the path to diagnosis has historically been a harrowing journey. Patients often endure an average of five years of repeated referrals, misdiagnoses, and unnecessary procedures before receiving an accurate diagnosis, a period researchers call the “diagnostic odyssey.” Now, an artificial intelligence system developed by Chinese researchers has outperformed experienced physicians in a head-to-head test of rare disease diagnosis.
Published in the prestigious journal Nature in February 2026, the DeepRare system represents a paradigm shift in medical AI. Developed through a collaboration between Shanghai Jiao Tong University School of Medicine’s Xinhua Hospital and the Hunan Children’s Hospital affiliated with Xiangya School of Medicine, this multi-agent AI system achieved a milestone that could reshape how rare diseases are diagnosed globally.

2. The Rare Disease Diagnosis Crisis
Rare diseases—defined as conditions affecting fewer than 1 in 2,000 people—collectively impact more than 300 million people worldwide, with over 7,000 distinct disorders identified to date. Approximately 80% of these conditions have genetic origins. Despite their collective prevalence, rare diseases present unique diagnostic challenges that have long frustrated clinicians and devastated families.
The clinical heterogeneity of rare diseases means that individual conditions may manifest with vastly different symptoms across patients. A single rare disease might present with neurological symptoms in one patient and cardiac issues in another, making pattern recognition extremely difficult for even experienced physicians. Furthermore, the low individual prevalence of each condition means that most clinicians encounter only a handful of cases during their entire careers, if any at all.
This diagnostic crisis has profound consequences. Studies show that patients with rare diseases visit an average of seven physicians before receiving an accurate diagnosis. During this odyssey, patients undergo repeated unnecessary tests and procedures, face significant psychological burden from uncertainty, and often miss critical windows for early intervention that could significantly improve outcomes.

3. DeepRare: A Multi-Agent Diagnostic Revolution
DeepRare takes a fundamentally different approach to rare disease diagnosis. Rather than relying on traditional supervised learning models that require massive labeled datasets—an impossibility for ultra-rare conditions—the system uses a sophisticated multi-agent architecture powered by large language models.
The system’s three-tier architecture, inspired by Anthropic’s Model Context Protocol, comprises:
- Central Host: A reasoning-enhanced large language model (defaulting to DeepSeek-V3) equipped with a memory bank. It orchestrates the diagnostic workflow and synthesizes the evidence collected along the way.
- Specialized Agent Servers: Multiple agents handling specific tasks including phenotype analysis, genotype interpretation, clinical data normalization, and knowledge retrieval from diverse medical databases.
- Heterogeneous Medical Knowledge Sources: Integration of over 40 specialized tools connecting to research articles, clinical guidelines, patient case repositories, and authoritative medical databases.
What sets DeepRare apart is its ability to process multi-modal patient data—free-text clinical descriptions, structured Human Phenotype Ontology (HPO) terms, and raw genetic sequencing data from VCF files—and generate ranked diagnostic hypotheses with transparent reasoning chains that clinicians can verify against primary medical literature.
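The three tiers described above can be pictured as a small program. What follows is a minimal sketch under stated assumptions, not DeepRare's actual code: the class names `Host` and `AgentServer`, the toy lookup functions, and the evidence strings are all hypothetical, and the real system relies on a large language model for reasoning rather than the simple evidence collection shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A "tool" is any callable that maps a query to a list of evidence strings.
# Hypothetical stand-in for a medical knowledge source.
Tool = Callable[[str], List[str]]


@dataclass
class AgentServer:
    """Hypothetical agent that wraps the tools for one task."""
    name: str
    tools: Dict[str, Tool]

    def run(self, query: str) -> List[str]:
        # Call every tool and tag each result with where it came from.
        evidence = []
        for tool_name, tool in self.tools.items():
            for item in tool(query):
                evidence.append(f"{self.name}/{tool_name}: {item}")
        return evidence


@dataclass
class Host:
    """Hypothetical central host: dispatches to agents, keeps a memory bank."""
    agents: List[AgentServer]
    memory: List[str] = field(default_factory=list)

    def collect(self, query: str) -> List[str]:
        # Gather evidence from every agent and store it in memory.
        for agent in self.agents:
            self.memory.extend(agent.run(query))
        return self.memory


# Toy knowledge sources (assumptions for illustration only).
def toy_case_lookup(query: str) -> List[str]:
    return [f"similar case mentions '{query}'"]


def toy_guideline_lookup(query: str) -> List[str]:
    return [f"guideline section on '{query}'"]


host = Host(agents=[
    AgentServer("phenotype", {"cases": toy_case_lookup}),
    AgentServer("knowledge", {"guidelines": toy_guideline_lookup}),
])
evidence = host.collect("HP:0001250")  # an HPO term ID used as example input
for line in evidence:
    print(line)
```

Tagging each piece of evidence with its agent and tool is one simple way to keep the reasoning traceable, which is the property the article emphasizes.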

4. Clinical Validation at Leading Chinese Hospitals
The research team validated the system on 6,401 clinical cases spanning 2,919 rare diseases across 14 medical specialties. The evaluation datasets were sourced from seven public databases and two major Chinese clinical centers, representing diverse populations across Asia, North America, and Europe.
Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine contributed 975 cases, including 168 with complete whole-exome sequencing (WES) data. The Hunan Children’s Hospital contributed 162 pediatric cases with full genetic testing results. These in-house datasets, containing real-world clinical data from active medical practice, provided crucial validation of the system’s practical applicability.
The diversity of evaluation datasets—ranging from well-documented literature cases to challenging real-world clinical presentations—demonstrates the robustness of DeepRare across varying diagnostic difficulty levels. Cases were categorized by source: research papers (typically easier due to clear documentation), case reports (moderate difficulty), and direct clinical encounters (most challenging and representative of real-world application).

5. Performance Milestone: Surpassing Human Experts
The most striking finding from the research is DeepRare’s performance against human experts. In a direct comparison study using 163 clinical cases from Xinhua Hospital, DeepRare was pitted against five experienced physicians, each with more than a decade of clinical practice in rare diseases. Both the physicians and DeepRare received identical inputs: structured HPO terms extracted from free-text outpatient narratives.
The results were striking. DeepRare achieved a Recall@5 of 78.5% (the correct diagnosis appeared among its top five ranked candidates in 78.5% of cases), significantly outperforming the clinicians’ average of 65.6%. At Recall@1, where the correct diagnosis must be the top recommendation, DeepRare scored 64.4% compared to the physicians’ 54.6%. This is presented as the first documented instance of a computational system surpassing expert physician performance in rare disease phenotyping and diagnosis.
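To make the Recall@k figures concrete, here is a minimal sketch of how such a metric is commonly computed: a case counts as a hit if the true diagnosis appears among the top k ranked candidates. The function name `recall_at_k` and the toy labels are illustrative assumptions, not taken from the study's evaluation code.

```python
from typing import Sequence


def recall_at_k(ranked_predictions: Sequence[Sequence[str]],
                truths: Sequence[str],
                k: int) -> float:
    """Fraction of cases whose true diagnosis is within the top-k candidates."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("need one ranked list per case")
    if not truths:
        raise ValueError("need at least one case")
    hits = sum(
        1 for preds, truth in zip(ranked_predictions, truths)
        if truth in preds[:k]
    )
    return hits / len(truths)


# Toy data: four cases, each with a ranked candidate list (made-up labels).
predictions = [
    ["A", "B", "C"],  # truth ranked first
    ["B", "A", "C"],  # truth ranked second
    ["C", "D", "A"],  # truth ranked third
    ["D", "E", "F"],  # truth missing
]
truths = ["A", "A", "A", "A"]

print(recall_at_k(predictions, truths, 1))  # 0.25
print(recall_at_k(predictions, truths, 3))  # 0.75
```

With the 163 comparison cases, for example, a Recall@1 of 64.4% would correspond to roughly 105 cases in which the top-ranked diagnosis was correct.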
In HPO-based evaluations across all benchmarks, DeepRare achieved an average Recall@1 of 57.18%, surpassing the second-best method (Claude-3.7-Sonnet-thinking) by a substantial margin of 23.79%. At Recall@3, the system achieved 65.25%, outperforming competitors by 18.65%.
The system demonstrated particularly impressive results in specific datasets: 78% Recall@1 on the RareBench-MME evaluation (surpassing the second-best by 30%) and 74% on the MyGene2 evaluation (surpassing competitors by 35%).

6. Genetic Data Integration Transforms Accuracy
One of DeepRare’s most powerful capabilities is its ability to integrate genetic sequencing data with clinical phenotypes. When researchers combined HPO terms with whole-exome sequencing data, Recall@1 rose sharply: from 39.9% to 69.1% in the Xinhua Hospital dataset and from 33.3% to 63.6% in the Hunan Children’s Hospital dataset, gains of 29.2 and 30.3 percentage points respectively.
The system also outperformed Exomiser, a widely used bioinformatics tool specifically designed for genetic variant interpretation. With combined HPO and genetic data, DeepRare achieved 69.1% Recall@1 compared to Exomiser’s 55.9% on the Xinhua cases, and 63.6% versus 58.0% on the Hunan cases.
This multi-modal capability is particularly significant because genetic testing has become increasingly common in rare disease workups. However, interpreting raw genomic data remains challenging for clinicians without specialized genetics training. DeepRare bridges this gap by automatically processing VCF files and integrating variant analysis with clinical phenotypes to generate more accurate diagnostic hypotheses.
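For readers unfamiliar with VCF, the format is a tab-separated text file whose data lines begin with eight fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO. The sketch below reads those columns from a tiny made-up example. It is an illustration of the file format only, not DeepRare's genotype pipeline; the `Variant` class, the `parse_vcf_text` helper, and the sample variants are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alts: List[str]      # ALT may list several alleles, comma-separated
    filter_status: str   # "PASS" or the name of a failed filter


def parse_vcf_text(text: str) -> List[Variant]:
    """Parse the fixed columns of VCF data lines, skipping header lines."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # '##' meta lines and the '#CHROM' column header
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 columns, got {len(fields)}")
        chrom, pos, _id, ref, alt, _qual, filt, _info = fields[:8]
        variants.append(Variant(chrom, int(pos), ref, alt.split(","), filt))
    return variants


# Made-up example content; positions and alleles carry no clinical meaning.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t50\tPASS\t.",
    "2\t67890\t.\tC\tT,G\t30\tq10\t.",
])

variants = parse_vcf_text(EXAMPLE_VCF)
passing = [v for v in variants if v.filter_status == "PASS"]
print(len(variants), len(passing))  # 2 1
```

A real workup would go on to annotate and prioritize the variants, which is the step where linking them to the patient's phenotypes matters.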

7. Transparent Reasoning: Building Clinical Trust
Perhaps the most clinically significant feature of DeepRare is its transparent reasoning chain. Unlike black-box AI systems that provide diagnoses without explanation, DeepRare generates diagnostic hypotheses accompanied by explicit reasoning that references verifiable medical evidence—research articles, clinical guidelines, and patient case reports.
To validate the reliability of these reasoning chains, the research team engaged ten associate chief physicians specializing in rare diseases to evaluate the system’s outputs on 180 randomly sampled cases. Each case was independently reviewed by three specialists.
The results showed an average reference accuracy of 95.4%, meaning that the medical evidence cited by DeepRare was both reliable and directly relevant to the diagnostic conclusions in nearly all cases. This high level of factual accuracy is crucial for clinical adoption, as it allows physicians to verify the AI’s reasoning against primary sources and builds trust in the system’s recommendations.
The system also incorporates a self-reflective loop that iteratively reassesses hypotheses, helping to reduce over-diagnosis and mitigate hallucinations—a common problem with large language models. If initial hypotheses don’t meet validation criteria, the system can revisit earlier steps to gather additional patient-specific evidence.
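The self-reflective loop can be sketched as a retry cycle: propose hypotheses, check them against a validation criterion, and gather more evidence if none pass. The version below is a simplified illustration under assumptions. The function `diagnose_with_reflection`, the score threshold, the round limit, and the toy scoring functions are all hypothetical and do not reproduce the system's actual validation criteria.

```python
from typing import Callable, Dict, List, Tuple


def diagnose_with_reflection(
    propose: Callable[[List[str]], Dict[str, float]],
    gather_more: Callable[[List[str]], List[str]],
    evidence: List[str],
    threshold: float = 0.5,   # hypothetical validation criterion
    max_rounds: int = 3,      # hypothetical cap on reassessment rounds
) -> Tuple[List[str], int]:
    """Return accepted hypotheses (best first) and the round they passed in."""
    evidence = list(evidence)  # work on a copy of the caller's list
    for round_no in range(1, max_rounds + 1):
        scores = propose(evidence)
        accepted = sorted(
            (d for d, s in scores.items() if s >= threshold),
            key=lambda d: -scores[d],
        )
        if accepted:
            return accepted, round_no
        # Nothing passed validation: go back and collect more evidence.
        evidence.extend(gather_more(evidence))
    return [], max_rounds


# Toy stand-ins: confidence in "toy disease X" grows with the evidence count.
def toy_propose(evidence: List[str]) -> Dict[str, float]:
    return {"toy disease X": len(evidence) / 4, "toy disease Y": 0.1}


def toy_gather(evidence: List[str]) -> List[str]:
    return [f"extra finding {len(evidence)}"]


result, rounds = diagnose_with_reflection(toy_propose, toy_gather, ["finding 0"])
print(result, rounds)  # ['toy disease X'] 2
```

Returning an empty list when no hypothesis passes, rather than forcing a best guess, is one way such a loop can avoid over-diagnosis.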

8. Global Impact and Future Implications
The implications of DeepRare extend far beyond the two Chinese hospitals where it was validated. The system has been deployed as a user-friendly web application serving as a diagnostic copilot for rare disease physicians, and its open architecture means it can be adapted to healthcare settings worldwide.
For healthcare systems, particularly those in underserved regions with limited access to specialist physicians, DeepRare offers the potential to democratize rare disease diagnosis. The system can serve as an expert consultant, helping general practitioners and non-specialists identify rare conditions that would otherwise go undiagnosed for years.
The research team analyzed performance across 14 medical specialties—ranging from blood and circulation to reproductive systems—and found consistent superiority across almost all categories. The system performed best in kidney and urinary system disorders (66% accuracy) and showed strong results in endocrine (60%) and digestive system (49%) categories, demonstrating its broad applicability.
Looking forward, the multi-agent architecture provides a template for how AI systems can be designed to address complex medical challenges that require integration of diverse knowledge sources. As medical knowledge continues to expand—with approximately 260 to 280 new rare genetic diseases discovered annually according to the International Rare Diseases Research Consortium—systems like DeepRare that can efficiently incorporate new information will become increasingly valuable.
The DeepRare achievement represents not just a technological breakthrough, but a fundamental shift in how AI can be deployed in clinical practice. By combining the reasoning capabilities of large language models with transparent, verifiable evidence chains, the system demonstrates that AI can be both highly accurate and clinically trustworthy—a combination that has long been the holy grail of medical artificial intelligence.
For the millions of patients currently navigating their own diagnostic odysseys, DeepRare offers hope that the five-year journey to diagnosis may soon become a thing of the past.

Sources and References
- Zhao W, Wu C, Fan Y, et al. “An agentic system for rare disease diagnosis with traceable reasoning.” Nature 651, 775–784 (2026).
- Nguengang Wakap S, et al. “Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database.” European Journal of Human Genetics 28, 165–173 (2020).
- Schieppati A, Henter JI, Daina E, Aperia A. “Why rare diseases are an important medical and social issue.” The Lancet 371, 2039–2041 (2008).
- Smedley D, et al. “Next-generation diagnostics and disease-gene discovery with the Exomiser.” Nature Protocols 10, 2004–2015 (2015).
- Mao X, et al. “A phenotype-based AI pipeline outperforms human experts in differentially diagnosing rare diseases using EHRs.” npj Digital Medicine 8, 68 (2025).