Illumina has unveiled what it calls the largest genome-wide genetic perturbation dataset ever created, aimed at boosting AI-driven drug discovery across the pharmaceutical industry. The new Billion Cell Atlas represents the first phase of a planned five-billion-cell resource that the company expects to complete over the next three years. Illumina says the project will eventually become the most comprehensive map of human disease biology ever generated.
The initiative is being built in collaboration with AstraZeneca, Merck, and Eli Lilly, which are participating as founding partners. Together, the companies are generating a curated set of cell lines to validate drug targets, train AI models at scale, and study biological mechanisms that have historically been difficult to analyze.
CEO: Scaling AI for Drug Discovery
Jacob Thaysen, PhD, Illumina’s CEO, said the scale of the initiative is meant to transform how AI is applied in early drug discovery. “We believe the cell atlas is a key development that will enable us to significantly scale AI for drug discovery,” he said. “We are building an unparalleled resource for training the next generation of AI models for precision medicine and drug target identification.”
How Pharma Partners Plan to Use the Atlas
Merck intends to leverage the dataset to advance its precision medicine approaches across drug discovery pipelines and AI/ML models. According to Iya Khalil, PhD, the company aims to train proprietary foundation models and develop virtual cell models that improve disease-indication prediction. “By harnessing advanced genomic patient datasets, Merck scientists are building and leveraging AI models grounded in real biological variation—not just literature text,” Khalil said.
AstraZeneca sees the Atlas as a way to connect genetic signals to actionable biology. Slavé Petrovski, PhD, emphasized that translating genetic information into a clear understanding of disease mechanisms is a core R&D challenge. “By showing how specific genetic perturbations play out inside human cells, we can turn genetic signals into mechanistic biology that can be directly studied,” he said.
Eli Lilly highlighted the importance of scale for next-generation AI models. Ruth Gimeno, PhD, noted that comprehensive datasets spanning diverse cell types provide the foundation needed to generate meaningful insights into human disease.
Inside the Billion Cell Atlas
The Atlas tracks how one billion individual cells respond to CRISPR-based perturbations across more than 200 disease-relevant cell lines, covering immune, oncologic, cardiometabolic, neurological, and rare genetic conditions. By systematically turning genes on or off, researchers can observe functional effects at single-cell resolution, helping characterize mechanisms, validate targets, and identify new indications.
The project is the first major data product from Illumina’s new BioInsight business unit. Using its Single Cell 3’ RNA prep platform, the company captures millions of cells per experiment and expects roughly 20 petabytes of transcriptomic data within the first year. Data are processed through the DRAGEN pipeline and hosted on Illumina’s Connected Analytics cloud platform for large-scale analysis.
Illumina says the Billion Cell Atlas is the first step in building multi-billion-cell datasets over time, ultimately contributing to the five-billion-cell resource announced last year.


