SparkBeagle: Scalable Genotype Imputation from Distributed Whole-Genome Reference Panels in the Cloud

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review


Massive whole-genome genotype reference panels now provide accurate and fast genotyping by imputation for high-resolution genome-wide association (GWA) studies. Imputation-assisted genotyping can increase the genomic coverage of genotypes and thus satisfy the resolution required in comprehensive GWA studies in a cost-effective manner. However, the imputation of missing genotypes from large reference panels is a compute-intensive process that requires high-performance computing (HPC). Although HPC uses extremely distributed and parallel computing, current imputation tools, and existing algorithms have not been developed to fully exploit the power of distributed computing. To this end, we have developed SparkBeagle, a scalable, fast, and accurate distributed genotype imputation tool based on popular Beagle software. SparkBeagle is designed for HPC and cloud computing environments and it is implemented on top of the Apache Spark distributed computing framework. We have carried out scalability experiments by imputing 64,976,316 variants of 2504 samples from the 1000 Genomes reference panel in the cloud. SparkBeagle shows near-linear scalability while increasing the number of computing nodes. A speedup of 30x was achieved with 40 nodes. The imputation time of the whole data set decreased from 565 minutes to 18 minutes compared to a single node parallel execution. Near identical imputation accuracy was measured in the concordance analysis between the original Beagle and the distributed SparkBeagle tool.
Original languageEnglish
Title of host publicationBCB '20: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Number of pages8
Publication dateSept 2020
Article number97
ISBN (Electronic)978-1-4503-7964-9
Publication statusPublished - Sept 2020
MoE publication typeA4 Article in conference proceedings
EventACM Conference on Bioinformatics, Computational Biology, and Health Informatics - Virtual
Duration: 21 Sept 201924 Sept 2020
Conference number: 11

Fields of Science

  • 113 Computer and information sciences

Cite this