Addressing pandemic-wide systematic errors in the SARS-CoV-2 phylogeny

Martin Hunt, Angie S Hinrichs, Daniel Anderson, Lily Karim, Bethany L Dearlove, Jeff Knaggs, Bede Constantinides, Philip W Fowler, Gillian Rodger, Teresa L Street, Sheila F Lumley, Hermione Webster, Theo Sanderson, Christopher Ruis, Nicola De Maio, Lucas N Amenga-Etego, Dominic SY Amuzu, Martin Avaro, Gordon A Awandare, Reuben Ayivor-Djanie, Matthew Bashton, Elizabeth M Batty, Yaw Bediako, Denise De Belder, Estefania Benedetti, Andreas Bergthaler, Stefan A Boers, Josefina Campos, Rosina Afua Ampomah Carr, Facundo Cuba, Maria Elena Dattero, Wanwissa Dejnirattisai, Alexander T Dilthey, Kwabena Obeng Duedu, Lukas Endler, Ilka Engelmann, Ngiambudulu M Francisco, Jonas Fuchs, Etienne Gnimpieba Z., Soraya Groc, Jones Gyamfi, Dennis Heemskerk, Torsten Houwaart, Nei-yuan Hsiao, Matthew Huska, Martin Hoelzer, Arash Iranzadeh, Hanna Jarva, Chandima Jeewandara, Bani Jolly, Rageema Joseph, Ravi Kant, Karrie Ko Kwan Ki, Satu Kurkela, Maija Lappalainen, Marie Lataretu, Chang Liu, Gathsaurie Neelika Malavige, Tapfumanei Mashe, Juthathip Mongkolsapaya, Brigitte Montes, Jose Arturo Molina-Mora, Collins M Morang'a, Bernard Mvula, Niranjan Nagarajan, Andrew Nelson, Joyce Mwongeli Ngoi, Joana Paula da Paixao, Marcus Panning, Tomas Poklepovich, Peter Kojo Quashie, Diyanath Ranasinghe, Mara Russo, James E San, Nicholas D Sanderson, Vinod Scaria, Gavin Screaton, Tarja Sironen, Abay Sisay, Darren Smith, Teemu Smura, Piyada Supasa, Chayaporn Suphavilai, Jeremy Swann, Houriiyah Tegally, Bryan Tegomoh, Olli Vapalahti, Andreas Walker, Robert Wilkinson, Carolyn Williamson, IMSSC2 Laboratory Network Consortium, Tulio de Oliveira, Timothy EA Peto, Derrick Crook, Russ Corbett-Detig, Zamin Iqbal

Research output: Contribution to journal, Article, Scientific

Abstract

The SARS-CoV-2 genome occupies a unique place in infection biology: it is the most highly sequenced genome on earth (making up over 20% of public sequencing datasets), with fine-scale information on sampling date and geography, and has been subject to unprecedentedly intense analysis. As a result, these phylogenetic data are an incredibly valuable resource for science and public health. However, the vast majority of the data was sequenced by tiling amplicons across the full genome, with amplicon schemes that changed over the pandemic as mutations in the viral genome interacted with primer binding sites. In combination with the disparate set of genome assembly workflows and the lack of consistent quality control (QC) processes, the current genomes have many systematic errors that have evolved with the virus and amplicon schemes. These errors have significant impacts on the phylogeny, and therefore over the last few years, many thousands of hours of researchers' time have been spent "eyeballing" trees, looking for artefacts, and then patching the tree. Given the huge value of this dataset, we set out to reprocess the complete set of public raw sequence data in a rigorous amplicon-aware manner, and build a cleaner phylogeny. Here we provide a global tree of 3,960,704 samples, built from a consistently assembled set of high-quality consensus sequences from all available public data as of March 2023, viewable at https://viridian.taxonium.org. Each genome was constructed using a novel assembly tool called Viridian (https://github.com/iqbal-lab-org/viridian), developed specifically to process amplicon sequence data, eliminating artefactual errors and masking the genome at low-quality positions. We provide simulation and empirical validation of the methodology, and quantify the improvement in the phylogeny. Phase 2 of our project will address the fact that the data in the public archives is heavily geographically biased towards the Global North. We have therefore contributed new raw data to ENA/SRA from many countries including Ghana, Thailand, Laos, Sri Lanka, India, Argentina and Singapore. We will incorporate these, along with all public raw data submitted between March 2023 and the current day, into an updated set of assemblies and an updated phylogeny. We hope the tree, consensus sequences and Viridian will be a valuable resource for researchers.

Competing Interest Statement: Gavin Screaton sits on the GSK Vaccines Scientific Advisory Board, consults for AstraZeneca, and is a founding member of RQ Biotechnology.

Original language: English
Journal: bioRxiv
DOI
Status: Published - 30 Apr 2024
MoE publication type: B1 Article in a scientific journal
