Π-cyc

A Reference-free SNP Discovery Application using Parallel Graph Search

Research output: Contribution to journalArticleGeneral public

Abstract

Motivation: Working with a large number of genomes simultaneously is of great interest in genetic population and comparative genomics research. Bubbles discovery in multi-genomes coloured de bruijn graph for de novo genome assembly is a problem that can be translated to cycles enumeration in graph theory. Cycle enumerations algorithms in big and complex de Bruijn graphs are time consuming. Specialised fast algorithms for efficient bubble search are needed for coloured de bruijn graph variant calling applications. In coloured de Bruijn graphs, bubble paths coverages are used in downstream variants calling analysis. Results: In this paper, we introduce a fast parallel graph search for different K-mer cycle sizes. Coloured path coverages are used for SNP prediction. The graph search method uses a combined multi-node and multi-core design to speeds up cycles enumeration. The search algorithm uses an index extracted from the raw assembly of a coloured de Bruijn graph stored in a hash table. The index is distributed across different CPU-cores, in a shared memory HPC compute node, to build undirected subgraphs then search independently and simultaneously specific cycle sizes. This same index can also be split between several HPC compute nodes to take advantage of as many CPU-cores available to the user. The local neighbourhood parallel search approach reduces the graph's complexity and facilitate cycles search of a multi-colour de Bruijn graph. The search algorithm is incorporated into -cyc application and tested on a number of Schizosaccharomyces Pombe genomes.
Original languageEnglish
JournalarXiv.org
ISSN2331-8422
DOIs
Publication statusUnpublished - 29 Jan 2019
MoE publication typeE1 Popularised article, newspaper article

Fields of Science

  • 113 Computer and information sciences
  • Bionformatics
  • SNP calling
  • De Bruijn graph
  • population genomics

Cite this

@article{0eb89709b4fe4fe59c1c7c27da3ece25,
title = "Π-cyc: A Reference-free SNP Discovery Application using Parallel Graph Search",
abstract = "Motivation: Working with a large number of genomes simultaneously is of great interest in genetic population and comparative genomics research. Bubbles discovery in multi-genomes coloured de bruijn graph for de novo genome assembly is a problem that can be translated to cycles enumeration in graph theory. Cycle enumerations algorithms in big and complex de Bruijn graphs are time consuming. Specialised fast algorithms for efficient bubble search are needed for coloured de bruijn graph variant calling applications. In coloured de Bruijn graphs, bubble paths coverages are used in downstream variants calling analysis. Results: In this paper, we introduce a fast parallel graph search for different K-mer cycle sizes. Coloured path coverages are used for SNP prediction. The graph search method uses a combined multi-node and multi-core design to speeds up cycles enumeration. The search algorithm uses an index extracted from the raw assembly of a coloured de Bruijn graph stored in a hash table. The index is distributed across different CPU-cores, in a shared memory HPC compute node, to build undirected subgraphs then search independently and simultaneously specific cycle sizes. This same index can also be split between several HPC compute nodes to take advantage of as many CPU-cores available to the user. The local neighbourhood parallel search approach reduces the graph's complexity and facilitate cycles search of a multi-colour de Bruijn graph. The search algorithm is incorporated into -cyc application and tested on a number of Schizosaccharomyces Pombe genomes.",
keywords = "113 Computer and information sciences, Bionformatics, SNP calling, De Bruijn graph, population genomics",
author = "Reda Younsi and Jing Tang and Holm, {Liisa Ulrika Teodora}",
year = "2019",
month = "1",
day = "29",
doi = "https://arxiv.org/abs/1809.06700",
language = "English",
journal = "arXiv.org",
issn = "2331-8422",
publisher = "Cornell University",

}

TY - JOUR

T1 - Π-cyc

T2 - A Reference-free SNP Discovery Application using Parallel Graph Search

AU - Younsi, Reda

AU - Tang, Jing

AU - Holm, Liisa Ulrika Teodora

PY - 2019/1/29

Y1 - 2019/1/29

N2 - Motivation: Working with a large number of genomes simultaneously is of great interest in genetic population and comparative genomics research. Bubbles discovery in multi-genomes coloured de bruijn graph for de novo genome assembly is a problem that can be translated to cycles enumeration in graph theory. Cycle enumerations algorithms in big and complex de Bruijn graphs are time consuming. Specialised fast algorithms for efficient bubble search are needed for coloured de bruijn graph variant calling applications. In coloured de Bruijn graphs, bubble paths coverages are used in downstream variants calling analysis. Results: In this paper, we introduce a fast parallel graph search for different K-mer cycle sizes. Coloured path coverages are used for SNP prediction. The graph search method uses a combined multi-node and multi-core design to speeds up cycles enumeration. The search algorithm uses an index extracted from the raw assembly of a coloured de Bruijn graph stored in a hash table. The index is distributed across different CPU-cores, in a shared memory HPC compute node, to build undirected subgraphs then search independently and simultaneously specific cycle sizes. This same index can also be split between several HPC compute nodes to take advantage of as many CPU-cores available to the user. The local neighbourhood parallel search approach reduces the graph's complexity and facilitate cycles search of a multi-colour de Bruijn graph. The search algorithm is incorporated into -cyc application and tested on a number of Schizosaccharomyces Pombe genomes.

AB - Motivation: Working with a large number of genomes simultaneously is of great interest in genetic population and comparative genomics research. Bubbles discovery in multi-genomes coloured de bruijn graph for de novo genome assembly is a problem that can be translated to cycles enumeration in graph theory. Cycle enumerations algorithms in big and complex de Bruijn graphs are time consuming. Specialised fast algorithms for efficient bubble search are needed for coloured de bruijn graph variant calling applications. In coloured de Bruijn graphs, bubble paths coverages are used in downstream variants calling analysis. Results: In this paper, we introduce a fast parallel graph search for different K-mer cycle sizes. Coloured path coverages are used for SNP prediction. The graph search method uses a combined multi-node and multi-core design to speeds up cycles enumeration. The search algorithm uses an index extracted from the raw assembly of a coloured de Bruijn graph stored in a hash table. The index is distributed across different CPU-cores, in a shared memory HPC compute node, to build undirected subgraphs then search independently and simultaneously specific cycle sizes. This same index can also be split between several HPC compute nodes to take advantage of as many CPU-cores available to the user. The local neighbourhood parallel search approach reduces the graph's complexity and facilitate cycles search of a multi-colour de Bruijn graph. The search algorithm is incorporated into -cyc application and tested on a number of Schizosaccharomyces Pombe genomes.

KW - 113 Computer and information sciences

KW - Bionformatics

KW - SNP calling

KW - De Bruijn graph

KW - population genomics

U2 - https://arxiv.org/abs/1809.06700

DO - https://arxiv.org/abs/1809.06700

M3 - Article

JO - arXiv.org

JF - arXiv.org

SN - 2331-8422

ER -