Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era

Research output: Thesis › Doctoral Thesis › Collection of Articles

Abstract

This thesis is motivated by two important processes in bioinformatics: variation calling and haplotyping. The contributions range from basic algorithms for sequence analysis to the implementation of pipelines that deal with real data. Variation calling characterizes an individual's genome by identifying how it differs from a reference genome. It takes reads (small DNA fragments extracted from a biological sample) and aligns them to the reference to identify the genetic variants present in the donor's genome. A related procedure is haplotype phasing. Sexually reproducing organisms have their genome organized in two sets of chromosomes with equivalent functions. One set is inherited from the mother and the other from the father, and the elements of each set are called haplotypes. The haplotype phasing problem is, once genetic variants have been discovered, to attribute each of them to one of the two haplotypes.

The first problem we consider is how to efficiently index large collections of genomes. The Lempel-Ziv family of compression algorithms is a useful tool for this. We focus on two of its members, the RLZ and LZ77 algorithms. We analyze the first and propose some modifications to both, and finally develop a scalable index for large and repetitive collections. Then, using that index, we propose a novel pipeline for variation calling that replaces the single reference with thousands of references. We test our variation calling pipeline on a mutation-rich subsequence of a Finnish population genome. Our approach consistently outperforms the single-reference approach to variation calling.

The second part of this thesis revolves around the haplotype phasing problem. First, we propose a generalization of sequence alignment for diploid genomes. Next, we extend this model to offer a solution for the haplotype phasing problem in the family-trio setting (that is, when we know the variants present in an individual and in her parents). Finally, in the context of an existing read-based approach to haplotyping, we go back to basic algorithms and model the problem of pruning a set of reads aligned to a reference as an interval scheduling problem. We propose an exact solution that runs in subquadratic time and a 2-approximation algorithm that runs in linearithmic time.
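
As a rough illustration of the relative Lempel-Ziv (RLZ) idea named in the abstract, the sketch below greedily parses a target sequence into phrases copied from a reference sequence. It is not the thesis's index or implementation: the function names `rlz_factorize` and `rlz_decode`, the tuple-based phrase format, and the naive quadratic search are all assumptions made for this example. Practical RLZ implementations typically find the longest match with a suffix array or similar index over the reference.

```python
def rlz_factorize(reference: str, target: str):
    """Greedily parse `target` into phrases relative to `reference`.

    Each phrase is either ("copy", pos, length), meaning
    reference[pos:pos + length], or ("literal", ch) for a character
    that does not occur in the reference at all.
    Hypothetical helper for illustration; naive O(|reference| * |target|) search.
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Find the longest prefix of target[i:] that occurs in the reference.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            # No match: emit the character itself.
            phrases.append(("literal", target[i]))
            i += 1
        else:
            phrases.append(("copy", best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(reference: str, phrases) -> str:
    """Rebuild the target from the reference and its phrase list."""
    parts = []
    for phrase in phrases:
        if phrase[0] == "literal":
            parts.append(phrase[1])
        else:
            _, pos, length = phrase
            parts.append(reference[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    ref = "ACGTACGT"
    tgt = "ACGTTCGT"
    parsed = rlz_factorize(ref, tgt)
    print(parsed)
    print(rlz_decode(ref, parsed) == tgt)
```

When the target is very similar to the reference, as with genomes of the same species, the parse consists of a few long copy phrases, which is why this kind of compression suits large, repetitive collections.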
Original language: English
Awarding institution
  • University of Helsinki
Supervisors
  • Mäkinen, Veli, Supervisor
Award date: 9 Jun 2017
Place of publication: Helsinki
Publisher: University of Helsinki
Print ISBN: 978-951-51-3230-7
Electronic ISBN: 978-951-51-3231-4
Status: Published - 9 Jun 2017
MoE publication type: G5 Doctoral dissertation (article)

Fields of Science

  • 113 Computer and information sciences

Cite this

@phdthesis{7bf5d3ff8d78430b88990a6cf783c9a6,
title = "Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era",
keywords = "113 Computer and information sciences",
author = "Valenzuela, {Serra Daniel Alejandro}",
year = "2017",
month = "6",
day = "9",
language = "English",
isbn = "978-951-51-3230-7",
series = "Series of publications / Department of Computer Science, University of Helsinki. A.",
publisher = "University of Helsinki",
number = "3",
address = "Finland",
school = "University of Helsinki",

}

Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era. / Valenzuela, Serra Daniel Alejandro.

Helsinki : University of Helsinki, 2017. 74 p.

TY - THES

T1 - Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era

AU - Valenzuela, Serra Daniel Alejandro

PY - 2017/6/9

Y1 - 2017/6/9

KW - 113 Computer and information sciences

M3 - Doctoral Thesis

SN - 978-951-51-3230-7

T3 - Series of publications / Department of Computer Science, University of Helsinki. A.

PB - University of Helsinki

CY - Helsinki

ER -

Valenzuela SDA. Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era. Helsinki: University of Helsinki, 2017. 74 p. (Series of publications / Department of Computer Science, University of Helsinki. A.; 3).