The information contained in the genome of an organism, its DNA, is expressed through transcription of its genes to RNA, in quantities determined by many internal and external factors. Studying gene expression can therefore give valuable information for, e.g., clinical diagnostics. A common analysis workflow for RNA-sequencing (RNA-seq) data consists of mapping the sequencing reads to a reference genome, followed by transcript assembly and quantification based on these alignments. The advent of second-generation sequencing revolutionized the field by reducing sequencing costs 50,000-fold. Another revolution is now imminent, with third-generation sequencing platforms producing reads an order of magnitude longer. However, their higher error rate, higher cost and lower throughput compared to second-generation sequencing bring their own challenges. To compensate for the low throughput and high cost, hybrid approaches using both short second-generation and long third-generation reads have recently gathered interest.
The first part of this thesis focuses on the analysis of short-read RNA-seq data. As short-read mapping is already a well-researched field, we focus on giving a literature review of the topic. For transcript assembly we propose an approach, novel at the time of publication, that uses minimum-cost flows to cover a graph created from the read alignments with a set of paths of minimum cost under a given cost model. Various network-flow-based solutions were proposed in parallel to, as well as after, ours.
The second part, where the main contributions of this thesis lie, focuses on the analysis of long-read RNA-seq data. The driving idea of our research has been the Minimum Path Cover with Subpath Constraints (MPC-SC) model, where transcript assembly is modeled as a minimum path cover problem with the additional requirement that each chain of exons (subpath constraint) created from the long reads must be completely contained in some solution path. In addition to implementing this concept, we experimentally studied different approaches for finding the exon chains in practice. The evaluated approaches included aligning the long reads to a graph created from short-read alignments instead of to the reference genome, which led to our final contribution: extending a co-linear chaining algorithm from the setting of two sequences to that of a sequence and a directed acyclic graph.
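To make the path cover idea concrete, the following is a minimal sketch of the plain minimum path cover problem on a directed acyclic graph, without the subpath constraints of MPC-SC and without costs. It is not the algorithm of the thesis: it assumes paths may share nodes and uses the textbook reduction to bipartite matching on the transitive closure, and the function name `min_path_cover_size` and the toy graph are hypothetical.

```python
def min_path_cover_size(n, edges):
    """Size of a minimum path cover of a DAG with nodes 0..n-1.

    Paths may share nodes. Illustrative assumption: the answer equals
    n minus a maximum matching in the bipartite graph built from the
    transitive closure (Dilworth-style reduction), not a flow-based method.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    # reach[s] = nodes reachable from s by one or more edges
    reach = [set() for _ in range(n)]
    for s in range(n):
        stack = list(adj[s])
        while stack:
            v = stack.pop()
            if v not in reach[s]:
                reach[s].add(v)
                stack.extend(adj[v])

    # match_right[v] = left-side node currently matched to v, or -1
    match_right = [-1] * n

    def try_augment(u, seen):
        # Kuhn's augmenting-path search from left-side node u
        for v in reach[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matching = sum(try_augment(u, set()) for u in range(n))
    return n - matching


# Toy splicing-like graph: two start exons (0, 4) merge into exon 1,
# which then branches to two end exons (2, 3).
toy_edges = [(0, 1), (4, 1), (1, 2), (1, 3)]
cover_size = min_path_cover_size(5, toy_edges)
```

In the toy graph two paths suffice, for example 0, 1, 2 and 4, 1, 3, because both may pass through node 1. A subpath constraint such as "0, 1, 3 must appear in one path" would restrict which pairs of paths are acceptable, which is the part this sketch leaves out.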
|Date of award||20 December 2017|
|Status||Published - 20 December 2017|
|MoEC publication type||G5 Doctoral dissertation (article)|
- 113 Computer and information sciences