Storage and retrieval of individual genomes

Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, Niko Välimäki

Tutkimustuotos: Artikkeli kirjassa/raportissa/konferenssijulkaisussaKonferenssiartikkeliTieteellinenvertaisarvioitu

Kuvaus

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on suffix tree of Human Genome. However, this reduction factor remains constant when more sequences are added to the collection.

We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n.

We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.
Alkuperäiskielienglanti
OtsikkoResearch in Computational Molecular Biology : 13th Annual International Conference, RECOMB 2009
ToimittajatSerafim Batzoglou
Sivumäärä17
KustantajaSpringer
Julkaisupäivä2009
Sivut121-137
ISBN (painettu)978-3-642-02007-0
DOI - pysyväislinkit
TilaJulkaistu - 2009
OKM-julkaisutyyppiA4 Artikkeli konferenssijulkaisuussa
TapahtumaAnnual International Conference on Research in Computational Molecular Biology - Tucson, Arizona, Yhdysvallat (USA)
Kesto: 18 toukokuuta 200921 toukokuuta 2009
Konferenssinumero: 13

Julkaisusarja

NimiLecture Notes in Computer Science
Numero5541

Lisätietoja


Volume: 5541
Proceeding volume:

Tieteenalat

  • 113 Tietojenkäsittely- ja informaatiotieteet

Lainaa tätä

Mäkinen, V., Navarro, G., Sirén, J., & Välimäki, N. (2009). Storage and retrieval of individual genomes. teoksessa S. Batzoglou (Toimittaja), Research in Computational Molecular Biology: 13th Annual International Conference, RECOMB 2009 (Sivut 121-137). (Lecture Notes in Computer Science; Nro 5541). Springer. https://doi.org/10.1007/978-3-642-02008-7_9
Mäkinen, Veli ; Navarro, Gonzalo ; Sirén, Jouni ; Välimäki, Niko. / Storage and retrieval of individual genomes. Research in Computational Molecular Biology: 13th Annual International Conference, RECOMB 2009. Toimittaja / Serafim Batzoglou. Springer, 2009. Sivut 121-137 (Lecture Notes in Computer Science; 5541).
@inproceedings{ef6db02dab1044468b3b8f3673930326,
title = "Storage and retrieval of individual genomes",
abstract = "A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on suffix tree of Human Genome. However, this reduction factor remains constant when more sequences are added to the collection.We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n.We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.",
keywords = "113 Computer and information sciences",
author = "Veli M{\"a}kinen and Gonzalo Navarro and Jouni Sir{\'e}n and Niko V{\"a}lim{\"a}ki",
note = "Volume: 5541 Proceeding volume:",
year = "2009",
doi = "10.1007/978-3-642-02008-7_9",
language = "English",
isbn = "978-3-642-02007-0",
series = "Lecture Notes in Computer Science",
publisher = "Springer",
number = "5541",
pages = "121--137",
editor = "Serafim Batzoglou",
booktitle = "Research in Computational Molecular Biology",
address = "United States",

}

Mäkinen, V, Navarro, G, Sirén, J & Välimäki, N 2009, Storage and retrieval of individual genomes. julkaisussa S Batzoglou (Toimittaja), Research in Computational Molecular Biology: 13th Annual International Conference, RECOMB 2009. Lecture Notes in Computer Science, Nro 5541, Springer, Sivut 121-137, Annual International Conference on Research in Computational Molecular Biology, Tucson, Arizona, Yhdysvallat (USA), 18/05/2009. https://doi.org/10.1007/978-3-642-02008-7_9

Storage and retrieval of individual genomes. / Mäkinen, Veli; Navarro, Gonzalo; Sirén, Jouni; Välimäki, Niko.

Research in Computational Molecular Biology: 13th Annual International Conference, RECOMB 2009. toim. / Serafim Batzoglou. Springer, 2009. s. 121-137 (Lecture Notes in Computer Science; Nro 5541).

Tutkimustuotos: Artikkeli kirjassa/raportissa/konferenssijulkaisussaKonferenssiartikkeliTieteellinenvertaisarvioitu

TY - GEN

T1 - Storage and retrieval of individual genomes

AU - Mäkinen, Veli

AU - Navarro, Gonzalo

AU - Sirén, Jouni

AU - Välimäki, Niko

N1 - Volume: 5541 Proceeding volume:

PY - 2009

Y1 - 2009

N2 - A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on suffix tree of Human Genome. However, this reduction factor remains constant when more sequences are added to the collection.We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n.We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.

AB - A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on suffix tree of Human Genome. However, this reduction factor remains constant when more sequences are added to the collection.We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n.We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.

KW - 113 Computer and information sciences

U2 - 10.1007/978-3-642-02008-7_9

DO - 10.1007/978-3-642-02008-7_9

M3 - Conference contribution

SN - 978-3-642-02007-0

T3 - Lecture Notes in Computer Science

SP - 121

EP - 137

BT - Research in Computational Molecular Biology

A2 - Batzoglou, Serafim

PB - Springer

ER -

Mäkinen V, Navarro G, Sirén J, Välimäki N. Storage and retrieval of individual genomes. julkaisussa Batzoglou S, toimittaja, Research in Computational Molecular Biology: 13th Annual International Conference, RECOMB 2009. Springer. 2009. s. 121-137. (Lecture Notes in Computer Science; 5541). https://doi.org/10.1007/978-3-642-02008-7_9