Abstract
A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications it is important to store the k-mers in small space, while keeping the representation easy and efficient to use (i.e. without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. Our algorithm first constructs the de Bruijn graph in linear time and then uses a Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.
Original language | English |
---|---|
Article number | 5 |
Journal | Algorithms for Molecular Biology |
Volume | 18 |
Issue number | 1 |
Number of pages | 21 |
ISSN | 1748-7188 |
DOIs | |
Publication status | Published - 4 Jul 2023 |
MoE publication type | A1 Journal article-refereed |
Fields of Science
- Bidirected arc-centric de Bruijn graph
- Eulerian cycle
- K-mer based methods
- Spectrum preserving string sets
- Suffix tree
- 113 Computer and information sciences