A simple alphabet-independent FM-index

Szymon Grabowski, Gonzalo Navarro, Rafal Przywarski, Alejandro Salinger, Veli Mäkinen

Forskningsoutput: TidskriftsbidragArtikelVetenskapligPeer review

Sammanfattning

We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, sigma, present in that structure. On a text of length n with zero-order entropy H-0, our index needs O(n(H-0 + 1)) bits of space, without any significant dependence on or. The average search time for a pattern of length m is O(m(H-0 + 1)), under reasonable assumptions. Each position of a text occurrence can be located in worst case time O((H-0 + 1) log n), while any text substring of length L can be retrieved in O((H-0 + 1)L) average time in addition to the previous worst case time. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement. We also explore other coding variants alternative to Huffman and exploit their synchronization properties. Our experimental results on various types of texts show that our indexes are highly competitive in the space/time tradeoff map.
Originalspråkengelska
TidskriftInternational Journal of Foundations of Computer Science
Volym17
Utgåva6
Sidor (från-till)1365-1384
Antal sidor20
ISSN0129-0541
DOI
StatusPublicerad - 2006
MoE-publikationstypA1 Tidskriftsartikel-refererad

Vetenskapsgrenar

  • 113 Data- och informationsvetenskap

Citera det här

Grabowski, Szymon ; Navarro, Gonzalo ; Przywarski, Rafal ; Salinger, Alejandro ; Mäkinen, Veli. / A simple alphabet-independent FM-index. I: International Journal of Foundations of Computer Science. 2006 ; Vol. 17, Nr. 6. s. 1365-1384.
@article{f4a13740fcc24b8aa6002924dcb298b2,
title = "A simple alphabet-independent FM-index",
abstract = "We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, sigma, present in that structure. On a text of length n with zero-order entropy H-0, our index needs O(n(H-0 + 1)) bits of space, without any significant dependence on or. The average search time for a pattern of length m is O(m(H-0 + 1)), under reasonable assumptions. Each position of a text occurrence can be located in worst case time O((H-0 + 1) log n), while any text substring of length L can be retrieved in O((H-0 + 1)L) average time in addition to the previous worst case time. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement. We also explore other coding variants alternative to Huffman and exploit their synchronization properties. Our experimental results on various types of texts show that our indexes are highly competitive in the space/time tradeoff map.",
keywords = "113 Computer and information sciences",
author = "Szymon Grabowski and Gonzalo Navarro and Rafal Przywarski and Alejandro Salinger and Veli M{\"a}kinen",
year = "2006",
doi = "10.1142/S0129054106004467",
language = "English",
volume = "17",
pages = "1365--1384",
journal = "International Journal of Foundations of Computer Science",
issn = "0129-0541",
publisher = "WORLD SCIENTIFIC PUBL CO PTE LTD",
number = "6",

}

A simple alphabet-independent FM-index. / Grabowski, Szymon; Navarro, Gonzalo; Przywarski, Rafal; Salinger, Alejandro; Mäkinen, Veli.

I: International Journal of Foundations of Computer Science, Vol. 17, Nr. 6, 2006, s. 1365-1384.

Forskningsoutput: TidskriftsbidragArtikelVetenskapligPeer review

TY - JOUR

T1 - A simple alphabet-independent FM-index

AU - Grabowski, Szymon

AU - Navarro, Gonzalo

AU - Przywarski, Rafal

AU - Salinger, Alejandro

AU - Mäkinen, Veli

PY - 2006

Y1 - 2006

N2 - We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, sigma, present in that structure. On a text of length n with zero-order entropy H-0, our index needs O(n(H-0 + 1)) bits of space, without any significant dependence on or. The average search time for a pattern of length m is O(m(H-0 + 1)), under reasonable assumptions. Each position of a text occurrence can be located in worst case time O((H-0 + 1) log n), while any text substring of length L can be retrieved in O((H-0 + 1)L) average time in addition to the previous worst case time. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement. We also explore other coding variants alternative to Huffman and exploit their synchronization properties. Our experimental results on various types of texts show that our indexes are highly competitive in the space/time tradeoff map.

AB - We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, sigma, present in that structure. On a text of length n with zero-order entropy H-0, our index needs O(n(H-0 + 1)) bits of space, without any significant dependence on or. The average search time for a pattern of length m is O(m(H-0 + 1)), under reasonable assumptions. Each position of a text occurrence can be located in worst case time O((H-0 + 1) log n), while any text substring of length L can be retrieved in O((H-0 + 1)L) average time in addition to the previous worst case time. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement. We also explore other coding variants alternative to Huffman and exploit their synchronization properties. Our experimental results on various types of texts show that our indexes are highly competitive in the space/time tradeoff map.

KW - 113 Computer and information sciences

U2 - 10.1142/S0129054106004467

DO - 10.1142/S0129054106004467

M3 - Article

VL - 17

SP - 1365

EP - 1384

JO - International Journal of Foundations of Computer Science

JF - International Journal of Foundations of Computer Science

SN - 0129-0541

IS - 6

ER -