Efficient compressed indexing for approximate top-k string retrieval.

Héctor Ricardo Ferrada Escobar, Gonzalo Navarro

Tutkimustuotos: Artikkeli kirjassa/raportissa/konferenssijulkaisussaKonferenssiartikkeliTieteellinenvertaisarvioitu

Abstrakti

Given a collection of strings (called documents), the {\em top-k document retrieval} problem is that of, given a string pattern p, finding the k documents where p appears most often. This is a basic task in most information retrieval scenarios. The best current implementations require 20-30 bits per character (bpc) and k to 4k microseconds per query, or 12-24 bpc and 1-10 milliseconds per query. We introduce a Lempel-Ziv compressed data structure that occupies 5-10 bpc to answer queries in around k microseconds. The drawback is that the answer is approximate, but we show that its quality improves asymptotically with the size of the collection, being over 85% already for patterns of length 4-6 on rather small collections, and improving for larger ones.
Alkuperäiskielienglanti
OtsikkoUnknown host publication
Julkaisupäivä2014
TilaJulkaistu - 2014
OKM-julkaisutyyppiA4 Artikkeli konferenssijulkaisuussa
TapahtumaSymposium on String Processing and Information Retrieval - Ouro Preto, Brasilia
Kesto: 1 tammik. 1800 → …
Konferenssinumero: 21

Siteeraa tätä