Fast in-memory XPath search using compressed indexes

Diego Arroyuelo, Francisco Claude, Sebastian Maneth, Veli Mäkinen, Gonzalo Navarro, Kim Nguyen, Jouni Sirén, Niko Välimäki

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

Abstract

A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.
Original languageEnglish
Title of host publicationICDE 2010 : 26th IEEE International Conference on Data Engineering
Number of pages12
PublisherIEEE Computer Society
Publication date2010
Pages417-428
ISBN (Print)978-1-4244-5446-4
DOIs
Publication statusPublished - 2010
MoE publication typeA4 Article in conference proceedings
EventInternational Conference on Data Engineering - Long Beach, California, United States
Duration: 1 Mar 20106 Mar 2010
Conference number: 26 (ICDE)

Fields of Science

  • 113 Computer and information sciences

Cite this

Arroyuelo, D., Claude, F., Maneth, S., Mäkinen, V., Navarro, G., Nguyen, K., ... Välimäki, N. (2010). Fast in-memory XPath search using compressed indexes. In ICDE 2010: 26th IEEE International Conference on Data Engineering (pp. 417-428). IEEE Computer Society. https://doi.org/10.1109/ICDE.2010.5447858
Arroyuelo, Diego ; Claude, Francisco ; Maneth, Sebastian ; Mäkinen, Veli ; Navarro, Gonzalo ; Nguyen, Kim ; Sirén, Jouni ; Välimäki, Niko. / Fast in-memory XPath search using compressed indexes. ICDE 2010: 26th IEEE International Conference on Data Engineering. IEEE Computer Society, 2010. pp. 417-428
@inproceedings{b29f153438304cc49516f69aeaf9ca3a,
title = "Fast in-memory XPath search using compressed indexes",
abstract = "A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.",
keywords = "113 Computer and information sciences",
author = "Diego Arroyuelo and Francisco Claude and Sebastian Maneth and Veli M{\"a}kinen and Gonzalo Navarro and Kim Nguyen and Jouni Sir{\'e}n and Niko V{\"a}lim{\"a}ki",
note = "Volume: Proceeding volume:",
year = "2010",
doi = "10.1109/ICDE.2010.5447858",
language = "English",
isbn = "978-1-4244-5446-4",
pages = "417--428",
booktitle = "ICDE 2010",
publisher = "IEEE Computer Society",
address = "International",

}

Arroyuelo, D, Claude, F, Maneth, S, Mäkinen, V, Navarro, G, Nguyen, K, Sirén, J & Välimäki, N 2010, Fast in-memory XPath search using compressed indexes. in ICDE 2010: 26th IEEE International Conference on Data Engineering. IEEE Computer Society, pp. 417-428, International Conference on Data Engineering, Long Beach, California, United States, 01/03/2010. https://doi.org/10.1109/ICDE.2010.5447858

Fast in-memory XPath search using compressed indexes. / Arroyuelo, Diego; Claude, Francisco; Maneth, Sebastian; Mäkinen, Veli; Navarro, Gonzalo; Nguyen, Kim; Sirén, Jouni; Välimäki, Niko.

ICDE 2010: 26th IEEE International Conference on Data Engineering. IEEE Computer Society, 2010. p. 417-428.

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

TY - GEN

T1 - Fast in-memory XPath search using compressed indexes

AU - Arroyuelo, Diego

AU - Claude, Francisco

AU - Maneth, Sebastian

AU - Mäkinen, Veli

AU - Navarro, Gonzalo

AU - Nguyen, Kim

AU - Sirén, Jouni

AU - Välimäki, Niko

N1 - Volume: Proceeding volume:

PY - 2010

Y1 - 2010

N2 - A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.

AB - A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.

KW - 113 Computer and information sciences

U2 - 10.1109/ICDE.2010.5447858

DO - 10.1109/ICDE.2010.5447858

M3 - Conference contribution

SN - 978-1-4244-5446-4

SP - 417

EP - 428

BT - ICDE 2010

PB - IEEE Computer Society

ER -

Arroyuelo D, Claude F, Maneth S, Mäkinen V, Navarro G, Nguyen K et al. Fast in-memory XPath search using compressed indexes. In ICDE 2010: 26th IEEE International Conference on Data Engineering. IEEE Computer Society. 2010. p. 417-428 https://doi.org/10.1109/ICDE.2010.5447858