Multiple Set Matching with Bloom Matrix and Bloom Vector

Research output: Contribution to journalArticleScientificpeer-review

Abstract

Bloom Filter is a space-efficient probabilistic data structure for checking the membership of elements in a set. Given multiple sets, a standard Bloom Filter is not sufficient when looking for the items to which an element or a set of input elements belong. An example case is searching for documents with keywords in a large text corpus, which is essentially a multiple set matching problem where the input is single or multiple keywords, and the result is a set of possible candidate documents. This article solves the multiple set matching problem by proposing two efficient Bloom Multifilters called Bloom Matrix and Bloom Vector, which generalize the standard Bloom Filter. Both structures are space-efficient and answer queries with a set of identifiers for multiple set matching problems. The space efficiency can be optimized according to the distribution of labels among multiple sets: Uniform and Zipf. Bloom Vector efficiently exploits the Zipf distribution of data for further space reduction. Indeed, both structures are much more space-efficient compared with the state-of-the-art, Bloofi. The results also highlight that a Lookup operation on Bloom Matrix is significantly faster than on Bloom Vector and Bloofi.

Original languageEnglish
Article number21
JournalACM Transactions on Knowledge Discovery from Data
Volume14
Issue number2
Number of pages21
ISSN1556-4681
DOIs
Publication statusPublished - Mar 2020
MoE publication typeA1 Journal article-refereed

Fields of Science

  • 113 Computer and information sciences
  • Bloom filter
  • multiple sets
  • Bloomier filter
  • uniform distribution
  • Zipf distribution
  • big data
  • FILTER
  • CACHE

Cite this