Abstract

Text reuse is of fundamental importance in humanities research, as near-verbatim pieces of text in different documents provide invaluable information about the historical spread and evolution of ideas and the composition of cultural artifacts. Traditionally, scholars have studied text reuse at a very small scale, for example, when comparing the writings of two philosophers; however, modern digitized corpora spanning entire centuries promise to revolutionize humanities research through the detection of previously unobserved large-scale patterns. This paper presents insights from ReceptionReader, a system for large-scale text reuse analysis over almost all known 18th-century books, articles, and newspapers. The system implements a data management pipeline for billions of text reuse instances and supports analysis tasks based on database queries (e.g., retrieving the most reused quotes from queried documents). The paper describes the principled and extensive evaluations across different normalization levels, query execution engines, and queries of interest that led to an optimized system, and offers insights from the observed trade-offs and how they were resolved to fit specific requirements. In summary, the paper explains how, for our system, (1) the row-store engine (MariaDB Aria) with denormalized relations emerged as the optimal choice for front-end interfaces, while (2) big data processing (Apache Spark) proved irreplaceable for data preprocessing.
Original language: English
Journal: International Journal of Data Science and Analytics
Number of pages: 13
ISSN: 2364-415X
Publication status: Published - 7 Apr 2025
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 6121 Languages
  • 113 Computer and information sciences
