Abstract

Text reuse is of fundamental importance in humanities research, as near-verbatim pieces of text in different documents provide invaluable information about the historical spread and evolution of ideas and the composition of cultural artifacts. Traditionally, scholars have studied text reuse at a very small scale, for example, when comparing the writings of two philosophers; however, modern digitized corpora spanning entire centuries promise to revolutionize humanities research through the detection of previously unobserved large-scale patterns. This paper presents insights from ReceptionReader, a system for large-scale text reuse analysis over almost all known 18th-century books, articles, and newspapers. The system implements a data management pipeline for billions of text reuse instances and supports analysis tasks based on database queries (e.g., retrieving the most reused quotes from queried documents). The paper describes the principled and extensive evaluations across different normalization levels, query execution engines, and queries of interest that led to an optimized system, and offers insights from the observed trade-offs and how they were resolved to fit specific requirements. In summary, the paper explains how, for our system, (1) the row-store engine (MariaDB Aria) with denormalized relations emerged as the optimal choice for front-end interfaces, while (2) big data processing (Apache Spark) proved irreplaceable for data preprocessing.
Original language: English
Journal: International Journal of Data Science and Analytics
Number of pages: 13
ISSN: 2364-415X
Status: Published - 7 Apr 2025
OKM publication type: A1 Original article in a scientific journal, peer-reviewed

Fields of science

  • 6121 Languages
  • 113 Computer and information sciences
