Abstract

Text reuse is a methodological element of fundamental importance in humanities research: pieces of text that reappear across different documents, verbatim or paraphrased, provide invaluable information about the historical spread and evolution of ideas, enable researchers to quantify the intellectual influence of specific authors and literary works, and sometimes offer precious clues for revealing the identity of an author or detecting plagiarism. Traditionally, humanities scholars have studied text reuse at a very small scale, for example, when comparing the opinions of two authors (e.g., 18th-century philosophers) as they appear in their texts (e.g., their articles). Against that backdrop, large modern digitized corpora offer the opportunity to revolutionize humanities research, as they enable the joint analysis of text collections that span entire centuries and the detection of large-scale patterns that are impossible to discern with traditional small-scale analysis. For this opportunity to materialize, it is necessary to develop efficient data science systems that perform the corresponding analysis tasks.
In this paper, we share insights from ReceptionReader, a system for analyzing text reuse in large historical corpora. In particular, we describe the data management considerations that have allowed us to move from an originally deployed but unoptimized version of the system to an extensively evaluated and substantially optimized version, which is slated to become the newly deployed version of the system. The system is built upon large digitized corpora of 18th-century texts, which include almost all known books, articles, and newspapers from that period. It implements a data management and processing pipeline for billions of instances of text reuse. Its main functionality is to perform downstream analysis tasks related to text reuse, such as finding the instances of text reuse that stem from a given article or identifying the most reused quotes from a set of documents, with each task expressed as a database query. For the purposes of the paper, we discuss the related design options for data management, including various levels of database normalization, and for query execution, including combinations of distributed data processing (Apache Spark), an indexed row store engine (MariaDB Aria), and a compressed column store engine (MariaDB Columnstore). Moreover, we present an extensive evaluation in terms of various metrics of interest (latency, storage size, and computing costs) for varying workloads, and we offer insights from the trade-offs we observed and the choices that emerged as optimal in our setting. In summary, our results demonstrate that (1) for the workloads that are most relevant to text-reuse analysis, the row store engine (MariaDB Aria) emerges as the overall optimal choice for executing analysis tasks at the user end of the pipeline, and (2) big data processing (Apache Spark) is irreplaceable for all processing stages of the system's pipeline.
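
To make concrete how an analysis task can be expressed as a database query, the following is a minimal SQL sketch of the "most reused quotes from a set of documents" task. The table and column names (reuse_instances, source_doc_id, text_snippet) and the example document IDs are hypothetical illustrations for this abstract, not the system's actual schema.

    -- Hypothetical schema: reuse_instances(piece_id, source_doc_id, target_doc_id, text_snippet)
    -- Task: rank the most frequently reused quotes that originate from a given set of documents.
    SELECT text_snippet, COUNT(*) AS n_reuses
    FROM reuse_instances
    WHERE source_doc_id IN (1001, 1002, 1003)  -- the document set under study (example IDs)
    GROUP BY text_snippet
    ORDER BY n_reuses DESC
    LIMIT 20;

A query of this shape scans and aggregates many rows per request, which is why the paper's comparison of an indexed row store against a compressed column store for such workloads is central to the system's design.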
Original language: English
Status: Submitted - January 2024
MoE publication type: Not eligible
