Effect of data quality on the automated identification of register features in Eighteenth Century Collections Online

Research output: Chapter in book/report/conference proceeding; Conference contribution; Scientific; Peer-reviewed


Many large-scale investigations of textual data are based on the automated identification of various linguistic features. However, if the textual data is of lower quality, automated identification of linguistic features, particularly more complex ones, can be severely hampered. Data quality problems are particularly prominent with large datasets of historical text which have been made machine-readable using optical character recognition (OCR) technology, but it is unclear how much the identification of individual linguistic features is affected by dirty OCR, and how features of varying complexity are influenced differently. In this paper, I analyze the effect of OCR quality on the automated identification of the set of linguistic features commonly used for multi-dimensional register analysis (MDA) by comparing their observed frequencies in the OCR-processed Eighteenth Century Collections Online (ECCO) and a clean baseline (ECCO-TCP). The results show that the identification of most features is disturbed more as the OCR quality decreases, but different features start degrading at different OCR quality levels and do so at different rates.
Title of host publication: Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Editors: Mika Hämäläinen, Emily Öhman, Flammie Pirinen, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni, Niko Partanen, Jack Rueter
Number of pages: 7
Publisher: The Association for Computational Linguistics
ISBN (electronic): 979-8-89176-012-7
Status: Published - 2023
MoE publication type: A4 Article in conference proceedings
Event: International Conference on Natural Language Processing for Digital Humanities - Waseda University, Tokyo, Japan
Duration: 1 Dec 2023 to 3 Dec 2023
Conference number: 3


  • 6121 Languages
