Abstract
Many large-scale investigations of textual data are based on the automated identification of various linguistic features. However, if the textual data is of lower quality, automated identification of linguistic features, particularly more complex ones, can be severely hampered. Data quality problems are particularly prominent with large datasets of historical text which have been made machine-readable using optical character recognition (OCR) technology, but it is unclear how much the identification of individual linguistic features is affected by the dirty OCR, and how features of varying complexity are influenced differently. In this paper, I analyze the effect of OCR quality on the automated identification of the set of linguistic features commonly used for multi-dimensional register analysis (MDA) by comparing their observed frequencies in the OCR-processed Eighteenth Century Collections Online (ECCO) and a clean baseline (ECCO-TCP). The results show that the identification of most features is disturbed more as the OCR quality decreases, but different features start degrading at different OCR quality levels and do so at different rates.
Original language | English |
---|---|
Title | Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages |
Editors | Mika Hämäläinen, Emily Öhman, Flammie Pirinen, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni, Niko Partanen, Jack Rueter |
Number of pages | 7 |
Place of publication | Stroudsburg |
Publisher | The Association for Computational Linguistics |
Publication date | 2023 |
Pages | 45-51 |
ISBN (electronic) | 979-8-89176-012-7 |
Status | Published - 2023 |
OKM publication type | A4 Article in conference proceedings |
Event | International Conference on Natural Language Processing for Digital Humanities - Waseda University, Tokyo, Japan. Duration: 1 Dec 2023 → 3 Dec 2023. Conference number: 3 |
Fields of science
- 6121 Languages
Projects
- RiCEP: Rise of commercial society and eighteenth-century publishing
Tolonen, M. (Principal Investigator) & Säily, T. (Co-Principal Investigator)
Academy of Finland, Finland, Suomen Akatemia Projektilaskutus
01/09/2020 → 31/08/2024
Project: Research project