Abstract
Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools for analysing digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to LDA. Our results show that these models, especially ETM, are slightly more resilient than LDA in the presence of noise in terms of topic quality and classification accuracy.
Original language | English |
---|---|
Title of host publication | Towards Open and Trustworthy Digital Societies. ICADL 2021 |
Editors | Hao-Ren Ke, Chei Sian Lee, Kazunari Sugiyama |
Number of pages | 9 |
Place of publication | Cham |
Publisher | Springer |
Publication date | 30 Nov 2021 |
Pages | 392-400 |
ISBN (print) | 978-3-030-91668-8 |
ISBN (electronic) | 978-3-030-91669-5 |
DOI | |
Status | Published - 30 Nov 2021 |
MoE publication type | A4 Article in conference proceedings |
Event | International Conference on Asia-Pacific Digital Libraries - online. Duration: 1 Dec 2021 → 3 Dec 2021. Conference number: 23. https://icadl.net/icadl2021/ |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Volume | 13133 |
ISSN (print) | 0302-9743 |
ISSN (electronic) | 1611-3349 |
Fields of science
- 113 Computer and information sciences