Evaluating the Robustness of Embedding-based Topic Models to OCR Noise

Elaine Zosa, Mark Granroth-Wilding, Stephen Mutuvi, Antoine Doucet

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review


Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools for analysing digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to LDA. Our results show that these models, especially ETM, are slightly more resilient than LDA in the presence of noise in terms of topic quality and classification accuracy.
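To illustrate the kind of degradation the abstract describes, the sketch below injects synthetic character-substitution noise into a tiny toy corpus and counts how many new word types the noise creates. This is a minimal assumption-laden illustration, not the paper's method: the noise model (random lowercase substitutions at a fixed rate) and the example documents are invented here, whereas the paper works with real OCR noise from digitised corpora.

```python
import random

random.seed(0)

def add_ocr_noise(text, rate=0.2):
    """Randomly substitute alphabetic characters to mimic OCR errors.

    A deliberately simple noise model (substitutions only, length-preserving);
    real OCR noise also includes insertions, deletions, and merged tokens.
    """
    out = []
    for ch in text:
        if ch.isalpha() and random.random() < rate:
            out.append(random.choice("abcdefghijklmnopqrstuvwxyz"))
        else:
            out.append(ch)
    return "".join(out)

# Invented example documents, standing in for a digitised corpus.
docs = [
    "the parliament passed the new budget law",
    "farmers sold wheat and barley at the grain market",
]
noisy_docs = [add_ocr_noise(d, rate=0.3) for d in docs]

# OCR noise fragments the vocabulary: corrupted variants of the same word
# become distinct types. Count-based models like LDA treat each variant as an
# unrelated word, while embedding-based models (Gaussian LDA, ETM) can place
# related strings near each other in embedding space, one plausible reason for
# the resilience the paper reports.
clean_vocab = {w for d in docs for w in d.split()}
noisy_vocab = {w for d in noisy_docs for w in d.split()}
new_types = noisy_vocab - clean_vocab
print(f"clean types={len(clean_vocab)} "
      f"noisy types={len(noisy_vocab)} "
      f"new noisy types={len(new_types)}")
```

At higher noise rates nearly every token becomes a new type, which is why bag-of-words inference suffers as OCR quality drops.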
Original language: English
Title of host publication: Towards Open and Trustworthy Digital Societies. ICADL 2021
Editors: Hao-Ren Ke, Chei Sian Lee, Kazunari Sugiyama
Number of pages: 9
Place of publication: Cham
Publication date: 30 Nov 2021
ISBN (Print): 978-3-030-91668-8
ISBN (Electronic): 978-3-030-91669-5
Publication status: Published - 30 Nov 2021
MoE publication type: A4 Article in conference proceedings
Event: International Conference on Asia-Pacific Digital Libraries - online
Duration: 1 Dec 2021 – 3 Dec 2021
Conference number: 23

Publication series

Name: Lecture Notes in Computer Science
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Fields of Science

  • 113 Computer and information sciences
  • topic modelling
  • word embeddings
  • OCR
