Abstract
This paper presents work carried out at the National Library of Finland to improve the optical character recognition (OCR) quality of a Finnish historical newspaper and journal collection covering 1771–1910. The work and results reported here are based on a 500 000 word ground truth (GT) sample of the Finnish-language part of the collection. The sample has three parallel parts: a manually corrected ground truth version, the original OCR produced with ABBYY FineReader v. 7 or v. 8, and a version re-OCRed with ABBYY FineReader v. 11. Based on this sample and its original page images, we have developed a re-OCRing procedure using the open source software package Tesseract v. 3.04.01. Our re-OCR methods include image preprocessing techniques, the use of a morphological analyzer, and a set of weighting rules for the resulting words. Besides results based on the GT sample, we also present re-OCR results for a 10-year period of one newspaper in our collection, Uusi Suometar.
Original language | English |
---|---|
Pages | 11-13 |
Number of pages | 2 |
Publication status | Published - Apr 2018 |
Event | IAPR International Workshop on Document Analysis System: DAS 2018 - Wien, Austria; Duration: 24 Apr 2018 → 27 Apr 2018; Conference number: 13 |
Conference
Conference | IAPR International Workshop on Document Analysis System |
---|---|
Country/Territory | Austria |
City | Wien |
Period | 24/04/2018 → 27/04/2018 |
Fields of Science
- 518 Media and communications