Re-OCR in Action – Using Tesseract to Re-OCR Finnish Fraktur from 19th and Early 20th Century Newspapers and Journals

Kimmo Tapio Kettunen, Jani Mika Olavi Koistinen

Research output: Conference materials, Paper, peer-review

Abstract

This paper presents work carried out at the National Library of Finland to improve the optical character recognition (OCR) quality of a Finnish historical newspaper and journal collection covering 1771–1910. The work and results reported are based on a 500 000-word ground truth (GT) sample of the Finnish-language part of the collection. The sample has three parallel parts: a manually corrected ground truth version, the original OCR produced with ABBYY FineReader v. 7 or v. 8, and a version re-OCRed with ABBYY FineReader v. 11. Based on this sample and its original page images, we have developed a re-OCR procedure using the open source software package Tesseract v. 3.04.01. Our re-OCR methods include image preprocessing techniques, the use of a morphological analyzer, and a set of weighting rules for the resulting words. Besides results based on the GT sample, we also present re-OCR results for a 10-year period of one newspaper in our collection, Uusi Suometar.
Original language: English
Pages: 11-13
Number of pages: 2
Publication status: Published - Apr 2018
Event: IAPR International Workshop on Document Analysis System: DAS 2018 - Wien, Austria
Duration: 24 Apr 2018–27 Apr 2018
Conference number: 13

Conference

Conference: IAPR International Workshop on Document Analysis System
Country/Territory: Austria
City: Wien
Period: 24/04/2018–27/04/2018

Fields of Science

  • 518 Media and communications
