Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First, we create a model for Finnish Fraktur fonts. Second, we test Tesseract with the created Fraktur model and the Antiqua model on single images and on combinations of images with different image preprocessing methods. Compared with the commercial ABBYY FineReader toolkit, our method achieves an improvement of 27.48% (FineReader 7 or 8) and 9.16% (FineReader 11) at the word level.

Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents

Original language: English
Title of host publication: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
Editors: Jörg Tiedemann
Number of pages: 283
Publisher: Linköping University Electronic Press
Publication date: May 2017
Pages: 277
ISBN (Electronic): 978-91-7685-601-7
Publication status: Published - May 2017
MoE publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - Gothenburg, Sweden
Duration: 22 May 2017 - 24 May 2017
Conference number: 21 (NoDaLiDa)

Publication series

Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press, Linköpings universitet
Volume: 131
ISSN (Print): 1650-3686
ISSN (Electronic): 1650-3740
Name: NEALT Proceedings Series
Volume: 29

Fields of Science

  • 113 Computer and information sciences

Cite this

Koistinen, J. M. O., Kettunen, K. T., & Pääkkönen, T. A. (2017). Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. In J. Tiedemann (Ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden (pp. 277). (Linköping Electronic Conference Proceedings; Vol. 131), (NEALT Proceedings Series; Vol. 29). Linköping University Electronic Press.
Koistinen, Jani Mika Olavi ; Kettunen, Kimmo Tapio ; Pääkkönen, Tuula Anneli. / Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. editor / Jörg Tiedemann. Linköping University Electronic Press, 2017. pp. 277 (Linköping Electronic Conference Proceedings). (NEALT Proceedings Series).
@inproceedings{bedbbcc9842b4ff29c0acd9099670aef,
title = "Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing",
abstract = "In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First we create a model for Finnish Fraktur fonts. Second we test Tesseract with the created Fraktur model and Antiqua model on single images and combinations of images with different image preprocessing methods. Against commercial ABBYY FineReader toolkit our method achieves 27.48{\%} (FineReader 7 or 8) and 9.16{\%} (FineReader 11) improvement on word level. Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents",
keywords = "113 Computer and information sciences, optinen luku",
author = "Koistinen, {Jani Mika Olavi} and Kettunen, {Kimmo Tapio} and P{\"a}{\"a}kk{\"o}nen, {Tuula Anneli}",
year = "2017",
month = "5",
language = "English",
series = "Link{\"o}ping Electronic Conference Proceedings",
publisher = "Link{\"o}ping University Electronic Press",
pages = "277",
editor = "J{\"o}rg Tiedemann",
booktitle = "Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden",
address = "Sweden",

}

Koistinen, JMO, Kettunen, KT & Pääkkönen, TA 2017, Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. in J Tiedemann (ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Linköping Electronic Conference Proceedings, vol. 131, NEALT Proceedings Series, vol. 29, Linköping University Electronic Press, pp. 277, Nordic Conference on Computational Linguistics, Gothenburg, Sweden, 22/05/2017.

Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. / Koistinen, Jani Mika Olavi; Kettunen, Kimmo Tapio; Pääkkönen, Tuula Anneli.

Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. ed. / Jörg Tiedemann. Linköping University Electronic Press, 2017. p. 277 (Linköping Electronic Conference Proceedings; Vol. 131), (NEALT Proceedings Series; Vol. 29).


TY  - GEN
T1  - Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing
AU  - Koistinen, Jani Mika Olavi
AU  - Kettunen, Kimmo Tapio
AU  - Pääkkönen, Tuula Anneli
PY  - 2017/5
Y1  - 2017/5
AB  - In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First we create a model for Finnish Fraktur fonts. Second we test Tesseract with the created Fraktur model and Antiqua model on single images and combinations of images with different image preprocessing methods. Against commercial ABBYY FineReader toolkit our method achieves 27.48% (FineReader 7 or 8) and 9.16% (FineReader 11) improvement on word level. Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents
KW  - 113 Computer and information sciences
KW  - optinen luku
M3  - Conference contribution
T3  - Linköping Electronic Conference Proceedings
SP  - 277
BT  - Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
A2  - Tiedemann, Jörg
PB  - Linköping University Electronic Press
ER  -

Koistinen JMO, Kettunen KT, Pääkkönen TA. Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. In Tiedemann J, editor, Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Linköping University Electronic Press. 2017. p. 277. (Linköping Electronic Conference Proceedings). (NEALT Proceedings Series).