Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Scientific, peer-reviewed

Abstract

In this paper we describe a method for improving the output of the optical character recognition (OCR) toolkit Tesseract on Finnish historical documents. First, we create a model for Finnish Fraktur fonts. Second, we test Tesseract with the created Fraktur model and an Antiqua model on single images and on combinations of images with different image preprocessing methods. Compared with the commercial ABBYY FineReader toolkit, our method achieves word-level improvements of 27.48% (FineReader 7 or 8) and 9.16% (FineReader 11).

Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents
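
The abstract does not say how the outputs from the different models and preprocessed images are combined. As an illustration only, the sketch below assumes one simple strategy: each OCR run yields the same number of word positions, each word carries a confidence score, and the highest-scoring candidate is kept per position. The names `WordCandidate` and `combine_candidates`, the source labels, and the sample words and scores are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WordCandidate:
    text: str          # recognized word
    confidence: float  # assumed 0-100 score reported by the OCR engine
    source: str        # hypothetical label, e.g. "fraktur/original"


def combine_candidates(runs: List[List[WordCandidate]]) -> List[WordCandidate]:
    """Keep the highest-confidence candidate at each word position.

    Assumption: all runs are already aligned, i.e. they contain the
    same number of words in the same order.
    """
    if not runs:
        return []
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("runs must be aligned to the same number of words")
    # zip(*runs) groups the candidates that compete for one position.
    return [max(position, key=lambda c: c.confidence) for position in zip(*runs)]


# Made-up example: two runs over the same two-word line.
runs = [
    [WordCandidate("Suomen", 91.0, "fraktur/original"),
     WordCandidate("sanomalchti", 62.0, "fraktur/original")],
    [WordCandidate("Suomcn", 70.0, "antiqua/binarized"),
     WordCandidate("sanomalehti", 88.0, "antiqua/binarized")],
]
best = combine_candidates(runs)
print(" ".join(c.text for c in best))  # prints "Suomen sanomalehti"
```

A real pipeline would also need to align words across runs, since different preprocessing can change word segmentation. That step is left out here.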
Original language: English
Title of host publication: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
Editors: Jörg Tiedeman
Number of pages: 283
Publisher: Linköping University Electronic Press
Publication date: May 2017
Pages: 277
ISBN (electronic): 978-91-7685-601-7
Status: Published - May 2017
MoE publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - Gothenburg, Sweden
Duration: 22 May 2017 to 24 May 2017
Conference number: 21 (NoDaLiDa)

Publication series

Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press, Linköping University
Volume: 131
ISSN (print): 1650-3686
ISSN (electronic): 1650-3740
Name: NEALT Proceedings Series
Volume: 29

Fields of science

  • 113 Computer and information sciences

Cite this

Koistinen, J. M. O., Kettunen, K. T., & Pääkkönen, T. A. (2017). Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. In J. Tiedeman (Ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden (p. 277). (Linköping Electronic Conference Proceedings; Vol. 131), (NEALT Proceedings Series; Vol. 29). Linköping University Electronic Press.
Koistinen, Jani Mika Olavi ; Kettunen, Kimmo Tapio ; Pääkkönen, Tuula Anneli. / Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. editor / Jörg Tiedeman. Linköping University Electronic Press, 2017. p. 277 (Linköping Electronic Conference Proceedings). (NEALT Proceedings Series).
@inproceedings{bedbbcc9842b4ff29c0acd9099670aef,
title = "Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur {\&} Antiqua Models and Image Preprocessing",
abstract = "In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First we create a model for Finnish Fraktur fonts. Second we test Tesseract with the created Fraktur model and Antiqua model on single images and combinations of images with different image preprocessing methods. Against commercial ABBYY FineReader toolkit our method achieves 27.48{\%} (FineReader 7 or 8) and 9.16{\%} (FineReader 11) improvement on word level. Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents",
keywords = "113 Computer and information sciences, optinen luku",
author = "Koistinen, {Jani Mika Olavi} and Kettunen, {Kimmo Tapio} and P{\"a}{\"a}kk{\"o}nen, {Tuula Anneli}",
year = "2017",
month = "5",
language = "English",
series = "Link{\"o}ping Electronic Conference Proceedings",
publisher = "Link{\"o}ping University Electronic Press",
pages = "277",
editor = "J{\"o}rg Tiedeman",
booktitle = "Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden",
address = "Sweden",

}

Koistinen, JMO, Kettunen, KT & Pääkkönen, TA 2017, Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. in J Tiedeman (ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Linköping Electronic Conference Proceedings, vol. 131, NEALT Proceedings Series, vol. 29, Linköping University Electronic Press, p. 277, Nordic Conference on Computational Linguistics, Gothenburg, Sweden, 22/05/2017.

Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. / Koistinen, Jani Mika Olavi; Kettunen, Kimmo Tapio; Pääkkönen, Tuula Anneli.

Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. ed. / Jörg Tiedeman. Linköping University Electronic Press, 2017. p. 277 (Linköping Electronic Conference Proceedings; Vol. 131), (NEALT Proceedings Series; Vol. 29).

TY - GEN

T1 - Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

AU - Koistinen, Jani Mika Olavi

AU - Kettunen, Kimmo Tapio

AU - Pääkkönen, Tuula Anneli

PY - 2017/5

Y1 - 2017/5

AB - In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First we create a model for Finnish Fraktur fonts. Second we test Tesseract with the created Fraktur model and Antiqua model on single images and combinations of images with different image preprocessing methods. Against commercial ABBYY FineReader toolkit our method achieves 27.48% (FineReader 7 or 8) and 9.16% (FineReader 11) improvement on word level. Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents

KW - 113 Computer and information sciences

KW - optinen luku

M3 - Conference contribution

T3 - Linköping Electronic Conference Proceedings

SP - 277

BT - Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

A2 - Tiedeman, Jörg

PB - Linköping University Electronic Press

ER -

Koistinen JMO, Kettunen KT, Pääkkönen TA. Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. In Tiedeman J, editor, Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Linköping University Electronic Press. 2017. p. 277. (Linköping Electronic Conference Proceedings). (NEALT Proceedings Series).