Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

Research output: Article in book/report/conference proceedings; Conference article; Scientific; Peer-reviewed

Abstract

In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First, we create a model for Finnish Fraktur fonts. Second, we test Tesseract with the created Fraktur model and an Antiqua model on single images and on combinations of images with different image preprocessing methods. Against the commercial ABBYY FineReader toolkit, our method achieves a 27.48% (FineReader 7 or 8) and a 9.16% (FineReader 11) improvement at the word level.

Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents
Original language: English
Title: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
Editors: Jörg Tiedemann
Number of pages: 283
Publisher: Linköping University Electronic Press
Publication date: May 2017
Pages: 277
ISBN (electronic): 978-91-7685-601-7
Status: Published - May 2017
Ministry of Education and Culture (OKM) publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - Gothenburg, Sweden
Duration: 22 May 2017 to 24 May 2017
Conference number: 21 (NoDaLiDa)

Publication series

Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press, Linköpings universitet
Volume: 131
ISSN (print): 1650-3686
ISSN (electronic): 1650-3740
Name: NEALT Proceedings Series
Volume: 29

Fields of science

  • 113 Computer and information sciences
  • optical reading

Cite this

Koistinen, J. M. O., Kettunen, K. T., & Pääkkönen, T. A. (2017). Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. In J. Tiedemann (Ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden (p. 277). (Linköping Electronic Conference Proceedings; Vol. 131), (NEALT Proceedings Series; Vol. 29). Linköping University Electronic Press.
Koistinen, Jani Mika Olavi ; Kettunen, Kimmo Tapio ; Pääkkönen, Tuula Anneli. / Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Editor / Jörg Tiedemann. Linköping University Electronic Press, 2017. p. 277 (Linköping Electronic Conference Proceedings). (NEALT Proceedings Series).
@inproceedings{bedbbcc9842b4ff29c0acd9099670aef,
title = "Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur {\&} Antiqua Models and Image Preprocessing",
abstract = "In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First we create a model for Finnish Fraktur fonts. Second we test Tesseract with the created Fraktur model and Antiqua model on single images and combinations of images with different image preprocessing methods. Against commercial ABBYY FineReader toolkit our method achieves 27.48{\%} (FineReader 7 or 8) and 9.16{\%} (FineReader 11) improvement on word level. Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents",
keywords = "113 Computer and information sciences, optinen luku",
author = "Koistinen, {Jani Mika Olavi} and Kettunen, {Kimmo Tapio} and P{\"a}{\"a}kk{\"o}nen, {Tuula Anneli}",
note = "Volume: Proceeding volume:",
year = "2017",
month = "5",
language = "English",
series = "Link{\"o}ping Electronic Conference Proceedings",
publisher = "Link{\"o}ping University Electronic Press",
pages = "277",
editor = "J{\"o}rg Tiedemann",
booktitle = "Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden",
address = "Sweden",

}

Koistinen, JMO, Kettunen, KT & Pääkkönen, TA 2017, Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. in J Tiedemann (ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Linköping Electronic Conference Proceedings, vol. 131, NEALT Proceedings Series, vol. 29, Linköping University Electronic Press, p. 277, Nordic Conference on Computational Linguistics, Gothenburg, Sweden, 22/05/2017.

Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. / Koistinen, Jani Mika Olavi; Kettunen, Kimmo Tapio; Pääkkönen, Tuula Anneli.

Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. ed. / Jörg Tiedemann. Linköping University Electronic Press, 2017. p. 277 (Linköping Electronic Conference Proceedings; Vol. 131), (NEALT Proceedings Series; Vol. 29).

TY - GEN

T1 - Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

AU - Koistinen, Jani Mika Olavi

AU - Kettunen, Kimmo Tapio

AU - Pääkkönen, Tuula Anneli

N1 - Volume: Proceeding volume:

PY - 2017/5

Y1 - 2017/5

AB - In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First we create a model for Finnish Fraktur fonts. Second we test Tesseract with the created Fraktur model and Antiqua model on single images and combinations of images with different image preprocessing methods. Against commercial ABBYY FineReader toolkit our method achieves 27.48% (FineReader 7 or 8) and 9.16% (FineReader 11) improvement on word level. Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents

KW - 113 Computer and information sciences

KW - optinen luku

M3 - Conference contribution

T3 - Linköping Electronic Conference Proceedings

SP - 277

BT - Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

A2 - Tiedemann, Jörg

PB - Linköping University Electronic Press

ER -

Koistinen JMO, Kettunen KT, Pääkkönen TA. Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing. In Tiedemann J, editor, Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Linköping University Electronic Press. 2017. p. 277. (Linköping Electronic Conference Proceedings). (NEALT Proceedings Series).