An OCR system for the Unified Northern Alphabet

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Scientific › peer-review

Abstract

This paper presents experiments conducted to build a functional OCR model for the Unified Northern Alphabet. This writing system was used between 1931 and 1937 for 16 (Uralic and non-Uralic) minority languages spoken in the Soviet Union. The character accuracy of the developed model exceeds 98% and clearly shows cross-linguistic applicability. The tests described here therefore also include general guidelines for the amount of training data needed to bootstrap an OCR system under similar conditions.
Original language: English
Title of host publication: Proceedings of the fifth Workshop on Computational Linguistics for Uralic Languages
Number of pages: 13
Publisher: The Association for Computational Linguistics
Publication date: 2019
Pages: 77-89
ISBN (Electronic): 978-1-948087-92-6
Publication status: Published - 2019
MoE publication type: A3 Book chapter
Event: International Workshop on Computational Linguistics for Uralic Languages - Tartu, Estonia
Duration: 7 Jan 2019 - 9 Jan 2019
Conference number: 5

Fields of Science

  • 6121 Languages

Cite this

Partanen, N., & Rießler, M. (2019). An OCR system for the Unified Northern Alphabet. In Proceedings of the fifth Workshop on Computational Linguistics for Uralic Languages (pp. 77-89). The Association for Computational Linguistics.