Initial OCR Word Form List for Scanning of Ingrian, 1930s (2012-11-06)

Research output: Non-textual form > Software > Scientific

Abstract

OCR programs can reach a higher level of accuracy when supplied with a list of word forms that occur with high frequency in a given text. This initial word-form list (2012-11-06) is derived from the words found in a grammar of the Ingrian language by V. Junus, "Iƶoran keelen GRAMMATIKKA" (UCPEDGIZ, Leningrad-Moskva, 1936). The word forms that actually appear in the text are described in a finite-state transducer as lemma-stem pairs, with continuation lexica to cover morphological variation. The transducer was then used to generate over 180 thousand random forms to supplement the list of 8122 unique original word forms. The resulting list will soon be applied in the recognition of 20 Ingrian-language textbooks from the 1930s in a collaborative digitization pilot involving the National Library of Finland, the University of Helsinki Library and the National Library of Russia in St. Petersburg.
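The approach described in the abstract, lemma-stem pairs that point to continuation lexica, random generation of forms, and merging with the attested list, can be illustrated with a minimal Python sketch. This is not the author's transducer: the names `LEXICON`, `CONTINUATIONS`, `random_forms` and `build_word_list` are hypothetical, the stems and endings are invented placeholders rather than forms from the Junus grammar, and a plain dictionary lookup stands in for a real finite-state transducer with chained continuation lexica.

```python
import random

# Hypothetical lexicon: each entry is (lemma, stem, continuation lexicon name).
# All strings are invented placeholders, NOT attested Ingrian forms.
LEXICON = [
    ("lemma1", "stema", "NOUN_ENDINGS"),
    ("lemma2", "stemb", "NOUN_ENDINGS"),
    ("lemma3", "stemc", "VERB_ENDINGS"),
]

# Hypothetical continuation lexica: the endings a stem may take.
# A real lexicon would chain several continuation classes; this sketch uses one level.
CONTINUATIONS = {
    "NOUN_ENDINGS": ["", "n", "t", "ssa"],
    "VERB_ENDINGS": ["n", "t", "mme"],
}


def random_forms(n, seed=0):
    """Draw n random stem+ending combinations and return the distinct forms."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    forms = set()
    for _ in range(n):
        _lemma, stem, cont = rng.choice(LEXICON)
        forms.add(stem + rng.choice(CONTINUATIONS[cont]))
    return forms


def build_word_list(attested, n_random, seed=0):
    """Merge attested word forms with randomly generated ones, deduplicated and sorted."""
    return sorted(set(attested) | random_forms(n_random, seed))


if __name__ == "__main__":
    # One word form per line is assumed here as the OCR word-list format.
    for form in build_word_list(["stema", "stemct"], n_random=20):
        print(form)
```

Because generated forms are merged into a set, sampling many random forms yields at most as many new entries as the lexicon can actually produce, so the size of the final list depends on the coverage of the lexicon rather than on the number of draws.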
Original language: English
Publication status: Published - 6 Nov 2012
MoE publication type: I2 ICT software

Fields of Science

  • 6121 Languages
  • Ingrian language
  • OCR
  • HFST
  • Giellatekno
  • Kone Language Programme
  • Minority languages
  • Accessibility
  • Uralic languages
  • WORD-RECOGNITION

Cite this

@misc{8cd8cd8e23ff4387b519ebf8dfcbbd80,
title = "Initial OCR Word Form List for Scanning of Ingrian, 1930s (2012-11-06)",
abstract = "OCR programs can reach a higher level of accuracy when supplied with a list of word forms representing high frequency in a given text. This initial word-form list (2012-11-06) has been derived from the words available in a grammar of the Ingrian language by V. Junus, {"}Iƶoran keelen GRAMMATIKKA{"} (UCPEDGIZ, Leningrad-Moskva, 1936). The actual word forms appearing in the text have been described in a finite-state transducer in lemma-stem pairs with continuation lexica to cover morphological variation. The transducer has then been used to produce over 180 thousand random forms to enhance the list of 8122 unique original word forms. The resulting list will soon be applied in the recognition of 20 Ingrian-language textbooks from the 1930s in a collaborative digitizing pilot involving the National Library of Finland, the University of Helsinki Library and the National Library of Russia in St. Petersburg.",
keywords = "6121 Languages, Ingrian language, OCR, HFST, Giellatekno, Kone Language Programme, Minority languages, Accessibility, Uralic languages, WORD-RECOGNITION",
author = "Jack Rueter",
year = "2012",
month = "11",
day = "6",
language = "English",
}

Initial OCR Word Form List for Scanning of Ingrian, 1930s (2012-11-06). Rueter, Jack (Author). 2012.

TY - ADVS
T1 - Initial OCR Word Form List for Scanning of Ingrian, 1930s (2012-11-06)
AU - Rueter, Jack
PY - 2012/11/6
Y1 - 2012/11/6
N2 - OCR programs can reach a higher level of accuracy when supplied with a list of word forms representing high frequency in a given text. This initial word-form list (2012-11-06) has been derived from the words available in a grammar of the Ingrian language by V. Junus, "Iƶoran keelen GRAMMATIKKA" (UCPEDGIZ, Leningrad-Moskva, 1936). The actual word forms appearing in the text have been described in a finite-state transducer in lemma-stem pairs with continuation lexica to cover morphological variation. The transducer has then been used to produce over 180 thousand random forms to enhance the list of 8122 unique original word forms. The resulting list will soon be applied in the recognition of 20 Ingrian-language textbooks from the 1930s in a collaborative digitizing pilot involving the National Library of Finland, the University of Helsinki Library and the National Library of Russia in St. Petersburg.
AB - OCR programs can reach a higher level of accuracy when supplied with a list of word forms representing high frequency in a given text. This initial word-form list (2012-11-06) has been derived from the words available in a grammar of the Ingrian language by V. Junus, "Iƶoran keelen GRAMMATIKKA" (UCPEDGIZ, Leningrad-Moskva, 1936). The actual word forms appearing in the text have been described in a finite-state transducer in lemma-stem pairs with continuation lexica to cover morphological variation. The transducer has then been used to produce over 180 thousand random forms to enhance the list of 8122 unique original word forms. The resulting list will soon be applied in the recognition of 20 Ingrian-language textbooks from the 1930s in a collaborative digitizing pilot involving the National Library of Finland, the University of Helsinki Library and the National Library of Russia in St. Petersburg.
KW - 6121 Languages
KW - Ingrian language
KW - OCR
KW - HFST
KW - Giellatekno
KW - Kone Language Programme
KW - Minority languages
KW - Accessibility
KW - Uralic languages
KW - WORD-RECOGNITION
M3 - Software
ER -