Initial OCR Word Form List for Scanning of Ingrian, 1930s (2012-11-06)

    Research output: Non-textual form, Software, Scientific


    OCR programs can reach higher accuracy when supplied with a list of the word forms that occur frequently in a given text. This initial word-form list (2012-11-06) was derived from the words in a grammar of the Ingrian language by V. Junus, "Iƶoran keelen GRAMMATIKKA" (UCPEDGIZ, Leningrad-Moskva, 1936). The word forms attested in the text are described in a finite-state transducer as lemma-stem pairs with continuation lexica that cover morphological variation. The transducer was then used to produce over 180 thousand random forms to extend the list of 8122 unique original word forms. The resulting list will soon be applied in the recognition of 20 Ingrian-language textbooks from the 1930s in a collaborative digitizing pilot involving the National Library of Finland, the University of Helsinki Library and the National Library of Russia in St. Petersburg.
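
The general idea of extending an attested word list with randomly generated forms can be illustrated with a minimal sketch. It is not the actual transducer or toolchain behind this list: the stems, continuation classes and suffixes below are hypothetical placeholders, and plain Python dictionaries stand in for the lemma-stem pairs and continuation lexica.

```python
import random

# Hypothetical toy lexicon (placeholders, not Ingrian data):
# each stem points to the continuation class it starts in.
STEMS = {"stem1": "NOUN", "stem2": "NOUN", "stem3": "VERB"}

# Each continuation class lists (suffix, next class) pairs; "END" stops.
CONTINUATIONS = {
    "NOUN": [("", "END"), ("-a", "END"), ("-b", "CLITIC")],
    "VERB": [("-x", "END"), ("-y", "CLITIC")],
    "CLITIC": [("", "END"), ("-k", "END")],
}


def random_form(rng):
    """Walk the continuation classes from a random stem until END."""
    stem = rng.choice(sorted(STEMS))
    form, state = stem, STEMS[stem]
    while state != "END":
        suffix, state = rng.choice(CONTINUATIONS[state])
        form += suffix
    return form


def build_word_list(original_forms, n_random, seed=0):
    """Merge attested forms with randomly generated ones, without duplicates."""
    rng = random.Random(seed)
    generated = {random_form(rng) for _ in range(n_random)}
    return sorted(set(original_forms) | generated)
```

Calling `build_word_list(attested_forms, 1000)` returns a sorted, duplicate-free list that an OCR engine could take as a user word list. The seed parameter is only there to make the sketch reproducible.
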
    Original language: English
    Publication status: Published - 6 Nov 2012
    MoE publication type: I2 ICT software

    Fields of Science

    • 6121 Languages
    • Ingrian language
    • OCR
    • HFST
    • Giellatekno
    • Kone Language Programme
    • Minority languages
    • Accessibility
    • Uralic languages
