Initial OCR Word Form List for Scanning of Ingrian, 1930s (2012-11-06)

    Research output: Non-textual form › Software › Scientific

    Abstract

    OCR programs can reach higher accuracy when supplied with a list of the word forms that occur frequently in the text being recognized. This initial word-form list (2012-11-06) is derived from the words in a grammar of the Ingrian language by V. Junus, "Iƶoran keelen GRAMMATIKKA" (UCPEDGIZ, Leningrad-Moskva, 1936). The word forms attested in the text are described in a finite-state transducer as lemma-stem pairs with continuation lexica that cover morphological variation. The transducer was then used to generate over 180 thousand random forms, which supplement the 8122 unique original word forms. The resulting list will soon be applied in the recognition of 20 Ingrian-language textbooks from the 1930s, in a collaborative digitization pilot involving the National Library of Finland, the University of Helsinki Library and the National Library of Russia in St. Petersburg.
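The augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the transducer used for this list: the `LEXICA` table, the placeholder stems and suffixes, and the function names `random_form` and `augment` are all hypothetical, and a plain dictionary of continuation lexica stands in for a real finite-state toolkit. The sketch only shows the general idea of walking from a root lexicon through continuation lexica at random and merging the generated forms with an attested word list.

```python
import random

# Hypothetical toy lexicon. Each lexicon name maps to (piece, continuation) pairs.
# The stems and suffixes are placeholders, not real Ingrian data.
LEXICA = {
    "Root": [("stem_a", "Case"), ("stem_b", "Case")],
    "Case": [("", "End"), ("-x", "End"), ("-y", "Number")],
    "Number": [("-z", "End")],
}


def random_form(rng, start="Root"):
    """Follow randomly chosen continuations from `start` until "End" is reached."""
    state, parts = start, []
    while state != "End":
        piece, state = rng.choice(LEXICA[state])
        parts.append(piece)
    return "".join(parts)


def augment(original, n_samples, seed=0):
    """Return the sorted union of the attested forms and randomly generated ones."""
    rng = random.Random(seed)
    forms = set(original)
    for _ in range(n_samples):
        forms.add(random_form(rng))
    return sorted(forms)


if __name__ == "__main__":
    attested = ["stem_a", "stem_b-x"]  # stands in for the original word list
    for form in augment(attested, n_samples=200):
        print(form)
```

Because the result is a set union, duplicates among the random forms collapse, so the final list can be much shorter than the number of samples drawn.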
    Original language: English
    Status: Published - 6 Nov 2012
    MinEdu publication type: I2 ICT applications

    Fields of Science

    • 6121 Languages
