OCR programs can reach a higher level of accuracy when supplied with a list of the word forms that occur with high frequency in a given text. This initial word-form list (2012-11-06) has been derived from the words available in a grammar of the Ingrian language by V. Junus, "Iƶoran keelen GRAMMATIKKA" (UCPEDGIZ, Leningrad-Moskva, 1936). The word forms actually appearing in the text have been encoded in a finite-state transducer as lemma-stem pairs, with continuation lexica to cover morphological variation. The transducer has then been used to produce more than 180,000 random forms to enhance the list of 8122 unique original word forms. The resulting list will soon be applied in the recognition of 20 Ingrian-language textbooks from the 1930s in a collaborative digitization pilot involving the National Library of Finland, the University of Helsinki Library and the National Library of Russia in St. Petersburg.
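The enhancement step described above can be illustrated with a minimal Python sketch. It is not the transducer used for the list: the lexicon entries, continuation classes, suffixes and the function names `random_form` and `enhance` are all hypothetical placeholders, and a plain random walk over a dictionary stands in for sampling from a real finite-state transducer.

```python
import random

# Hypothetical toy lexicon: (lemma, stem, continuation class).
# These entries are placeholders, not forms taken from the grammar.
LEXICON = [
    ("lemma_a", "stemA", "NOUN"),
    ("lemma_b", "stemB", "VERB"),
]

# Hypothetical continuation lexica: class -> list of (suffix, next class).
# A next class of None marks the end of a word form.
CONTINUATIONS = {
    "NOUN": [("", "CASE"), ("-pl", "CASE")],
    "CASE": [("", None), ("-gen", None), ("-part", None)],
    "VERB": [("-1sg", None), ("-2sg", None), ("-3sg", None)],
}


def random_form(rng):
    """Pick a random lexicon entry and follow continuations to an end state."""
    _lemma, form, cls = rng.choice(LEXICON)
    while cls is not None:
        suffix, cls = rng.choice(CONTINUATIONS[cls])
        form += suffix
    return form


def enhance(original_forms, n_samples, seed=0):
    """Merge attested forms with randomly generated ones, without duplicates."""
    rng = random.Random(seed)
    forms = set(original_forms)
    for _ in range(n_samples):
        forms.add(random_form(rng))
    return sorted(forms)
```

Calling `enhance(attested_forms, 180_000)` would return the attested forms together with whatever distinct generated forms the toy lexicon allows. Using a set keeps forms that are both attested and generated from appearing twice.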
|Status||Published - 6 Nov 2012|
- 6121 Languages