Erzya and Moksha Extended Corpora (ERME) version 2, Korp (beta) [text corpus]

Jack Rueter, Olga Erina

Research output: Non-textual output › Software › Scientific

Abstract

ERME contains predominantly original Erzya and Moksha literature. It consists of several media publications from the 19th to the 20th century. ERME was mapped in Saransk in 1997–2004, and in Helsinki it has been mapped since 2004. The most basic format used is XML, with a granularity extending to the chapter level. The goal is to create corpora with a granularity extending to the word level, with bibliographic references at the sentence level.

The texts are scrambled at the paragraph level.
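
The scrambling procedure itself is not described in this record. As a minimal sketch only, assuming each text is available as a list of paragraphs and each paragraph as a list of sentences, paragraph-level scrambling could look like the following. The function name `scramble_paragraphs`, the `seed` parameter, and the toy data are hypothetical and are not part of ERME or Korp.

```python
import random


def scramble_paragraphs(paragraphs, seed=0):
    """Return the paragraphs in shuffled order; sentences inside each paragraph keep their order."""
    shuffled = list(paragraphs)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (an assumption)
    return shuffled


# Hypothetical toy document: each paragraph is a list of sentences.
document = [
    ["P1 sentence 1.", "P1 sentence 2."],
    ["P2 sentence 1."],
    ["P3 sentence 1.", "P3 sentence 2."],
]

scrambled = scramble_paragraphs(document, seed=42)
```

Shuffling whole paragraphs keeps sentence context available for linguistic queries while making it impractical to read a work from start to finish.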

This new version contains the literature found in the older instance and has grown markedly. Whereas the old version was merely text divided into sentences, the new version adds lemmatization and dependencies. At the sentence level, a contextual translation (English or Finnish) may be present; at the word level, there is morphological encoding corresponding to each context. Preliminary morpho-syntactic analysis is carried out using HFST-based transducers together with Constraint Grammar disambiguation and function and dependency tagging, all of which have been developed in the Giellatekno infrastructure of the University of Tromsø.

The grammatical analysis and labeling comply with the practices developed in the Giellatekno infrastructure of the University of Tromsø. These practices are applied in the documentation of several Uralic languages.

Amount of processed material: more than 2.8 million words.

The amount of processed material is to be increased in the future. Future versions will strive to improve the morphological disambiguation of the corpus texts, the Constraint Grammar assignment of functions, and the conversion from CG output to UD-type dependencies.
Original language: English
Place of publication: Helsinki
Publisher: Kielipankki
Publication format: Internet
Size: 289,735 sentences
Status: Published - March 2023
MoE publication type: I2 ICT software

Fields of science

  • 6121 Languages
