This paper reports the principles behind the design of a tagset covering Russian morphosyntactic phenomena, the modifications made to the core tagset, and its evaluation. The tagset is based on the MULTEXT-East framework, and the design decisions aimed to balance the parameters important to linguists against the feasibility of detecting and disambiguating them automatically. The final tagset contains about 500 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set that can be shared with other researchers.
|Title of host publication||LREC 2008 : the Language Resources and Evaluation Conference|
|Status||Published - 2008|
|MoE publication type||A4 Article in conference proceedings|