New Developments in Tagging Pre-modern Orthodox Slavic Texts

Yves Scherrer, Achim Rabus, Susanne Mocken

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we further explore the normalization path; at the same time, we use the statistical CRF-tagger MarMoT and a newly developed neural network tagger that cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our experiments show that while transfer experiments could not improve tagging performance significantly, state-of-the-art taggers reach between 90% and more than 95% tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community.
Original language: English
Journal: Scripta & e-Scripta
Volume: 18
Pages (from-to): 9-33
Number of pages: 25
ISSN: 1312-238X
Publication status: Published - 2018
MoE publication type: A1 Journal article-refereed
Event: International Congress of Slavists - University of Belgrade, Faculty of Philology, Belgrade, Serbia
Duration: 20 Aug 2018 to 27 Aug 2018
Conference number: 16
http://mks2018.fil.bg.ac.rs

Fields of Science

  • 6121 Languages

Cite this

Scherrer, Yves ; Rabus, Achim ; Mocken, Susanne. / New Developments in Tagging Pre-modern Orthodox Slavic Texts. In: Scripta & e-Scripta. 2018 ; Vol. 18. pp. 9-33.
@article{b1846bb7aa2b401784ef962cc99a5036,
title = "New Developments in Tagging Pre-modern Orthodox Slavic Texts",
abstract = "Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we further explore the normalization path; at the same time, we use the statistical CRF-tagger MarMoT and a newly developed neural network tagger that cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our experiments show that while transfer experiments could not improve tagging performance significantly, state-of-the-art taggers reach between 90{\%} and more than 95{\%} tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community.",
keywords = "6121 Languages",
author = "Yves Scherrer and Achim Rabus and Susanne Mocken",
year = "2018",
language = "English",
volume = "18",
pages = "9--33",
journal = "Scripta \& e-Scripta",
issn = "1312-238X",
publisher = "IZDATELSKI TSENTR BOYAN PENEV",
}
