FinnPos: an open-source morphological tagging and lemmatization toolkit for Finnish

Miikka Silfverberg, Teemu Ruokolainen, Krister Linden, Mikko Kurimo

Research output: Contribution to journal > Article > Scientific > peer-review

Abstract

This paper describes FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish. The morphological tagging model is based on the averaged structured perceptron classifier. Given training data, new taggers are estimated in a computationally efficient manner using a combination of beam search and model cascade. The lemmatization is performed employing a combination of a rule-based morphological analyzer, OMorFi, and a data-driven lemmatization model. The toolkit is readily applicable for tagging and lemmatization of running text with models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank. Empirical evaluation on these corpora shows that FinnPos performs favorably compared to reference systems in terms of tagging and lemmatization accuracy. In addition, we demonstrate that our system is highly competitive with regard to computational efficiency of learning new models and assigning analyses to novel sentences.
Original language: English
Journal: Language Resources and Evaluation
Volume: 50
Issue number: 4
Pages (from-to): 863-878
Number of pages: 16
ISSN: 1574-020X
DOI: 10.1007/s10579-015-9326-3
Publication status: Published - Dec 2016
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

@article{0789f0a3bba14d7da38be3b7960dc4de,
title = "FinnPos: an open-source morphological tagging and lemmatization toolkit for Finnish",
abstract = "This paper describes FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish. The morphological tagging model is based on the averaged structured perceptron classifier. Given training data, new taggers are estimated in a computationally efficient manner using a combination of beam search and model cascade. The lemmatization is performed employing a combination of a rule-based morphological analyzer, OMorFi, and a data-driven lemmatization model. The toolkit is readily applicable for tagging and lemmatization of running text with models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank. Empirical evaluation on these corpora shows that FinnPos performs favorably compared to reference systems in terms of tagging and lemmatization accuracy. In addition, we demonstrate that our system is highly competitive with regard to computational efficiency of learning new models and assigning analyses to novel sentences.",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Miikka Silfverberg and Teemu Ruokolainen and Krister Linden and Mikko Kurimo",
year = "2016",
month = "12",
doi = "10.1007/s10579-015-9326-3",
language = "English",
volume = "50",
pages = "863--878",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer",
number = "4",
}

FinnPos: an open-source morphological tagging and lemmatization toolkit for Finnish. / Silfverberg, Miikka; Ruokolainen, Teemu; Linden, Krister; Kurimo, Mikko.

In: Language Resources and Evaluation, Vol. 50, No. 4, 12.2016, p. 863-878.


TY - JOUR
T1 - FinnPos
T2 - an open-source morphological tagging and lemmatization toolkit for Finnish
AU - Silfverberg, Miikka
AU - Ruokolainen, Teemu
AU - Linden, Krister
AU - Kurimo, Mikko
PY - 2016/12
Y1 - 2016/12
AB - This paper describes FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish. The morphological tagging model is based on the averaged structured perceptron classifier. Given training data, new taggers are estimated in a computationally efficient manner using a combination of beam search and model cascade. The lemmatization is performed employing a combination of a rule-based morphological analyzer, OMorFi, and a data-driven lemmatization model. The toolkit is readily applicable for tagging and lemmatization of running text with models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank. Empirical evaluation on these corpora shows that FinnPos performs favorably compared to reference systems in terms of tagging and lemmatization accuracy. In addition, we demonstrate that our system is highly competitive with regard to computational efficiency of learning new models and assigning analyses to novel sentences.
KW - 113 Computer and information sciences
KW - 6121 Languages
U2 - 10.1007/s10579-015-9326-3
DO - 10.1007/s10579-015-9326-3
M3 - Article
VL - 50
SP - 863
EP - 878
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
SN - 1574-020X
IS - 4
ER -