Learning to lemmatize in the word representation space

Research output: Chapter in book/report/conference proceeding; Conference contribution; Scientific; Peer-reviewed

Abstract

Lemmatization is often used with morphologically rich languages to address issues caused by morphological complexity, and is typically performed by grammar-based lemmatizers. We propose an alternative in the form of a tool that performs lemmatization in the space of word embeddings. As distributed representations, word embeddings natively encode some information about the relationship between base and inflected forms, and we show that it is possible to learn a transformation that approximately maps the embeddings of inflected forms to the embeddings of the corresponding lemmas. This enables an alternative processing pipeline in which the lemmatizing transformation replaces traditional lemmatization in downstream processing for any application. We demonstrate the method on Finnish, where it outperforms traditional lemmatizers in an example task of document similarity comparison; the approach is language-independent, however, and can be trained for new languages with mild requirements.
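
The core idea of the abstract, learning a map from inflected-form embeddings to lemma embeddings and then using it downstream, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not state the form of the transformation, so a plain linear map fitted by least squares is used as a stand-in, the 3-dimensional vectors are invented toy numbers, and the helper names `fit_lemmatizing_map`, `nearest_lemma`, and `document_vector` are hypothetical.

```python
import numpy as np

# Toy 3-dimensional "embeddings" for four Finnish lemmas. The numbers are
# made up for illustration and do not come from any trained model.
lemmas = {
    "talo": np.array([1.0, 0.0, 0.0]),   # house
    "koira": np.array([0.0, 1.0, 0.0]),  # dog
    "kissa": np.array([0.0, 0.0, 1.0]),  # cat
    "auto": np.array([1.0, 1.0, 0.0]),   # car
}

# Assumption: inflected-form vectors are a fixed linear distortion of the
# lemma vectors. Real embeddings are far noisier than this.
distortion = np.array([[0.9, 0.2, 0.0],
                       [-0.1, 0.8, 0.3],
                       [0.2, 0.0, 1.1]])
inflected = {
    "talossa": lemmas["talo"] @ distortion,
    "koiralle": lemmas["koira"] @ distortion,
    "kissan": lemmas["kissa"] @ distortion,
    "autossa": lemmas["auto"] @ distortion,  # held out from training
}


def fit_lemmatizing_map(X, Y):
    """Least-squares fit of W such that X @ W approximates Y."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest_lemma(vec, lemma_vectors):
    """Return the lemma whose vector has the highest cosine similarity to vec."""
    return max(lemma_vectors, key=lambda w: cosine(vec, lemma_vectors[w]))


def document_vector(tokens, vectors, W=None):
    """Mean of token vectors, optionally mapped through W first (an assumed,
    very simple document representation)."""
    stacked = np.stack([vectors[t] for t in tokens])
    if W is not None:
        stacked = stacked @ W
    return stacked.mean(axis=0)


# Fit on three (inflected form, lemma) pairs.
train_pairs = [("talossa", "talo"), ("koiralle", "koira"), ("kissan", "kissa")]
X = np.stack([inflected[f] for f, _ in train_pairs])
Y = np.stack([lemmas[l] for _, l in train_pairs])
W = fit_lemmatizing_map(X, Y)

# Map an unseen inflected form into the lemma region of the space.
print(nearest_lemma(inflected["autossa"] @ W, lemmas))

# Compare a document of inflected forms with a document of lemmas.
doc_a = document_vector(["talossa", "kissan"], inflected, W)
doc_b = document_vector(["talo", "kissa"], lemmas)
print(round(cosine(doc_a, doc_b), 6))
```

In this toy setting the fitted map recovers the lemma vectors exactly because the data were generated by a linear distortion; with real embeddings the mapping would only be approximate, as the abstract notes.
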
Original language: English
Title of host publication: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Publisher: Linköping University Electronic Press
Publication date: May 2021
Pages: 249-258
ISBN (electronic): 978-91-7929-614-8
Status: Published - May 2021
MoE publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - [Online event], Reykjavik, Iceland
Duration: 31 May 2021 to 2 June 2021
Conference number: 23
https://nodalida2021.github.io/index.html

Publication series

Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press
Volume: 178
ISSN (electronic): 1650-3740
Name: NEALT Proceedings Series
Number: 45

Fields of science

  • 113 Computer and information sciences
