Learning to lemmatize in the word representation space

Research output: Chapter in book/report/conference proceedings, Conference article, Scientific, peer-reviewed

Abstract

Lemmatization is often used with morphologically rich languages to address issues caused by morphological complexity, and it is typically performed by grammar-based lemmatizers. We propose an alternative in the form of a tool that performs lemmatization in the space of word embeddings. Word embeddings, as distributed representations, natively encode some information about the relationship between base and inflected forms, and we show that it is possible to learn a transformation that approximately maps the embeddings of inflected forms to the embeddings of the corresponding lemmas. This enables an alternative processing pipeline in which the lemmatizing transformation replaces traditional lemmatization in downstream processing for any application. We demonstrate the method for the Finnish language, where it outperforms traditional lemmatizers in an example task of document similarity comparison; the approach is language independent, however, and can be trained for new languages with mild requirements.
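As a rough illustration of what learning such a transformation can look like, the sketch below fits an affine map from inflected-form embeddings to lemma embeddings by stochastic gradient descent. The abstract does not state the form of the transformation or how it is trained, so the affine map, the identity initialization, the training loop, and the synthetic toy vectors (`lemma_vecs`, `inflected_vecs`, `offset`) are all assumptions made for this example, not the authors' method or data.

```python
import random


def apply_map(W, b, x):
    """Apply the affine map W x + b to a single embedding vector."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def mean_squared_error(W, b, inflected, lemmas):
    """Average squared distance between mapped inflected vectors and lemma vectors."""
    total = 0.0
    for x, y in zip(inflected, lemmas):
        pred = apply_map(W, b, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(inflected)


def fit_lemmatizing_map(inflected, lemmas, epochs=500, lr=0.05):
    """Fit W and b with plain SGD on the squared error (an assumed training scheme)."""
    d = len(inflected[0])
    # Start from the identity map: an inflected form is already close to its lemma.
    W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for _ in range(epochs):
        for x, y in zip(inflected, lemmas):
            err = [p - t for p, t in zip(apply_map(W, b, x), y)]
            for i in range(d):
                for j in range(d):
                    W[i][j] -= lr * err[i] * x[j]
                b[i] -= lr * err[i]
    return W, b


# Synthetic toy data (hypothetical): each "inflected" vector is its lemma
# vector shifted by a fixed offset, standing in for a regular inflection.
rng = random.Random(0)
dim = 4
offset = [0.5, -0.3, 0.2, 0.1]
lemma_vecs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(50)]
inflected_vecs = [[v + o for v, o in zip(vec, offset)] for vec in lemma_vecs]

identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
error_before = mean_squared_error(identity, [0.0] * dim, inflected_vecs, lemma_vecs)

W, b = fit_lemmatizing_map(inflected_vecs, lemma_vecs)
error_after = mean_squared_error(W, b, inflected_vecs, lemma_vecs)

print(f"error before: {error_before:.4f}, error after: {error_after:.2e}")
```

In a pipeline of the kind the abstract describes, the fitted map would be applied to word embeddings before a downstream step such as document similarity comparison, in place of running a separate lemmatizer on the text.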
Original language: English
Title of host publication: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Publisher: Linköping University Electronic Press
Publication date: May 2021
Pages: 249-258
ISBN (electronic): 978-91-7929-614-8
Status: Published - May 2021
OKM publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - [Online event], Reykjavik, Iceland
Duration: 31 May 2021 - 2 Jun 2021
Conference number: 23
https://nodalida2021.github.io/index.html

Publication series

Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press
Volume: 178
ISSN (electronic): 1650-3740
Name: NEALT Proceedings Series
Number: 45

Fields of science

  • 113 Computer and information sciences
