Projecting named entity recognizers without annotated or parallel corpora

Jue Hou, Maximilian Koppatz, Jose María Hoya Quecedo, Roman Yangarber

Research output: Article in book/report/conference proceedings; Conference article; Scientific; Peer-reviewed

Abstract

Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.
Original language: English
Title of host publication: 22nd Nordic Conference on Computational Linguistics (NoDaLiDa): Proceedings of the Conference
Editors: Mareike Hartmann, Barbara Plank
Number of pages: 10
Place of publication: Linköping
Publisher: Linköping University Electronic Press
Publication date: Oct 2019
Pages: 232-241
ISBN (electronic): 978-91-7929-995-8
Status: Published - Oct 2019
OKM publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - Turku, Finland
Duration: 30 Sep 2019 to 2 Oct 2019
Conference number: 22

Publication series

Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press
Number: 67
ISSN (print): 1650-3686
ISSN (electronic): 1650-3740

Name: NEALT Proceedings Series
Publisher: Linköping University Electronic Press
Number: 42

Fields of science

  • 113 Computer and information sciences
  • 6121 Linguistics
