Projecting named entity recognizers without annotated or parallel corpora

Jue Hou, Maximilian Koppatz, Jose María Hoya Quecedo, Roman Yangarber

Research output: Chapter in book/report/conference proceeding; Conference contribution; Scientific; Peer-reviewed


Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.
Title of host publication: 22nd Nordic Conference on Computational Linguistics (NoDaLiDa): Proceedings of the Conference
Editors: Mareike Hartmann, Barbara Plank
Number of pages: 10
Publisher: Linköping University Electronic Press
Publication date: Oct 2019
ISBN (electronic): 978-91-7929-995-8
Status: Published - Oct 2019
MoE publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - Turku, Finland
Duration: 30 Sep 2019 to 2 Oct 2019
Conference number: 22


Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press
ISSN (print): 1650-3686
ISSN (electronic): 1650-3740
Name: NEALT Proceedings Series
Publisher: Linköping University Electronic Press


  • 113 Computer and information sciences
  • 6121 Languages
