Projecting named entity recognizers without annotated or parallel corpora

Jue Hou, Maximilian Koppatz, Jose María Hoya Quecedo, Roman Yangarber

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Scientific > peer-review

Abstract

Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.
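
The projection step described in the abstract can be illustrated with a toy sketch. The abstract does not specify the linguistic rules, so the matching rule below is an assumption made purely for illustration: a Finnish token is treated as a mention of an English entity if it begins with the entity name minus its last two characters, which tolerates case suffixes and simple stem changes. The function names `project_entities` and `stem_prefix` and the `min_stem` parameter are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def stem_prefix(word: str, min_stem: int = 4) -> str:
    """Drop up to two trailing characters, keeping at least min_stem characters.

    Assumption for illustration only: Finnish inflection mostly alters the
    end of a name (Helsinki -> Helsingissä), so a shortened prefix still matches.
    """
    return word[:max(min_stem, len(word) - 2)]


def project_entities(
    en_entities: List[Tuple[str, str]],
    fi_tokens: List[str],
    min_stem: int = 4,
) -> List[str]:
    """Project (name, label) pairs found in English text onto Finnish tokens.

    Returns one BIO tag per Finnish token.
    """
    tags = ["O"] * len(fi_tokens)
    lowered = [t.lower() for t in fi_tokens]
    for name, label in en_entities:
        parts = name.lower().split()
        n = len(parts)
        if n == 0:
            continue
        last_prefix = stem_prefix(parts[-1], min_stem)
        for i in range(len(fi_tokens) - n + 1):
            window = lowered[i:i + n]
            # Every token except the last must match exactly; the last token
            # may carry an inflectional suffix, so only its prefix is compared.
            exact_head = window[:-1] == parts[:-1]
            inflected_tail = window[-1].startswith(last_prefix)
            untagged = all(tag == "O" for tag in tags[i:i + n])
            if exact_head and inflected_tail and untagged:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
    return tags


# Entities as an English NER model might return them for a matched English article.
english_entities = [("Helsinki", "LOC"), ("Sauli Niinistö", "PER")]
finnish_tokens = ["Presidentti", "Sauli", "Niinistö", "puhui", "Helsingissä", "."]
print(list(zip(finnish_tokens, project_entities(english_entities, finnish_tokens))))
```

A prefix rule this loose would over-match short or ambiguous names, so it should be read only as a picture of the general idea: entity names found by an English model are searched for in a chronologically matched Finnish document, and the matched tokens become training labels for the Finnish model.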
Original language: English
Title of host publication: 22nd Nordic Conference on Computational Linguistics (NoDaLiDa) : Proceedings of the Conference
Editors: Mareike Hartmann, Barbara Plank
Number of pages: 10
Place of Publication: Linköping
Publisher: Linköping University Electronic Press
Publication date: Oct 2019
Pages: 232-241
ISBN (Electronic): 978-91-7929-995-8
Publication status: Published - Oct 2019
MoE publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - Turku, Finland
Duration: 30 Sept 2019 - 2 Oct 2019
Conference number: 22

Publication series

Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press
Number: 67
ISSN (Print): 1650-3686
ISSN (Electronic): 1650-3740

Name: NEALT Proceedings Series
Publisher: Linköping University Electronic Press
Number: 42

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages
