Projecting named entity recognizers without annotated or parallel corpora

Jue Hou, Maximilian Koppatz, Jose María Hoya Quecedo, Roman Yangarber

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › Peer-reviewed

Abstract

Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.
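The projection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the matching rule (a Finnish token counts as a match if it is the English name plus a short inflectional ending), the `MAX_SUFFIX_LEN` bound, and the function names `matches` and `project` are all assumptions made for illustration. The sketch takes entities that an English NER model is assumed to have already found and writes BIO tags onto a tokenized Finnish sentence.

```python
from typing import List, Tuple

# Assumed upper bound on the length of a Finnish inflectional ending.
MAX_SUFFIX_LEN = 4


def matches(name_token: str, target_token: str) -> bool:
    """Hypothetical rule: the target token is the name plus a short suffix."""
    name, target = name_token.lower(), target_token.lower()
    return target.startswith(name) and len(target) - len(name) <= MAX_SUFFIX_LEN


def project(entities: List[Tuple[str, str]], tokens: List[str]) -> List[str]:
    """Project (name, label) pairs onto target-language tokens as BIO tags."""
    tags = ["O"] * len(tokens)
    for name, label in entities:
        parts = name.split()
        n = len(parts)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            # Skip spans that already carry a projected tag.
            if any(tag != "O" for tag in tags[i:i + n]):
                continue
            # All but the last name token must match exactly (ignoring case);
            # only the last one is allowed to carry an inflectional ending.
            head_ok = all(
                p.lower() == w.lower() for p, w in zip(parts[:-1], window[:-1])
            )
            if head_ok and matches(parts[-1], window[-1]):
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
    return tags


# Entities assumed to come from an English NER model run on a matched article.
english_entities = [("Emmanuel Macron", "PER"), ("Nokia", "ORG")]
finnish_tokens = ["Emmanuel", "Macronin", "mukaan", "Nokian", "tulos", "parani"]
projected_tags = project(english_entities, finnish_tokens)
```

A prefix rule like this misses names whose stem changes under inflection, which is one reason a real system would need more careful linguistic rules. Sentences tagged this way could then serve as training data for a sequence tagger such as the BiLSTM-CRF model mentioned in the abstract.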
Original language: English
Title of host publication: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa)
Editors: Mareike Hartmann, Barbara Plank
Number of pages: 10
Place of publication: Turku
Publisher: Linköping University Electronic Press
Publication date: Oct 2019
Pages: 232-241
ISBN (electronic): 978-91-7929-995-8
Publication status: Published - Oct 2019
MoE publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - Turku, Finland
Duration: 30 Sep 2019 to 2 Oct 2019
Conference number: 22

Publication series

Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press
Number: 67
ISSN (print): 1650-3686
ISSN (electronic): 1650-3740
Name: NEALT Proceedings Series
Publisher: Linköping University Electronic Press
Number: 42

Fields of science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Hou, J., Koppatz, M., Hoya Quecedo, J. M., & Yangarber, R. (2019). Projecting named entity recognizers without annotated or parallel corpora. In M. Hartmann, & B. Plank (Eds.), Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 232-241). (Linköping Electronic Conference Proceedings; No. 67), (NEALT Proceedings Series; No. 42). Turku: Linköping University Electronic Press.
Hou, Jue ; Koppatz, Maximilian ; Hoya Quecedo, Jose María ; Yangarber, Roman. / Projecting named entity recognizers without annotated or parallel corpora. Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa). editor / Mareike Hartmann ; Barbara Plank. Turku : Linköping University Electronic Press, 2019. pp. 232-241 (Linköping Electronic Conference Proceedings; 67). (NEALT Proceedings Series; 42).
@inproceedings{7806aa5816c84b4d9d67a1349caf5361,
title = "Projecting named entity recognizers without annotated or parallel corpora",
abstract = "Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Jue Hou and Maximilian Koppatz and {Hoya Quecedo}, {Jose Mar{\'i}a} and Roman Yangarber",
year = "2019",
month = "10",
language = "English",
series = "Link{\"o}ping Electronic Conference Proceedings",
publisher = "Link{\"o}ping University Electronic Press",
number = "67",
pages = "232--241",
editor = "Mareike Hartmann and Barbara Plank",
booktitle = "Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa)",
address = "Sweden",

}

Hou, J, Koppatz, M, Hoya Quecedo, JM & Yangarber, R 2019, Projecting named entity recognizers without annotated or parallel corpora. in M Hartmann & B Plank (eds), Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping Electronic Conference Proceedings, no. 67, NEALT Proceedings Series, no. 42, Linköping University Electronic Press, Turku, pp. 232-241, Nordic Conference on Computational Linguistics, Turku, Finland, 30/09/2019.

Projecting named entity recognizers without annotated or parallel corpora. / Hou, Jue; Koppatz, Maximilian; Hoya Quecedo, Jose María; Yangarber, Roman.

Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa). ed. / Mareike Hartmann; Barbara Plank. Turku : Linköping University Electronic Press, 2019. pp. 232-241 (Linköping Electronic Conference Proceedings; No. 67), (NEALT Proceedings Series; No. 42).

TY - GEN

T1 - Projecting named entity recognizers without annotated or parallel corpora

AU - Hou, Jue

AU - Koppatz, Maximilian

AU - Hoya Quecedo, Jose María

AU - Yangarber, Roman

PY - 2019/10

Y1 - 2019/10

AB - Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.

KW - 113 Computer and information sciences

KW - 6121 Languages

M3 - Conference contribution

T3 - Linköping Electronic Conference Proceedings

SP - 232

EP - 241

BT - Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa)

A2 - Hartmann, Mareike

A2 - Plank, Barbara

PB - Linköping University Electronic Press

CY - Turku

ER -

Hou J, Koppatz M, Hoya Quecedo JM, Yangarber R. Projecting named entity recognizers without annotated or parallel corpora. In Hartmann M, Plank B, editors, Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa). Turku: Linköping University Electronic Press. 2019. p. 232-241. (Linköping Electronic Conference Proceedings; 67). (NEALT Proceedings Series; 42).