Boosting Neural Machine Translation from Finnish to Northern Sámi with Rule-Based Backtranslation

Research output: Chapter in book/report/conference proceedings › Conference contribution › Scientific › Peer-reviewed

Abstract

We consider a low-resource translation task from Finnish into Northern Sámi. Collecting all available parallel data between the languages, we obtain around 30,000 sentence pairs. However, there exists a significantly larger monolingual Northern Sámi corpus, as well as a rule-based machine translation (RBMT) system between the languages. To make the best use of the monolingual data in a neural machine translation (NMT) system, we use the backtranslation approach to create synthetic parallel data from it using both NMT and RBMT systems. Evaluating the results on an in-domain test set and a small out-of-domain set, we find that RBMT backtranslation clearly outperforms NMT backtranslation on the out-of-domain test set and also slightly on the in-domain data, even though the NMT backtranslation model itself achieved clearly better BLEU scores than the RBMT system on the in-domain data. In addition, combining both backtranslated data sets improves on the RBMT approach only for the in-domain test set. This suggests that the RBMT system provides general-domain knowledge that cannot be found in the relatively small parallel training data.
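The backtranslation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `backtranslate` and `build_training_set` are hypothetical, and the reverse (Northern Sámi to Finnish) systems are represented by plain callables standing in for an RBMT or NMT model.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# (Finnish source, Northern Sámi target); the pair layout is an assumption.
SentencePair = Tuple[str, str]


def backtranslate(
    monolingual_target: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[SentencePair]:
    """Pair each genuine target sentence with a machine-generated source."""
    pairs: List[SentencePair] = []
    for target in monolingual_target:
        synthetic_source = reverse_translate(target)
        # Skip sentences for which the reverse system produced no output.
        if synthetic_source.strip():
            pairs.append((synthetic_source, target))
    return pairs


def build_training_set(
    parallel: Sequence[SentencePair],
    monolingual_target: Iterable[str],
    reverse_systems: Sequence[Callable[[str], str]],
) -> List[SentencePair]:
    """Combine real parallel data with synthetic pairs from each reverse system."""
    targets = list(monolingual_target)  # allow several passes over the corpus
    data = list(parallel)
    for system in reverse_systems:
        data.extend(backtranslate(targets, system))
    return data
```

Passing one callable in `reverse_systems` corresponds to using a single backtranslation source, and passing two corresponds to the combined setting mentioned in the abstract. In both cases the target side of every synthetic pair is genuine monolingual text, and only the source side is machine-generated.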
Original language: English
Title of host publication: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Editors: Simon Dobnik, Lilja Øvrelid
Number of pages: 6
Place of publication: Linköping
Publisher: Linköping University Electronic Press
Publication date: May 2021
Pages: 351-356
ISBN (electronic): 978-91-7929-614-8
Status: Published - May 2021
MoE publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - [Online event], Reykjavik, Iceland
Duration: 31 May 2021 to 2 June 2021
Conference number: 23
https://nodalida2021.github.io/index.html

Publication series

Name: Linköping Electronic Conference Proceedings
Publisher: Linköping University Electronic Press
Number: 78
ISSN (print): 1650-3686
ISSN (electronic): 1650-3740
Name: NEALT Proceedings Series
Publisher: University of Tartu
Number: 45
ISSN (print): 1736-8197
ISSN (electronic): 1736-6305

Fields of science

  • 6121 Languages
  • 113 Computer and information sciences
