Abstract
We consider a low-resource translation task from Finnish into Northern Sámi. Collecting all available parallel data between the languages, we obtain around 30,000 sentence pairs. However, there exists a significantly larger monolingual Northern Sámi corpus, as well as a rule-based machine translation (RBMT) system between the languages. To make the best use of the monolingual data in a neural machine translation (NMT) system, we use the backtranslation approach to create synthetic parallel data from it using both NMT and RBMT systems. Evaluating the results on an in-domain test set and a small out-of-domain set, we find that RBMT backtranslation clearly outperforms NMT backtranslation on the out-of-domain test set, and also slightly on the in-domain data, for which the NMT backtranslation model achieved clearly better BLEU scores than the RBMT system. In addition, combining both backtranslated data sets improves on the RBMT approach only for the in-domain test set. This suggests that the RBMT system provides general-domain knowledge that cannot be found in the relatively small parallel training data.
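The data construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `backtranslate` and `build_training_data` are hypothetical, the two reverse systems are toy stand-ins for real Northern Sámi to Finnish NMT and RBMT systems, and the sentences are placeholders rather than real data.

```python
from typing import Callable, List, Sequence, Tuple

# (Finnish source, Northern Sámi target)
SentencePair = Tuple[str, str]


def backtranslate(
    monolingual_target: Sequence[str],
    reverse_translate: Callable[[str], str],
) -> List[SentencePair]:
    """Pair each genuine target sentence with a machine-made source sentence.

    `reverse_translate` is assumed to map a target-language sentence back
    into the source language (hypothetical stand-in for NMT or RBMT).
    """
    return [(reverse_translate(t), t) for t in monolingual_target]


def build_training_data(
    parallel: Sequence[SentencePair],
    monolingual_target: Sequence[str],
    reverse_systems: Sequence[Callable[[str], str]],
) -> List[SentencePair]:
    """Concatenate real parallel data with one synthetic set per reverse system."""
    data = list(parallel)
    for system in reverse_systems:
        data.extend(backtranslate(monolingual_target, system))
    return data


if __name__ == "__main__":
    # Toy stand-ins: real systems would produce Finnish text here.
    toy_nmt = lambda s: "nmt-fi(" + s + ")"
    toy_rbmt = lambda s: "rbmt-fi(" + s + ")"

    real_pairs = [("fi sentence 0", "sme sentence 0")]
    mono_sme = ["sme sentence 1", "sme sentence 2"]

    combined = build_training_data(real_pairs, mono_sme, [toy_nmt, toy_rbmt])
    for src, tgt in combined:
        print(src, "|||", tgt)
```

In this sketch the target side of every synthetic pair is a genuine monolingual sentence; only the source side is machine-generated. Passing one or both reverse systems corresponds to the NMT-only, RBMT-only, and combined settings that the abstract compares.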
Original language | English |
---|---|
Title of host publication | Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) |
Editors | Simon Dobnik, Lilja Øvrelid |
Number of pages | 6 |
Place of Publication | Linköping |
Publisher | Linköping University Electronic Press |
Publication date | May 2021 |
Pages | 351-356 |
ISBN (Electronic) | 978-91-7929-614-8 |
Publication status | Published - May 2021 |
MoE publication type | A4 Article in conference proceedings |
Event | Nordic Conference on Computational Linguistics - [Online event], Reykjavik, Iceland; Duration: 31 May 2021 → 2 Jun 2021; Conference number: 23; https://nodalida2021.github.io/index.html |
Publication series
Name | Linköping Electronic Conference Proceedings |
---|---|
Publisher | Linköping University Electronic Press |
Number | 78 |
ISSN (Print) | 1650-3686 |
ISSN (Electronic) | 1650-3740 |
Name | NEALT Proceedings Series |
---|---|
Publisher | University of Tartu |
Number | 45 |
ISSN (Print) | 1736-8197 |
ISSN (Electronic) | 1736-6305 |
Fields of Science
- 6121 Languages
- 113 Computer and information sciences
- FoTran: Found in Translation - Natural Language Understanding with Cross-Lingual Grounding
Tiedemann, J., Celikkanat, H., Raganato, A., Silfverberg, M., Sulubacak, U., Vazquez, R., Apidianaki, M., Aulamo, M., Boggia, M., Celikkanat, H., De Gibert Bonet, O., Grönroos, S., Mickus, T., Raganato, A., Scherrer, Y., Silfverberg, M., Sjöblom, E. I., Talman, A., Vazquez, R., Virpioja, S. P., Yli-Jyrä, A. & Zosa, E.
01/09/2018 → 29/02/2024
Project: EU Horizon 2020: European Research Council: Consolidator Grant (H2020-ERC-COG)
- OPUS-MT: Open Translation Models, Tools and Services
Aulamo, M., Nieminen, T. J., Hardwick, S. & Tiedemann, J.
01/08/2020 → 31/08/2021
Project: Research project