A Large-Scale Study of Machine Translation in Turkic Languages

Jamshidbek Mirzakhalov, Anoop Babu, Duygu Ataman, Sherzod Kariev, Francis M. Tyers, Otabek Abduraufov, Mammad Hajili, Sardana Ivanova, Abror Khaytbaev, Antonio Laverghetta Jr., Bekhzodbek Moydinboyev, Esra Onal, Shaxnoza Pulatova, Ahsan Wahab, Orhan Firat, Sriram Chellappan

Research output: Chapter in book/report/conference proceeding, Conference contribution, Scientific, peer-reviewed

Abstract

Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including: i) a large parallel corpus covering 22 Turkic languages, consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains, and iv) human evaluation scores. All models, scripts, and data will be released to the public.
Original language: English
Title of host publication: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Publisher: The Association for Computational Linguistics
Publication date: Nov 2021
Pages: 5876-5890
ISBN (electronic): 978-1-955917-09-4
Status: Published - Nov 2021
MoEC publication type: A4 Article in conference proceedings
Event: Conference on Empirical Methods in Natural Language Processing - [Online conference], Punta Cana, Dominican Republic
Duration: 7 Nov 2021 to 11 Nov 2021
https://2021.emnlp.org

Fields of Science

  • 113 Computer and information sciences
