The Development of a Comprehensive Data Set for Systematic Studies of Machine Translation

Research output: Chapter in Book/Report/Conference proceeding; Chapter; Scientific; Peer-reviewed


This paper presents our ongoing efforts to develop a comprehensive
data set and benchmark for machine translation beyond high-resource
languages. The current release includes 500 GB of compressed parallel
data for almost 3,000 language pairs covering over 500 languages and
language variants. We present the structure of the data set and
demonstrate its use for systematic studies based on baseline experiments
with multilingual neural machine translation between Uralic languages
and other language groups. Our initial results show that effective
multilingual translation models can be trained on skewed training data,
but they also highlight the shortcomings in low-resource settings and
the difficulty of obtaining sufficient information through
straightforward transfer from related languages.
Title of host publication: Multilingual Facilitation
Editors: Mika Hämäläinen, Niko Partanen, Khalid Alnajjar
Number of pages: 15
Publisher: University of Helsinki
ISBN (print): 979-871-33-6227-0
ISBN (electronic): 978-951-51-5025-7
Status: Published - 2021
MoE publication type: A3 Part of a book or another research book


  • 113 Computer and information sciences
  • 6121 Languages
