Document-level Machine Translation Benchmark

Dataset

Description

This release contains data sets for experiments with document-level machine translation. The data sets have been used in previous studies and are provided here for replicability and comparison with other systems. They are taken from the English-German news translation task at WMT 2019 and from the English-German bitext in the OpenSubtitles v2016 collection from OPUS. All data sets are sentence-aligned: each line on one side of the parallel corpus corresponds to the same-numbered line on the other side. Document boundaries are marked with empty lines on both sides of the parallel corpus.
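
The layout described above (line-aligned sentences, with an empty line on both sides at each document boundary) can be read with a short loop. The following is a minimal sketch in Python, not part of the release itself: the function name `iter_documents` is hypothetical, and treating a line that is empty on only one side as an error is an assumption about the data rather than something the release states.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, List, Tuple

SentencePair = Tuple[str, str]


def iter_documents(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[List[SentencePair]]:
    """Group line-aligned sentence pairs into documents.

    A document ends wherever both sides contain an empty line.
    """
    doc: List[SentencePair] = []
    pairs = zip_longest(src_lines, tgt_lines)
    for line_no, (src, tgt) in enumerate(pairs, start=1):
        # zip_longest pads the shorter side with None, which exposes
        # files of unequal length.
        if src is None or tgt is None:
            raise ValueError(f"sides differ in length at line {line_no}")
        src, tgt = src.strip(), tgt.strip()
        if not src and not tgt:
            # Empty on both sides: document boundary.
            if doc:
                yield doc
                doc = []
        elif not src or not tgt:
            # Assumption: a one-sided empty line signals misalignment.
            raise ValueError(f"one-sided empty line at line {line_no}")
        else:
            doc.append((src, tgt))
    if doc:
        # Last document when the files do not end with an empty line.
        yield doc


# Small in-memory example with two documents.
english = ["Hello.\n", "How are you?\n", "\n", "Good morning.\n"]
german = ["Hallo.\n", "Wie geht es dir?\n", "\n", "Guten Morgen.\n"]
documents = list(iter_documents(english, german))
```

For files on disk, the same function accepts two open file handles, for example `iter_documents(open("train.en", encoding="utf-8"), open("train.de", encoding="utf-8"))`, where the file names are placeholders and not the names used in the release.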
Date made available: 1 Nov 2019
Publisher: University of Helsinki
Date of data production: 1 Jan 2017 - 1 Nov 2019
