This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we applied a series of filters individually to remove low-quality sentences. Then, we scored each sentence by weighting the filter outputs as features in a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
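The two-step strategy summarised above can be illustrated with a minimal sketch. It is not the authors' system: the three filters, their thresholds, and the weights and bias below are all hypothetical placeholders chosen for illustration, and the logistic function stands in for whatever classification model is actually trained. In practice the weights would be learned from labelled data rather than set by hand.

```python
import math
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def length_ratio_ok(pair: Pair, max_ratio: float = 3.0) -> float:
    """Hypothetical filter: 1.0 if the token-length ratio is at most max_ratio."""
    src_len = len(pair[0].split())
    tgt_len = len(pair[1].split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return 1.0 if ratio <= max_ratio else 0.0


def not_identical(pair: Pair) -> float:
    """Hypothetical filter: 0.0 if the source was simply copied to the target."""
    same = pair[0].strip().lower() == pair[1].strip().lower()
    return 0.0 if same else 1.0


def mostly_alphabetic(pair: Pair, min_share: float = 0.5) -> float:
    """Hypothetical filter: 1.0 if at least min_share of characters are letters."""
    text = pair[0] + pair[1]
    if not text:
        return 0.0
    letters = sum(ch.isalpha() for ch in text)
    return 1.0 if letters / len(text) >= min_share else 0.0


# Step 1: each filter is applied individually and yields one feature.
FILTERS: List[Callable[[Pair], float]] = [
    length_ratio_ok,
    not_identical,
    mostly_alphabetic,
]

# Step 2: assumed, hand-set weights standing in for a trained classifier.
WEIGHTS = [2.0, 1.5, 1.0]
BIAS = -3.0


def features(pair: Pair) -> List[float]:
    """Run every filter on the pair and collect the outputs as features."""
    return [f(pair) for f in FILTERS]


def score(pair: Pair) -> float:
    """Combine the features into a single score in (0, 1) via a logistic function."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(pair)))
    return 1.0 / (1.0 + math.exp(-z))


good = ("This is a small test .", "Tämä on pieni testi .")
copied = ("Hello", "Hello")
print(features(good), score(good))
print(features(copied), score(copied))
```

Keeping each filter as an independent function that returns a single feature means that adapting the sketch to another language pair only requires swapping filters or refitting the weights.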
|Title of host publication||Fourth Conference on Machine Translation : Proceedings of the Conference: Volume 3: Shared Task Papers, Day 2|
|Editors||Ondřej Bojar, Rajen Chatterjee, Christian Federmann, et al.|
|Number of pages||7|
|Place of Publication||Stroudsburg|
|Publisher||The Association for Computational Linguistics|
|Publication date||29 Jul 2019|
|Publication status||Published - 29 Jul 2019|
|MoE publication type||A4 Article in conference proceedings|
|Event||Conference on Machine Translation: WMT19 - Florence, Italy (Duration: 1 Aug 2019 → 2 Aug 2019; Conference number: 4)|
Fields of Science
- 113 Computer and information sciences
- 6121 Languages
Vazquez, R., Sulubacak, U., & Tiedemann, J. (2019). The University of Helsinki submission to the WMT19 Parallel Corpus Filtering Task. In O. Bojar, R. Chatterjee, C. Federmann, et al. (Eds.), Fourth Conference on Machine Translation: Proceedings of the Conference: Volume 3: Shared Task Papers, Day 2 (pp. 294-300). The Association for Computational Linguistics.