The University of Helsinki submission to the WMT19 Parallel Corpus Filtering Task

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
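The two-step strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system from the paper: the three features (`length_ratio`, `is_not_copy`, `alpha_fraction`), the filter thresholds, and the hand-set `WEIGHTS` and `BIAS` are all hypothetical, and a plain logistic function stands in for a trained classification model.

```python
import math
from typing import Callable, List

# Hypothetical features; the abstract does not name the actual filters.
def length_ratio(src: str, tgt: str) -> float:
    """Ratio of the shorter to the longer sentence, in whitespace tokens."""
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)

def is_not_copy(src: str, tgt: str) -> float:
    """1.0 unless the target is just a copy of the source."""
    return 0.0 if src.strip().lower() == tgt.strip().lower() else 1.0

def alpha_fraction(src: str, tgt: str) -> float:
    """Share of characters that are letters or spaces."""
    text = src + tgt
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)

FEATURES: List[Callable[[str, str], float]] = [
    length_ratio, is_not_copy, alpha_fraction,
]

# Placeholder parameters (assumption); a real system would learn these
# from labelled clean and noisy sentence pairs.
WEIGHTS = [2.0, 1.5, 2.5]
BIAS = -4.0

def passes_filters(src: str, tgt: str) -> bool:
    """Step 1: hard filters with illustrative thresholds."""
    return (
        length_ratio(src, tgt) >= 0.3
        and is_not_copy(src, tgt) == 1.0
        and alpha_fraction(src, tgt) >= 0.5
    )

def score_pair(src: str, tgt: str) -> float:
    """Step 2: weight the features into one score in [0, 1)."""
    if not passes_filters(src, tgt):
        return 0.0  # filtered-out pairs get the lowest score
    z = BIAS + sum(w * f(src, tgt) for w, f in zip(WEIGHTS, FEATURES))
    return 1.0 / (1.0 + math.exp(-z))
```

Calling `score_pair` on every sentence pair and sorting by the result gives a ranking from which the top-scoring pairs can be kept. Because the features here are language-independent, the same sketch applies to another language pair without modification.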
Original language: English
Title of host publication: Fourth Conference on Machine Translation: Proceedings of the Conference: Volume 3: Shared Task Papers, Day 2
Editors: Ondřej Bojar, Rajen Chatterjee, Christian Federmann, et al.
Number of pages: 7
Place of Publication: Stroudsburg
Publisher: The Association for Computational Linguistics
Publication date: 29 Jul 2019
Pages: 294-300
ISBN (Electronic): 978-1-950737-27-7
Publication status: Published - 29 Jul 2019
MoE publication type: A4 Article in conference proceedings
Event: Conference on Machine Translation: WMT19 - Florence, Italy
Duration: 1 Aug 2019 → 2 Aug 2019
Conference number: 4

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Vazquez, R., Sulubacak, U., & Tiedemann, J. (2019). The University of Helsinki submission to the WMT19 Parallel Corpus Filtering Task. In O. Bojar, R. Chatterjee, C. Federmann, et al. (Eds.), Fourth Conference on Machine Translation: Proceedings of the Conference: Volume 3: Shared Task Papers, Day 2 (pp. 294-300). The Association for Computational Linguistics.