The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task

Research output: Contribution to journal › Conference article › Scientific › peer-review

Abstract

This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
Original language: English
Journal: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Publication status: Published - 29 Jul 2018
MoE publication type: A4 Article in conference proceedings
Event: Fourth Conference on Machine Translation: WMT19 - Firenze, Italy
Duration: 1 Aug 2019 → 2 Aug 2019
Conference number: 4

Cite this

@article{a0c5bbda78694c77a086191c61fd9119,
title = "The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task",
abstract = "This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.",
author = "Raul Vazquez and Umut Sulubacak and J{\"o}rg Tiedemann",
year = "2018",
month = "7",
day = "29",
language = "English",
journal = "Proceedings of the Annual Meeting of the Association for Computational Linguistics",

}

TY - JOUR

T1 - The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task

AU - Vazquez, Raul

AU - Sulubacak, Umut

AU - Tiedemann, Jörg

PY - 2018/7/29

Y1 - 2018/7/29

N2 - This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.

AB - This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.

M3 - Conference article

JO - Proceedings of the Annual Meeting of the Association for Computational Linguistics

JF - Proceedings of the Annual Meeting of the Association for Computational Linguistics

ER -