The University of Helsinki submission to the WMT19 Parallel Corpus Filtering Task

Research output: Chapter in book/report/conference proceedings › Conference contribution › Scientific › Peer-reviewed

Abstract

This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
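The two-step strategy described in the abstract could be sketched roughly as follows. This is only an illustrative sketch, not the authors' system: the abstract does not name the filters, the features, or the classifier, so the function names (`passes_filters`, `features`, `score`), the length-ratio threshold, and the hand-set `WEIGHTS` and `BIAS` (standing in for a trained classification model) are all assumptions.

```python
import math

def passes_filters(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Step 1 (hypothetical filters): drop obviously bad sentence pairs."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not src_tokens or not tgt_tokens:
        return False  # one side is empty
    if src.strip() == tgt.strip():
        return False  # target is an untranslated copy of the source
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1.0 / max_ratio <= ratio <= max_ratio

def features(src: str, tgt: str) -> list[float]:
    """Step 2a (hypothetical features): numeric signals in [0, 1]."""
    s, t = src.split(), tgt.split()
    length_balance = min(len(s), len(t)) / max(len(s), len(t))
    src_nums = {w for w in s if w.isdigit()}
    tgt_nums = {w for w in t if w.isdigit()}
    union = src_nums | tgt_nums
    # Share of number tokens that appear on both sides (1.0 if none exist).
    number_match = len(src_nums & tgt_nums) / len(union) if union else 1.0
    return [length_balance, number_match]

# Assumed, hand-set parameters standing in for a trained classifier.
WEIGHTS = [2.0, 1.5]
BIAS = -2.0

def score(src: str, tgt: str) -> float:
    """Step 2b: weight the features with a logistic model to score a pair."""
    if not passes_filters(src, tgt):
        return 0.0  # filtered pairs receive the lowest score
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(src, tgt)))
    return 1.0 / (1.0 + math.exp(-z))
```

In a real setting the weights would be learned from labelled clean and noisy sentence pairs rather than set by hand; the sketch only shows how hard filters and a weighted feature score can be combined into one score per sentence pair.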
Original language: English
Title of host publication: Fourth Conference on Machine Translation: Proceedings of the Conference: Volume 3: Shared Task Papers, Day 2
Editors: Ondřej Bojar, Rajen Chatterjee, Christian Federmann, et al.
Number of pages: 7
Place of publication: Stroudsburg
Publisher: Association for Computational Linguistics
Publication date: 29 Jul 2019
Pages: 294-300
ISBN (electronic): 978-1-950737-27-7
Status: Published - 29 Jul 2019
MoE publication type: A4 Article in conference proceedings
Event: Conference on Machine Translation: WMT19 - Florence, Italy
Duration: 1 Aug 2019 to 2 Aug 2019
Conference number: 4

Fields of science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Vazquez, R., Sulubacak, U., & Tiedemann, J. (2019). The University of Helsinki submission to the WMT19 Parallel Corpus Filtering Task. In O. Bojar, R. Chatterjee, C. Federmann, et al. (Eds.), Fourth Conference on Machine Translation: Proceedings of the Conference: Volume 3: Shared Task Papers, Day 2 (pp. 294-300). Stroudsburg: Association for Computational Linguistics.
@inproceedings{a0c5bbda78694c77a086191c61fd9119,
title = "The University of Helsinki submission to the WMT19 Parallel Corpus Filtering Task",
abstract = "This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Raul Vazquez and Umut Sulubacak and J{\"o}rg Tiedemann",
year = "2019",
month = "7",
day = "29",
language = "English",
pages = "294--300",
editor = "Bojar, {Ondřej} and Chatterjee, {Rajen} and Federmann, {Christian} and others",
booktitle = "Fourth Conference on Machine Translation",
publisher = "Association for Computational Linguistics",
address = "United States"
}

