Abstract
This work presents an unsupervised method of selecting filters and threshold values for the OpusFilter parallel corpus cleaning toolbox. The method clusters sentence pairs into noisy and clean categories and uses the features of the noisy cluster center as filtering parameters. Our approach utilizes feature importance analysis to disregard filters that do not differentiate between clean and noisy data. A randomly sampled subset of a given corpus is used for filter selection and ineffective filters are not run for the full corpus. We use a set of automatic evaluation metrics to assess the quality of translation models trained with data filtered by our method and data filtered with OpusFilter’s default parameters. The trained models cover English-German and English-Ukrainian in both directions. The proposed method outperforms the default parameters in all translation directions for almost all evaluation metrics.
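The selection step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation and does not call OpusFilter. It assumes each sentence pair in the random sample has already been scored by every candidate filter, that scores are scaled to [0, 1], and that higher means cleaner. The filter names, the synthetic scores, and the `min_gap` cutoff are hypothetical. In place of the paper's feature importance analysis, the gap between the two cluster centers serves as a simple stand-in for how well a filter separates clean from noisy data.

```python
import random


def two_means(rows, iterations=20):
    """Plain k-means with k=2 on equal-length score vectors."""
    dims = range(len(rows[0]))
    # Deterministic start: the per-feature minimum and maximum vectors.
    centers = [[min(r[d] for r in rows) for d in dims],
               [max(r[d] for r in rows) for d in dims]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign every row to its nearest center (squared Euclidean distance).
        labels = [min((0, 1),
                      key=lambda c: sum((r[d] - centers[c][d]) ** 2 for d in dims))
                  for r in rows]
        # Move each center to the mean of its members.
        for c in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(m[d] for m in members) / len(members)
                              for d in dims]
    return centers, labels


def select_filters(names, rows, min_gap=0.1):
    """Return {filter name: threshold} for filters that separate the clusters."""
    centers, _ = two_means(rows)
    # Assumption: higher scores are cleaner, so the lower-sum center is noisy.
    noisy, clean = sorted(centers, key=sum)
    thresholds = {}
    for name, noisy_value, clean_value in zip(names, noisy, clean):
        # Stand-in for feature importance: keep a filter only if the two
        # cluster centers differ noticeably on its score.
        if abs(clean_value - noisy_value) >= min_gap:
            thresholds[name] = noisy_value  # noisy center value as threshold
    return thresholds


# Hypothetical filter names and synthetic scores for a sampled subset.
names = ["lang_id", "length_ratio", "uninformative"]
rng = random.Random(0)
clean_pairs = [[rng.uniform(0.8, 1.0), rng.uniform(0.7, 0.9), rng.uniform(0.4, 0.6)]
               for _ in range(50)]
noisy_pairs = [[rng.uniform(0.1, 0.4), rng.uniform(0.2, 0.5), rng.uniform(0.4, 0.6)]
               for _ in range(50)]

thresholds = select_filters(names, clean_pairs + noisy_pairs)
print(thresholds)
```

In this toy sample the third score is drawn from the same range for both groups, so its cluster centers nearly coincide and the filter is dropped. Only the remaining filters, with the noisy center values as thresholds, would then be run on the full corpus.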
Original language | English |
---|---|
Title of host publication | Proceedings of the 24th Annual Conference of the European Association for Machine Translation |
Editors | Mary Nurminen, Judith Brenner, Maarit Koponen, et al. |
Number of pages | 8 |
Place of Publication | Geneva |
Publisher | European Association for Machine Translation |
Publication date | Jun 2023 |
Pages | 31-38 |
ISBN (Electronic) | 978-952-03-2947-1 |
Publication status | Published - Jun 2023 |
MoE publication type | A4 Article in conference proceedings |
Event | Annual Conference of The European Association for Machine Translation - Tampere, Finland; Duration: 12 Jun 2023 → 15 Jun 2023; https://events.tuni.fi/eamt23/ |
Fields of Science
- 6121 Languages
- 113 Computer and information sciences
Projects
- High Performance Language Technologies
Tiedemann, J. (Project manager), Aulamo, M. (Participant), De Gibert Bonet, O. (Participant), Grönroos, S.-A. (Participant), Ji, S. (Participant) & Virpioja, S. P. (Participant)
Charles University in Prague Faculty of Science Department of Teaching and Didactics of Biology
01/09/2022 → 31/08/2025
Project: EU Horizon Europe: Innovation actions (HORIZON-IA)