Unsupervised Feature Selection for Effective Parallel Corpus Filtering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

This work presents an unsupervised method of selecting filters and threshold values for the OpusFilter parallel corpus cleaning toolbox. The method clusters sentence pairs into noisy and clean categories and uses the features of the noisy cluster center as filtering parameters. Our approach utilizes feature importance analysis to disregard filters that do not differentiate between clean and noisy data. A randomly sampled subset of a given corpus is used for filter selection and ineffective filters are not run for the full corpus. We use a set of automatic evaluation metrics to assess the quality of translation models trained with data filtered by our method and data filtered with OpusFilter’s default parameters. The trained models cover English-German and English-Ukrainian in both directions. The proposed method outperforms the default parameters in all translation directions for almost all evaluation metrics.
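The selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the OpusFilter API. It assumes every filter yields a score in [0, 1] where higher means cleaner, it uses a tiny hand-written 2-means clustering, and it swaps the paper's feature importance analysis for a simple stand-in (the gap between the two cluster centers per feature). The filter names, the sample scores, and the `min_gap` parameter are all hypothetical.

```python
from statistics import mean


def two_means(rows, iterations=20):
    """Tiny deterministic 2-means over equal-length score vectors."""
    # Seed the centers with the lowest- and highest-scoring rows (by mean score).
    ordered = sorted(rows, key=mean)
    centers = [list(ordered[0]), list(ordered[-1])]
    for _ in range(iterations):
        groups = [[], []]
        for row in rows:
            # Squared Euclidean distance from this row to each center.
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centers]
            groups[dists.index(min(dists))].append(row)
        # Recompute each center as the per-feature mean of its group;
        # an empty group keeps its previous center.
        new_centers = [
            [mean(col) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def select_filters(rows, names, min_gap=0.1):
    """Return {filter name: threshold} taken from the noisy cluster center.

    Filters whose two cluster centers differ by less than min_gap are dropped.
    This gap test is a simple proxy for feature importance analysis.
    """
    # Assumption: the cluster with the lower average score is the noisy one.
    noisy, clean = sorted(two_means(rows), key=mean)
    thresholds = {}
    for name, n, c in zip(names, noisy, clean):
        if abs(c - n) >= min_gap:
            thresholds[name] = n
    return thresholds


# Hypothetical filter names and scores for a small sample of sentence pairs.
names = ["length_ratio", "lang_id", "alphabet_ratio"]
rows = [
    [0.90, 0.95, 0.50],
    [0.80, 0.90, 0.52],
    [0.85, 0.92, 0.48],
    [0.20, 0.30, 0.50],
    [0.30, 0.20, 0.49],
    [0.25, 0.25, 0.51],
]

thresholds = select_filters(rows, names)
print(thresholds)
```

In this toy sample the third filter scores clean and noisy pairs alike, so it is dropped and would not need to be run on the full corpus. The two remaining filters take their thresholds from the noisy cluster center.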
Original language: English
Title of host publication: Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Editors: Mary Nurminen, Judith Brenner, Maarit Koponen, et al.
Number of pages: 8
Place of Publication: Geneva
Publisher: European Association for Machine Translation
Publication date: Jun 2023
Pages: 31-38
ISBN (Electronic): 978-952-03-2947-1
Publication status: Published - Jun 2023
MoE publication type: A4 Article in conference proceedings
Event: Annual Conference of The European Association for Machine Translation - Tampere, Finland
Duration: 12 Jun 2023 to 15 Jun 2023
https://events.tuni.fi/eamt23/

Fields of Science

  • 6121 Languages
  • 113 Computer and information sciences
