Abstract
This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for the identification of potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.
Original language | English |
---|---|
Title of host publication | Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) |
Editors | Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis |
Number of pages | 8 |
Place of Publication | Paris |
Publisher | European Language Resources Association (ELRA) |
Publication date | 17 May 2020 |
Pages | 3782-3789 |
ISBN (Electronic) | 979-10-95546-34-4 |
Publication status | Published - 17 May 2020 |
MoE publication type | A4 Article in conference proceedings |
Event | Language Resources and Evaluation Conference - [LREC 2020 was cancelled] Duration: 11 May 2020 → 16 May 2020 Conference number: 12 https://lrec2020.lrec-conf.org/ |
Bibliographical note
The 12th edition of the Language Resources and Evaluation Conference was cancelled due to the COVID-19 pandemic.
Fields of Science
- 113 Computer and information sciences
- Machine translation
- Corpus
- Multilingual
Projects
FoTran: Found in Translation - Natural Language Understanding with Cross-Lingual Grounding
Tiedemann, J., Celikkanat, H., Raganato, A., Silfverberg, M., Sulubacak, U., Vazquez, R., Apidianaki, M., Aulamo, M., Boggia, M., De Gibert Bonet, O., Grönroos, S., Mickus, T., Scherrer, Y., Sjöblom, E. I., Talman, A., Virpioja, S. P., Yli-Jyrä, A. & Zosa, E.
01/09/2018 → 29/02/2024
Project: EU Horizon 2020: European Research Council: Consolidator Grant (H2020-ERC-COG)