A New Massive Multilingual Dataset for High-Performance Language Technologies

Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

Tutkimustuotos: Artikkeli kirjassa/raportissa/konferenssijulkaisussaKonferenssiartikkeliTieteellinenvertaisarvioitu

Abstrakti

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

Alkuperäiskielienglanti
OtsikkoProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
ToimittajatNicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Sivumäärä13
JulkaisupaikkaParis
KustantajaEuropean Language Resources Association (ELRA)
Julkaisupäivä2024
Sivut1116-1128
ISBN (elektroninen)978-2-493814-10-4
TilaJulkaistu - 2024
OKM-julkaisutyyppiA4 Artikkeli konferenssijulkaisuussa
TapahtumaThe 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) - Torino, Italia
Kesto: 20 toukok. 202425 toukok. 2024

Julkaisusarja

NimiInternational conference on computational linguistics
KustantajaInternational Committee on Computational Linguistics
ISSN (painettu)2951-2093
NimiLREC proceedings
KustantajaLanguage Resources Association (ELRA)
ISSN (elektroninen)2522-2686

Lisätietoja

Publisher Copyright:
© 2024 ELRA Language Resource Association: CC BY-NC 4.0.

Tieteenalat

  • 6121 Kielitieteet
  • 113 Tietojenkäsittely- ja informaatiotieteet

Siteeraa tätä