Abstract
Instruction tuning a large language model with multiple languages can prepare it for multilingual downstream tasks. Nonetheless, it remains unclear whether a handful of languages is sufficient or whether the benefits grow as more are included. By finetuning large multilingual models on 1 to 52 languages, we present a case study on BLOOM to understand three pertinent factors affecting performance: the number of languages, language exposure, and the similarity between training and test languages. Overall we found that 1) expanding language coverage in multilingual instruction tuning proves to be beneficial; 2) accuracy often improves significantly if the test language appears in the instruction mixture; 3) languages' genetic features correlate with cross-lingual transfer more than mere language count, though different languages benefit to varying degrees.
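To make the experimental setup concrete, the sketch below shows one way a multilingual instruction-tuning mixture could be assembled by sampling instruction–response pairs per language and sweeping the number of included languages, in the spirit of the 1-to-52-language study described above. This is an illustrative sketch only, not the authors' pipeline; the language codes, pool contents, field names, and sample sizes are hypothetical placeholders.

```python
# Minimal sketch (illustrative only): building a multilingual
# instruction-tuning mixture from per-language pools of
# instruction-response pairs. All data shown is placeholder content.
import random
from typing import Dict, List


def build_mixture(
    pools: Dict[str, List[dict]],  # language code -> list of {"instruction", "response"}
    languages: List[str],          # subset of languages to include in the mixture
    per_language: int = 1000,      # examples drawn per language
    seed: int = 0,
) -> List[dict]:
    """Sample an equal number of examples from each selected language."""
    rng = random.Random(seed)
    mixture = []
    for lang in languages:
        pool = pools[lang]
        k = min(per_language, len(pool))
        for ex in rng.sample(pool, k):
            mixture.append({**ex, "lang": lang})
    rng.shuffle(mixture)
    return mixture


# Grow the mixture from 1 language up to all available ones, mirroring
# the sweep over language counts described in the abstract.
pools = {
    "en": [{"instruction": "Summarize the paragraph.", "response": "..."}],
    "fi": [{"instruction": "Tiivistä kappale.", "response": "..."}],
    "zh": [{"instruction": "总结这段文字。", "response": "..."}],
}
for n in range(1, len(pools) + 1):
    languages = list(pools)[:n]
    data = build_mixture(pools, languages, per_language=2)
    print(n, "languages ->", len(data), "examples")
```

The resulting mixture would then be used to finetune a multilingual base model (BLOOM in the paper), with evaluation on test languages that are either seen or unseen during instruction tuning.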
Original language | English |
---|---|
Title of host publication | Proceedings of the 31st International Conference on Computational Linguistics |
Editors | Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert |
Number of pages | 7 |
Place of Publication | Stroudsburg |
Publisher | Association for Computational Linguistics (ACL) |
Publication date | 2025 |
Pages | 2575-2581 |
ISBN (Electronic) | 979-8-89176-196-4 |
Publication status | Published - 2025 |
MoE publication type | A4 Article in conference proceedings |
Event | International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates. Duration: 19 Jan 2025 → 24 Jan 2025. Conference number: 31. https://coling2025.org |
Publication series
Name | International Conference on Computational Linguistics |
---|---|
Publisher | Association for Computational Linguistics |
ISSN (Print) | 2951-2093 |
Fields of Science
- 6121 Languages
- 113 Computer and information sciences