Abstract
Bosnian, Croatian, Montenegrin, and Serbian are the official standard linguistic varieties in Bosnia and Herzegovina, Croatia, Montenegro, and Serbia, respectively. When these four countries were part of the former Yugoslavia, the varieties were considered to share a single linguistic standard. After the individual countries were established, the national standards emerged. Today, a central question about these varieties remains the following: How different are they from each other? How hard is it to distinguish them? While this has been addressed in NLP as part of the task on Distinguishing Between Similar Languages (DSL), little is known about human performance, making it difficult to contextualize system results. We tackle this question by reannotating the existing BCMS dataset for DSL with annotators from all target regions. We release a new gold standard, replacing the original single-annotator, single-label annotation with a multi-annotator, multi-label one, thus improving annotation reliability and explicitly coding the existence of ambiguous instances. We reassess a previously proposed DSL system on the new gold standard and establish the human upper bound on the task. Finally, we identify sources of annotation difficulties and provide linguistic insights into the BCMS dialect continuum, with multiple indicators highlighting an intermediate position of Bosnian and Montenegrin.
Original language | English |
---|---|
Title of host publication | Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024 |
Editors | Simone Balloccu, Anya Belz, Rudali Huidrom, Ehud Reiter, Joao Sedoc, Craig Thomson |
Number of pages | 11 |
Place of publication | Paris |
Publisher | European Language Resources Association (ELRA) |
Publication date | 2024 |
Pages | 36-46 |
ISBN (electronic) | 978-2-493814-41-8 |
Status | Published - 2024 |
MoE publication type | A4 Article in conference proceedings |
Event | Workshop on Human Evaluation of NLP Systems - Torino, Italy; Duration: 21 May 2024 → 21 May 2024; Conference number: 4 |
Publication series
Name | International conference on computational linguistics |
---|---|
Publisher | International Committee on Computational Linguistics |
ISSN (print) | 2951-2093 |
Name | LREC proceedings |
---|---|
Publisher | European Language Resources Association (ELRA) |
ISSN (electronic) | 2522-2686 |
Additional information
Publisher Copyright: © 2024 European Language Resources Association (ELRA).
Fields of science
- 6121 Languages
- 113 Computer and information sciences
Projects
- 1 Active
- CorCoDial: Corpus-based computational dialectology: leveraging machine translation techniques for finding, visualizing, and interpreting dialect areas and dialect features
  Scherrer, Y. (Project manager), Tiedemann, J. (Project manager), Mickus, T. (Participant), Miletic Haddad, A. (Participant), Psaltaki, E. (Participant), Roemling, D. (Participant), Siewert, J. (Participant) & Siewert, J. (Participant)
  Suomen Akatemia Projektilaskutus
  01/09/2021 → 31/08/2025
  Project: Suomen Akatemia: Academy Project