A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian

Aleksandra Miletić, Filip Miletić

Research output: Article in book/report/conference proceedings › Conference article › Scientific › Peer-reviewed

Abstract

Bosnian, Croatian, Montenegrin and Serbian are the official standard linguistic varieties in Bosnia and Herzegovina, Croatia, Montenegro, and Serbia, respectively. When these four countries were part of the former Yugoslavia, the varieties were considered to share a single linguistic standard. After the individual countries were established, the national standards emerged. Today, a central question about these varieties remains the following: How different are they from each other? How hard is it to distinguish them? While this has been addressed in NLP as part of the task on Distinguishing Between Similar Languages (DSL), little is known about human performance, making it difficult to contextualize system results. We tackle this question by reannotating the existing BCMS dataset for DSL with annotators from all target regions. We release a new gold standard, replacing the original single-annotator, single-label annotation by a multi-annotator, multi-label one, thus improving annotation reliability and explicitly coding the existence of ambiguous instances. We reassess a previously proposed DSL system on the new gold standard and establish the human upper bound on the task. Finally, we identify sources of annotation difficulties and provide linguistic insights into the BCMS dialect continuum, with multiple indicators highlighting an intermediate position of Bosnian and Montenegrin.

Original language: English
Title: Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Editors: Simone Balloccu, Anya Belz, Rudali Huidrom, Ehud Reiter, Joao Sedoc, Craig Thomson
Number of pages: 11
Place of publication: Paris
Publisher: European Language Resources Association (ELRA)
Publication date: 2024
Pages: 36-46
ISBN (electronic): 978-2-493814-41-8
Status: Published - 2024
MoEC publication type: A4 Article in conference proceedings
Event: Workshop on Human Evaluation of NLP Systems - Torino, Italy
Duration: 21 May 2024 to 21 May 2024
Conference number: 4

Publication series

Name: International conference on computational linguistics
Publisher: International Committee on Computational Linguistics
ISSN (print): 2951-2093
Name: LREC proceedings
Publisher: European Language Resources Association (ELRA)
ISSN (electronic): 2522-2686

Additional information

Publisher Copyright:
© 2024 European Language Resources Association (ELRA).

Fields of science

  • 6121 Languages
  • 113 Computer and information sciences
