Description
Bosnian, Croatian, Montenegrin and Serbian are the official standard linguistic varieties in their respective countries: Bosnia and Herzegovina (3.3M inhabitants), Croatia (3.9M), Montenegro (0.6M) and Serbia (6.7M). When the four countries were part of the former Yugoslavia, these varieties were considered to belong to the same language, commonly referred to as Serbo-Croatian or Croato-Serbian. After the civil wars of the 1990s and the establishment of individual countries, national linguistic standards also emerged. Thirty years later, the effect of these extralinguistic events on the regional dialectal continuum remains underexplored.
Some indirect observations on this issue have resulted from work in Natural Language Processing, in particular on the task of Distinguishing Between Similar Languages (Zampieri et al. 2014, 2015, 2017; Malmasi et al. 2016). However, sociolinguistically motivated work on this topic is scarce. In one of the rare empirical studies available, Ljubešić et al. (2018) conducted corpus-based dialectometric research examining the geographical distribution of 16 linguistic variables reflecting phonological, morpho-syntactic and lexical phenomena. The results indicate that Croatian and Serbian occupy opposite ends of the continuum, whereas Bosnian and Montenegrin lean towards one or the other depending on the variable. Note that the analyzed variables do not have the same geographical spread or the same frequency. The variable that Ljubešić et al. (2018) encountered most frequently in the corpus is the opposition between ekavian and ijekavian forms (e.g. dete in ekavian vs dijete in ijekavian, meaning 'child'). This opposition makes it possible to distinguish Serbian from the other three varieties, since Serbian is the only one of the four national standards based on the ekavian pronunciation. This asymmetry can be expected to make identifying some varieties more difficult than others.
In this work, we present a questionnaire-based study (Dollinger 2016) on Bosnian, Croatian, Montenegrin and Serbian. Data was elicited from 33 participants from all four target countries (25 female, 8 male; mean age: 44.6 years). The questionnaire was framed as an annotation task that consisted of guessing the country of the author of a given text. Annotation instances were based on an existing collection of social media texts from the four countries (Rupnik et al. 2023). Data collection ran from June 2023 to September 2023.
Based on the amount of text read and the duration of reading before a decision is reached, we show that identifying Serbian and Croatian is easier than identifying Bosnian and Montenegrin for annotators from all four countries. The annotators were also asked to systematically highlight spans of text that helped them decide. This yielded a total of approximately 5,000 highlighted text spans, which we analyze further. We observe different usage patterns for indicators based on linguistic phenomena vs world knowledge. We also establish regularities with respect to the perceived usefulness of different linguistic variables for identifying a given variety. To the best of our knowledge, this is the first perception study on these four linguistic varieties.
Period | 10 Jul 2024 |
---|---|
Event title | 12th International Conference on Language Variation in Europe – ICLaVE\|12 |
Event type | Conference |
Location | Vienna, Austria |
Degree of recognition | International |