Since 2015, Avanzi et al. (2016) have launched several online surveys on regionalisms in European French (France, Belgium and Switzerland). Here, we investigate the use of data from these surveys for automatic speaker geolocalisation, both as a playful incentive to attract participants for further inquiries and as a scientific method for analysing the data already collected. Following Leemann et al. (2016), the problem of automatic speaker geolocalisation consists in predicting the dialect or regiolect of a speaker (typically one who has not participated in the survey) by asking a set of questions (typically a small subset of the surveyed variables).

Given our motivations, the success of a speaker geolocalisation method should be assessed not only by the percentage of correct answers, but also by its ability to entertain and surprise potential participants. Three parameters influence this success:

- The number and type of questions to be asked. No more than 20 questions should be asked, so that the quiz stays within participants' attention span.
- The number and type of areas to predict. The areas should reflect the reduced amount of regional variation in current French, but areas that are too large could make the problem look trivial and uninteresting.
- The accuracy of the predictions. The method should obviously make predictions that are as good as possible, but we estimate that about two thirds of the predictions must be correct to sustain participant involvement.

We present a simulation framework that allows us to evaluate different parameter settings, using solely the survey data in a leave-one-out fashion. In a first set of two experiments, we start by determining an areal partition based on political or on linguistic criteria (e.g. hierarchical clustering), and then apply the shibboleth detection algorithm of Prokić et al. (2012) to find the most characteristic set of questions for each area.
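To make the leave-one-out evaluation mentioned above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the framework used in this study: the classifier (smoothed answer frequencies per area), the function names and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train(respondents):
    """Count, per area and question, how often each answer variant occurs."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for area, answers in respondents:
        for question, variant in answers.items():
            counts[area][question][variant] += 1
    return counts


def predict(counts, answers, smoothing=1.0):
    """Return the area whose answer profile best matches `answers`."""
    def score(area):
        total = 0.0
        for question, variant in answers.items():
            seen = counts[area][question]
            # Assumption: additive smoothing with two variants per question.
            total += log((seen[variant] + smoothing)
                         / (sum(seen.values()) + 2 * smoothing))
        return total
    return max(sorted(counts), key=score)


def leave_one_out_accuracy(respondents):
    """Predict each respondent from all the others; return the hit rate."""
    hits = 0
    for i, (area, answers) in enumerate(respondents):
        training = respondents[:i] + respondents[i + 1:]
        hits += predict(train(training), answers) == area
    return hits / len(respondents)


# Hypothetical toy survey: (area, {question: chosen variant}).
SURVEY = [
    ("north", {"q1": "a", "q2": "x"}),
    ("north", {"q1": "a", "q2": "x"}),
    ("north", {"q1": "a", "q2": "y"}),
    ("south", {"q1": "b", "q2": "y"}),
    ("south", {"q1": "b", "q2": "y"}),
    ("south", {"q1": "b", "q2": "x"}),
]

print(leave_one_out_accuracy(SURVEY))  # 1.0
```

Each respondent is held out once and predicted from the remaining ones, so the survey data alone yield an accuracy estimate for any choice of questions and areas.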
In a second experiment, we do not fix the areal partition in advance, but keep the original localisation information (i.e., départements, provinces or cantons). To find the optimal set of questions, we use recursive feature elimination (Guyon et al. 2002). Once the questions are determined, we dynamically expand the predictions to n-best areas or neighbours. With both methods, we reach the desired accuracy threshold with comparable area sizes and numbers of variables (about 20). However, the variables selected by the second approach intuitively correspond better to the variation patterns observed in the original survey data.
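As an illustration only, the two steps of the second approach could be sketched as follows. The sketch assumes binary labels, numerically coded answers and a plain perceptron as the weight-producing model, whereas Guyon et al. (2002) rank features with support vector machine weights. The function names and toy data are hypothetical, and the abstract does not specify how the n-best expansion is actually computed.

```python
def train_perceptron(rows, labels, epochs=10):
    """Fit a plain perceptron and return its weight vector."""
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            activation = sum(w * v for w, v in zip(weights, x))
            if y * activation <= 0:  # misclassified or on the boundary
                weights = [w + y * v for w, v in zip(weights, x)]
    return weights


def recursive_feature_elimination(rows, labels, n_keep):
    """Return the indices of the n_keep features that survive elimination."""
    kept = list(range(len(rows[0])))
    while len(kept) > n_keep:
        reduced = [[row[j] for j in kept] for row in rows]
        weights = train_perceptron(reduced, labels)
        # Drop the feature with the smallest absolute weight
        # (the first such feature when several are tied).
        weakest = min(range(len(kept)), key=lambda k: abs(weights[k]))
        del kept[weakest]
    return kept


def n_best(scores, n):
    """Return the n highest-scoring areas, best first (ties broken by name)."""
    return sorted(scores, key=lambda area: (-scores[area], area))[:n]


# Hypothetical toy data: feature 0 separates the two classes perfectly,
# features 1 and 2 carry no information about the label.
ROWS = [[1, 1, 0], [-1, 1, 0], [1, 0, 1], [-1, 0, 1]]
LABELS = [1, -1, 1, -1]

print(recursive_feature_elimination(ROWS, LABELS, 1))  # [0]
print(n_best({"A": 0.5, "B": 0.4, "C": 0.1}, 2))       # ['A', 'B']
```

The model is retrained after every removal, which is what distinguishes recursive elimination from a single ranking pass: the weights of the remaining features may change once a correlated feature is gone.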
|Status||Published - 9 June 2017|
|Event||International Conference on Language Variation in Europe - Malaga, Spain|
Duration: 6 June 2017 → 9 June 2017
- 6121 Languages