Towards automatic geolocalisation of speakers of European French

Yves Scherrer, Jean-Philippe Goldman

Research output: Conference materials › Abstract

Abstract

Starting in 2015, Avanzi et al. (2016) have launched several online surveys to inquire about regionalisms in European French (France, Belgium and Switzerland). Here, we investigate the use of data from these surveys for automatic speaker geolocalisation, both as a playful incentive to attract participants for further inquiries and as a scientific analysis method of the already collected data. Following Leemann et al. (2016), the problem of automatic speaker geolocalisation consists in predicting the dialect/regiolect of a speaker (typically, a speaker that has not participated in the survey) by asking a set of questions (typically, a small subset of the surveyed variables). Given our motivations, the success of a speaker geolocalisation method should not only be assessed by the percentage of correct answers, but also by its ability to entertain and surprise potential participants. Three parameters influence this success:
  • The number and type of questions to be asked. No more than 20 questions should be asked to keep the attention span short.
  • The number and type of the areas to predict. The areas should reflect the reduced amount of regional variation in current French, but too large areas could make the problem look trivial and uninteresting.
  • The accuracy of the predictions. The method obviously should make as good predictions as possible, but we estimate that about 2/3 of correct predictions are required for a sustainable level of participant involvement.
We present a simulation framework that allows us to evaluate different parameter settings, using solely the survey data in a leave-one-out fashion. In a first set of two experiments, we start by determining an areal partition based on political or on linguistic criteria (e.g. hierarchical clustering), and then apply the shibboleth detection algorithm of Prokić et al. (2012) to find the most characteristic set of questions for each area.
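The leave-one-out evaluation mentioned above can be illustrated with a minimal sketch. It is not the authors' simulation framework: the classifier (smoothed answer frequencies per area), the function names, and the toy data are all assumptions made for illustration. Each participant is held out in turn, a model is built from the remaining participants, and the held-out participant's area is predicted from their answers to a chosen subset of questions.

```python
import math
from collections import Counter, defaultdict


def train(participants, questions):
    """Count, per area and question, how often each answer variant occurs."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for area, answers in participants:
        for q in questions:
            if q in answers:
                counts[area][q][answers[q]] += 1
    return counts


def predict_area(counts, answers, questions, alpha=1.0):
    """Return the area with the highest smoothed log-likelihood (assumed model)."""
    scores = {}
    for area, per_question in counts.items():
        score = 0.0
        for q in questions:
            if q not in answers:
                continue
            c = per_question[q]
            total = sum(c.values())
            n_variants = len(c) + 1  # +1 leaves room for an unseen variant
            score += math.log((c[answers[q]] + alpha) / (total + alpha * n_variants))
        scores[area] = score
    return max(scores, key=scores.get)


def leave_one_out_accuracy(participants, questions):
    """Hold out each participant once and predict their area from the rest."""
    hits = 0
    for i, (true_area, answers) in enumerate(participants):
        rest = participants[:i] + participants[i + 1:]
        hits += predict_area(train(rest, questions), answers, questions) == true_area
    return hits / len(participants)


# Hypothetical toy survey: (area, {question: answer variant}).
participants = [
    ("A", {"q1": "x", "q2": "u"}),
    ("A", {"q1": "x", "q2": "u"}),
    ("A", {"q1": "x", "q2": "v"}),
    ("B", {"q1": "y", "q2": "v"}),
    ("B", {"q1": "y", "q2": "v"}),
    ("B", {"q1": "y", "q2": "u"}),
]

print(leave_one_out_accuracy(participants, ["q1"]))  # q1 separates the areas cleanly
print(leave_one_out_accuracy(participants, ["q2"]))  # q2 is a weaker cue
```

Comparing the two calls shows how the choice of questions changes the simulated accuracy, which is the kind of parameter comparison the framework is meant to support.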
In a second experiment, we do not fix the areal partition in advance, but keep the original localisation information (i.e., départements, provinces or cantons). In order to find the optimal set of questions, we use recursive feature elimination (Guyon et al. 2002). Once the questions are determined, we dynamically expand the predictions to n-best areas or neighbors. With both methods, we reach the desired accuracy threshold with comparable area sizes and number of variables (about 20). However, the variables selected by the second approach intuitively correspond better to the variation patterns observed in the original survey data.
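The question selection and n-best expansion of the second experiment can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: recursive feature elimination in the sense of Guyon et al. (2002) ranks features by the weights of a trained model, whereas the sketch below swaps in a simpler backward elimination that drops the question whose removal costs the least accuracy. The `evaluate` callback, the toy weights, and the area scores are hypothetical.

```python
def backward_elimination(questions, evaluate, keep):
    """Repeatedly drop the question whose removal hurts the score least."""
    selected = list(questions)
    while len(selected) > keep:
        # Try removing each remaining question; keep the best reduced set.
        candidates = [[q for q in selected if q != drop] for drop in selected]
        selected = max(candidates, key=evaluate)
    return selected


def n_best_areas(area_scores, n):
    """Expand a single prediction to the n highest-scoring areas."""
    return sorted(area_scores, key=area_scores.get, reverse=True)[:n]


# Hypothetical stand-in for a leave-one-out accuracy estimate:
# each question contributes a fixed amount of "informativeness".
toy_weights = {"q1": 0.30, "q2": 0.05, "q3": 0.25, "q4": 0.01}


def toy_accuracy(question_subset):
    return sum(toy_weights[q] for q in question_subset)


print(backward_elimination(list(toy_weights), toy_accuracy, keep=2))

# Hypothetical per-area scores for one speaker (higher is better).
toy_scores = {"Vaud": -1.2, "Genève": -1.5, "Jura": -4.0}
print(n_best_areas(toy_scores, 2))
```

In practice `evaluate` would be an accuracy estimate computed on the survey data, and a prediction would count as correct if the speaker's true area is among the n returned areas.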
Original language: English
Publication status: Published - 9 Jun 2017
Event: International Conference on Language Variation in Europe - Malaga, Spain
Duration: 6 Jun 2017 → 9 Jun 2017
Conference number: 9

Fields of Science

  • 6121 Languages

Projects

DFS2: Donnez votre français à la science!

Scherrer, Y., Glikman, J., Boula de Mareüil, P., Benzitoun, C., Avanzi, M., Bernhard, D. & Gerbet, R.

01/09/2017 → 31/08/2018

Project: Research project

DFS: Donnez votre français à la science!

Scherrer, Y., Glikman, J., Avanzi, M., Benzitoun, C., Boula de Mareüil, P., Goldman, J. & Thibault, A.

01/07/2016 → 31/03/2017

Project: Research project

Cite this

Scherrer, Y., & Goldman, J-P. (2017). Towards automatic geolocalisation of speakers of European French. Abstract from International Conference on Language Variation in Europe, Malaga, Spain.