Crowdsourcing Regional Variation Data and Automatic Geolocalisation of Speakers of European French

Jean-Philippe Goldman, Yves Scherrer, Julie Glikman, Mathieu Avanzi, Christophe Benzitoun, Philippe Boula de Mareüil

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Scientific > peer-review

Abstract

We present the crowdsourcing platform Donnez Votre Français à la Science (DFS, or “Give your French to Science”), which aims to collect linguistic data and document language use, with a special focus on regional variation in European French. The activities not only gather data that is useful for scientific studies, but they also provide feedback to the general public; this is important in order to reward participants, to encourage them to follow future surveys, and to foster interaction with the scientific community. The two main activities described here are 1) a linguistic survey on lexical variation with immediate feedback and 2) a speaker geolocalisation system; i.e., a quiz that guesses the linguistic origin of the participant by comparing their answers with previously gathered linguistic data. For the geolocalisation activity, we set up a simulation framework to optimise predictions. Three classification algorithms are compared: the first one uses clustering and shibboleth detection, whereas the other two rely on feature elimination techniques with Support Vector Machines and Maximum Entropy models as underlying base classifiers. The best-performing system uses a selection of 17 questions and reaches a localisation accuracy of 66%, extending the prediction from the one-best area (one among 109 base areas) to its first-order and second-order neighbouring areas.
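The neighbour-extended accuracy mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not spell out the scoring protocol, so this assumes that a prediction counts as correct when the participant's true area lies within a given number of adjacency steps of the predicted area (0 = one-best area only, 1 = first-order neighbours, 2 = second-order neighbours). The function names, the toy adjacency map, and the toy predictions are all hypothetical and are not taken from the paper.

```python
from collections import deque


def areas_within(adjacency, start, max_order):
    """Return the set of areas reachable from `start` in at most `max_order` hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        area, dist = queue.popleft()
        if dist == max_order:
            continue  # do not expand beyond the requested neighbourhood order
        for neighbour in adjacency.get(area, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return seen


def localisation_accuracy(adjacency, predicted, actual, max_order):
    """Share of speakers whose true area is within `max_order` hops of the prediction."""
    hits = sum(
        true_area in areas_within(adjacency, pred_area, max_order)
        for pred_area, true_area in zip(predicted, actual)
    )
    return hits / len(actual)


# Hypothetical toy data: five areas in a chain A-B-C-D-E (the paper uses 109 base areas).
ADJACENCY = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
PREDICTED = ["A", "B", "C", "A"]  # one-best area guessed for each speaker
ACTUAL = ["A", "C", "E", "E"]     # area each speaker actually comes from

if __name__ == "__main__":
    for order in (0, 1, 2):
        score = localisation_accuracy(ADJACENCY, PREDICTED, ACTUAL, order)
        print(f"order {order}: accuracy {score:.2f}")
```

Under this assumed scoring, accuracy can only stay equal or rise as the neighbourhood order grows, which is why a single figure such as 66% needs to be read together with the neighbourhood order it was measured at.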
Original language: English
Title of host publication: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Editors: Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga
Number of pages: 7
Place of Publication: Paris
Publisher: European Language Resources Association (ELRA)
Publication date: May 2018
Pages: 3336-3342
ISBN (Electronic): 979-10-95546-00-9
Publication status: Published - May 2018
MoE publication type: A4 Article in conference proceedings
Event: International Conference on Language Resources and Evaluation - Phoenix Seagaia Resort, Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018
Conference number: 11
http://lrec2018.lrec-conf.org/en/

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages
  • language variation
  • regionalism
  • crowdsourcing
  • geolocalisation
  • linguistic geography
  • cartography

Cite this

Goldman, J-P., Scherrer, Y., Glikman, J., Avanzi, M., Benzitoun, C., & Boula de Mareüil, P. (2018). Crowdsourcing Regional Variation Data and Automatic Geolocalisation of Speakers of European French. In N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, ... T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 3336-3342). Paris: European Language Resources Association (ELRA).
@inproceedings{d89d57b4941f4dd484575a10651ae887,
title = "Crowdsourcing Regional Variation Data and Automatic Geolocalisation of Speakers of European French",
abstract = "We present the crowdsourcing platform Donnez Votre Fran{\c{c}}ais {\`a} la Science (DFS, or “Give your French to Science”), which aims to collect linguistic data and document language use, with a special focus on regional variation in European French. The activities not only gather data that is useful for scientific studies, but they also provide feedback to the general public; this is important in order to reward participants, to encourage them to follow future surveys, and to foster interaction with the scientific community. The two main activities described here are 1) a linguistic survey on lexical variation with immediate feedback and 2) a speaker geolocalisation system; i.e., a quiz that guesses the linguistic origin of the participant by comparing their answers with previously gathered linguistic data. For the geolocalisation activity, we set up a simulation framework to optimise predictions. Three classification algorithms are compared: the first one uses clustering and shibboleth detection, whereas the other two rely on feature elimination techniques with Support Vector Machines and Maximum Entropy models as underlying base classifiers. The best-performing system uses a selection of 17 questions and reaches a localisation accuracy of 66{\%}, extending the prediction from the one-best area (one among 109 base areas) to its first-order and second-order neighbouring areas.",
keywords = "113 Computer and information sciences, 6121 Languages, language variation, regionalism, crowdsourcing, geolocalisation, linguistic geography, cartography",
author = "Jean-Philippe Goldman and Yves Scherrer and Julie Glikman and Mathieu Avanzi and Christophe Benzitoun and {Boula de Mare{\"u}il}, Philippe",
year = "2018",
month = "5",
language = "English",
pages = "3336--3342",
editor = "Nicoletta Calzolari and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and H{\'e}l{\`e}ne Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga",
booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)",
publisher = "European Language Resources Association (ELRA)",
address = "Paris",
}


TY - GEN
T1 - Crowdsourcing Regional Variation Data and Automatic Geolocalisation of Speakers of European French
AU - Goldman, Jean-Philippe
AU - Scherrer, Yves
AU - Glikman, Julie
AU - Avanzi, Mathieu
AU - Benzitoun, Christophe
AU - Boula de Mareüil, Philippe
PY - 2018/5
Y1 - 2018/5
AB - We present the crowdsourcing platform Donnez Votre Français à la Science (DFS, or “Give your French to Science”), which aims to collect linguistic data and document language use, with a special focus on regional variation in European French. The activities not only gather data that is useful for scientific studies, but they also provide feedback to the general public; this is important in order to reward participants, to encourage them to follow future surveys, and to foster interaction with the scientific community. The two main activities described here are 1) a linguistic survey on lexical variation with immediate feedback and 2) a speaker geolocalisation system; i.e., a quiz that guesses the linguistic origin of the participant by comparing their answers with previously gathered linguistic data. For the geolocalisation activity, we set up a simulation framework to optimise predictions. Three classification algorithms are compared: the first one uses clustering and shibboleth detection, whereas the other two rely on feature elimination techniques with Support Vector Machines and Maximum Entropy models as underlying base classifiers. The best-performing system uses a selection of 17 questions and reaches a localisation accuracy of 66%, extending the prediction from the one-best area (one among 109 base areas) to its first-order and second-order neighbouring areas.

KW - 113 Computer and information sciences
KW - 6121 Languages
KW - language variation
KW - regionalism
KW - crowdsourcing
KW - geolocalisation
KW - linguistic geography
KW - cartography
M3 - Conference contribution
SP - 3336
EP - 3342
BT - Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
A2 - Calzolari, Nicoletta
A2 - Choukri, Khalid
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Goggi, Sara
A2 - Hasida, Koiti
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Mariani, Joseph
A2 - Mazo, Hélène
A2 - Moreno, Asuncion
A2 - Odijk, Jan
A2 - Piperidis, Stelios
A2 - Tokunaga, Takenobu
PB - European Language Resources Association (ELRA)
CY - Paris
ER -
