Digitising Swiss German: how to process and study a polycentric spoken language

Yves Scherrer, Tanja Samardžić, Elvira Glaser

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety and that it is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is a result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access of the corpus for linguistic, historic and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular.
Original language: English
Journal: Language Resources and Evaluation
Volume: 53
Issue number: 4
Pages (from-to): 735-769
Number of pages: 35
ISSN: 1574-020X
DOI: 10.1007/s10579-019-09457-5
Status: Published - 29 Nov 2019
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 6121 Languages

Cite this

@article{2c040e4bba2e4b9eb708fe85860936e6,
title = "Digitising Swiss German: how to process and study a polycentric spoken language",
abstract = "Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety and that it is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is a result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access of the corpus for linguistic, historic and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular.",
keywords = "6121 Languages",
author = "Yves Scherrer and Tanja Samardžić and Elvira Glaser",
year = "2019",
month = "11",
day = "29",
doi = "10.1007/s10579-019-09457-5",
language = "English",
volume = "53",
pages = "735--769",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer",
number = "4",

}

Digitising Swiss German: how to process and study a polycentric spoken language. / Scherrer, Yves; Samardžić, Tanja; Glaser, Elvira.

In: Language Resources and Evaluation, Vol. 53, No. 4, 29.11.2019, pp. 735-769.


TY - JOUR

T1 - Digitising Swiss German

T2 - how to process and study a polycentric spoken language

AU - Scherrer, Yves

AU - Samardžić, Tanja

AU - Glaser, Elvira

PY - 2019/11/29

Y1 - 2019/11/29

N2 - Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety and that it is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is a result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access of the corpus for linguistic, historic and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular.

AB - Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety and that it is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is a result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access of the corpus for linguistic, historic and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular.

KW - 6121 Languages

U2 - 10.1007/s10579-019-09457-5

DO - 10.1007/s10579-019-09457-5

M3 - Article

VL - 53

SP - 735

EP - 769

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

IS - 4

ER -