Integrating language technology into work with spoken corpora

Niko Partanen, Michael Riessler, Joshua Wilbur

Research output: Conference materials › Other conference material › Peer-reviewed


Our paper presents ongoing work in several language documentation projects on endangered languages spoken in northern Fenno-Scandia and northern Russia. One feature that distinguishes our work from many similar documentation projects is the systematic application of language technology in order to avoid manual corpus annotation. This is crucial, since the corpora we work with are so large that manual work can rarely annotate more than a small fraction of the whole corpus. These limitations are present in all parts of the workflow, except potentially the recording process itself. The systematic use of language technology is a logical solution to this manual-work bottleneck. However, its implementation is not always straightforward, so the approach we propose requires closer collaboration between researchers developing tools for Natural Language Processing and linguists. Our aim is to use the same system of analysis for spoken and written language, although each comes with its own set of particularities, mainly stemming from the original sources and formats. This requires specific choices concerning spoken language annotation, primarily the use of transcription systems that are compatible with or adaptable to existing tools. However, in our experience the use of contemporary orthographies also works very well with spoken language, especially when transcription accuracy is mainly at the phoneme level. A technical framework that we successfully integrate into our workflows is EMU (Winkelmann et al. 2019), which allows reasonably accurate phoneme- and word-level segmentation from existing utterance-level annotations. From this point of view, transcription at the utterance level is more than sufficient, as more fine-grained levels can be derived automatically. For linguistic annotation we rely primarily on Giellatekno tools (Trosterud 2006).
Since these tools exist for, inter alia, a variety of Saamic languages, Finnish and Norwegian, researchers working on northern Eurasian languages should consider whether these tools can be integrated into their workflows. Our approach has been to add analyses directly as annotations to ELAN files using Python (Authors 2017), and in later versions the tools have been implemented as an external web service using the uralicNLP package (Hämäläinen 2018). We have also conducted tests with various dependency parsers, but in our low-resource scenario these have not (yet) resulted in viable solutions. Our third line of work involves speech recognition. The most exciting part of this work has been the translation of Mozilla’s Common Voice platform into the languages we work with, and we aim to use this tool to collect larger amounts of spoken data than is possible with normal means of recording and transcription. Since several projects have already reported successful results with speech recognition on endangered languages (Adams et al. 2018; Foley et al. 2018), it is only a matter of time before this will be available for our languages. However, since the magnitude of resources needed is beyond what one project can reasonably work with, wider collaboration and data sharing between research projects and institutions will be needed.

References:
Oliver Adams, Trevor Cohn, Graham Neubig, Hilaria Cruz, Steven Bird and Alexis Michaud 2018. Evaluating Phonemic Transcription of Low-Resource Tonal Languages for Language Documentation. In Proceedings of LREC 2018.
B. Foley, J. Arnold, R. Coto-Solano, G. Durantin, T. M. Ellison, D. van Esch, S. Heath, F. Kratochvíl, Z. Maxwell-Smith, D. Nash, O. Olsson, M. Richards, N. San, H. Stoakes, N. Thieberger and J. Wiles 2018. Building Speech Recognition Systems for Language Documentation: The CoEDL Endangered Language Pipeline and Inference System (Elpis). In S. S. Agrawal (Ed.), The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU). 200–204.
Mika Hämäläinen 2018. UralicNLP (Version v1.0). Zenodo.
Trond Trosterud 2006. Grammatically based language technology for minority languages. In Anju Saxena & Lars Borin (Eds.), Lesser-known languages of South Asia: Status and policies, case studies and applications of information technology. Berlin: Mouton de Gruyter, 293–316.
Raphael Winkelmann, Klaus Jaensch, Steve Cassidy and Jonathan Harrington 2019. emuR: Main Package of the EMU Speech Database Management System. R package version 1.1.2.
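
The step described in the abstract, adding analyses as annotations directly to ELAN files with Python, could look roughly like the following. This is a minimal sketch and not the authors' implementation: it uses only the Python standard library, the embedded document is a stripped-down illustration of the ELAN XML layout rather than a complete valid .eaf file, the tier names `orth` and `morph` are invented, and `placeholder_analyzer` is a hypothetical stand-in for a real morphological analyser such as the Giellatekno tools.

```python
import xml.etree.ElementTree as ET

# Stripped-down, ELAN-like document for illustration only. A real .eaf file
# also carries a header, linguistic type definitions and other elements.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="orth" LINGUISTIC_TYPE_REF="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>token_a token_b</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def placeholder_analyzer(token):
    """Hypothetical stand-in for a real analyser; it only appends a dummy tag."""
    return f"{token}+Unknown"


def add_analysis_tier(root, source_tier_id, new_tier_id, analyzer):
    """Append a dependent tier whose annotations hold the analyser output."""
    source = root.find(f"./TIER[@TIER_ID='{source_tier_id}']")
    if source is None:
        raise ValueError(f"no tier named {source_tier_id!r}")
    # Simplification: the tier is appended at the end of the document and its
    # linguistic type ("analysis") is assumed to be defined elsewhere.
    tier = ET.SubElement(root, "TIER", {
        "TIER_ID": new_tier_id,
        "LINGUISTIC_TYPE_REF": "analysis",
        "PARENT_REF": source_tier_id,
    })
    for n, ann in enumerate(source.iter("ALIGNABLE_ANNOTATION"), start=1):
        text = ann.findtext("ANNOTATION_VALUE", default="")
        value = " ".join(analyzer(tok) for tok in text.split())
        wrapper = ET.SubElement(tier, "ANNOTATION")
        # A reference annotation points back at the utterance it analyses.
        ref = ET.SubElement(wrapper, "REF_ANNOTATION", {
            "ANNOTATION_ID": f"{new_tier_id}{n}",
            "ANNOTATION_REF": ann.get("ANNOTATION_ID"),
        })
        ET.SubElement(ref, "ANNOTATION_VALUE").text = value
    return tier


root = ET.fromstring(EAF)
morph_tier = add_analysis_tier(root, "orth", "morph", placeholder_analyzer)
serialized = ET.tostring(root, encoding="unicode")
```

In practice the placeholder would be replaced by calls to an actual analyser, and the annotation IDs would need to be checked for uniqueness against those already present in the file.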
Status: Published - 16 August 2019
Ministry of Education and Culture (OKM) publication type: Not applicable
Event: Research Data and Humanities 2019 - University of Oulu, Oulu, Finland
Duration: 14 August 2019 to 16 August 2019


Conference: Research Data and Humanities 2019
Abbreviated title: RDHum 2019


  • 6121 Languages
  • language technology
  • language documentation
  • research methods
  • audiovisual material
