Abstract
Our paper presents ongoing work in several language documentation projects on endangered languages spoken in northern Fennoscandia and northern Russia. One feature that distinguishes our work from many similar documentation projects is the systematic application of language technology in order to avoid manual corpus annotation. This is crucial because the corpora we work with are so large that manual annotation can rarely cover more than a small fraction of the whole. These limitations are present in all parts of the workflow, except potentially the recording process itself. The systematic use of language technology is a logical solution to this manual-work bottleneck. However, its implementation is not always straightforward, so the approach we propose requires closer collaboration between researchers developing tools for Natural Language Processing and linguists.
Our aim is to use the same system of analysis for spoken and written language, although each comes with its own particularities, mainly stemming from the original sources and formats. This requires specific choices concerning spoken language annotation, primarily the use of transcription systems that are compatible with or adaptable to existing tools. In our experience, however, contemporary orthographies also work very well for spoken language, especially when transcription accuracy is mainly at the phoneme level.
A technical framework that we have successfully integrated into our workflows is EMU (Winkelmann et al. 2019), which allows reasonably accurate phoneme- and word-level segmentation from existing utterance-level annotations. From this point of view, transcribing at the utterance level is more than sufficient, as finer-grained levels can be derived automatically.
For linguistic annotation we rely primarily on Giellatekno tools (Trosterud 2006). Since these tools exist for, inter alia, a variety of Saamic languages, Finnish and Norwegian, researchers working on northern Eurasian languages should consider whether these tools can be integrated into their workflows. Our approach has been to add the analyses directly to ELAN files as annotations using Python (Authors 2017), and in later versions the tools have been implemented as an external web service using the uralicNLP package (Hämäläinen 2018). We have also conducted tests with various dependency parsers, but in our low-resource scenario these have not (yet) resulted in viable solutions.
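The idea of writing analyses into ELAN files can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tier names (`orth`, `morph`), the linguistic type name `morphT`, the sample XML, and the `TOY_LEXICON` and `toy_analyze` stand-ins are all hypothetical. A real EAF file also carries a header and linguistic type definitions that are omitted here, and in practice a real analyser (for example one accessed through uralicNLP) would presumably be passed in place of the toy function.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced EAF-like document: one time-aligned
# transcription tier ("orth") holding a single utterance of toy tokens.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="orth" LINGUISTIC_TYPE_REF="orthT">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>alpha beta</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

# Toy stand-in for a morphological analyser (placeholder tokens and tags).
TOY_LEXICON = {
    "alpha": ["alpha+N+Sg+Nom"],
    "beta": ["beta+V+Ind+Prs+Sg3"],
}


def toy_analyze(token):
    """Return a list of readings for a token; unknown tokens get '+?'."""
    return TOY_LEXICON.get(token.lower(), [token + "+?"])


def add_analysis_tier(root, source_tier_id, new_tier_id, analyze):
    """Add a dependent tier whose annotations hold analyses of the source tier.

    Each new annotation is a REF_ANNOTATION pointing at the transcription
    annotation it analyses; ambiguous readings are joined with '/'.
    """
    source = root.find(f"./TIER[@TIER_ID='{source_tier_id}']")
    if source is None:
        raise ValueError(f"no tier named {source_tier_id!r}")
    new_tier = ET.SubElement(root, "TIER", {
        "TIER_ID": new_tier_id,
        "PARENT_REF": source_tier_id,
        "LINGUISTIC_TYPE_REF": "morphT",  # hypothetical type name
    })
    for n, ann in enumerate(source.iter("ALIGNABLE_ANNOTATION"), start=1):
        text = ann.findtext("ANNOTATION_VALUE", default="")
        readings = ["/".join(analyze(tok)) for tok in text.split()]
        wrapper = ET.SubElement(new_tier, "ANNOTATION")
        ref = ET.SubElement(wrapper, "REF_ANNOTATION", {
            "ANNOTATION_ID": f"{new_tier_id}_{n}",
            "ANNOTATION_REF": ann.get("ANNOTATION_ID"),
        })
        ET.SubElement(ref, "ANNOTATION_VALUE").text = " ".join(readings)
    return new_tier


if __name__ == "__main__":
    doc = ET.fromstring(EAF_SAMPLE)
    add_analysis_tier(doc, "orth", "morph", toy_analyze)
    print(ET.tostring(doc, encoding="unicode"))
```

Passing the analyser in as a function keeps the file handling independent of the tool that produces the analyses, so a local analyser and a web service could be swapped without touching the ELAN code.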
Our third line of work involves speech recognition. The most exciting of these approaches has been the translation of Mozilla’s Common Voice platform into the languages we work with, and we aim to use this tool to collect larger amounts of spoken data than is possible with conventional means of recording and transcription. Since several projects have already reported successful results with speech recognition on endangered languages (Adams et al. 2018; Foley et al. 2018), it is only a matter of time before this becomes available for our languages. However, since the magnitude of resources needed is beyond what one project can reasonably work with, wider collaboration and data sharing between research projects and institutions will be needed.
References:
Oliver Adams, Trevor Cohn, Graham Neubig, Hilaria Cruz, Steven Bird and Alexis Michaud 2018. Evaluating Phonemic Transcription of Low-Resource Tonal Languages for Language Documentation. In Proceedings of LREC 2018.
B. Foley, J. Arnold, R. Coto-Solano, G. Durantin, T. M. Ellison, D. van Esch, S. Heath, F. Kratochvíl, Z. Maxwell-Smith, D. Nash, O. Olsson, M. Richards, N. San, H. Stoakes, N. Thieberger and J. Wiles 2018. Building Speech Recognition Systems for Language Documentation: The CoEDL Endangered Language Pipeline and Inference System (Elpis). In S. S. Agrawal (Ed.), The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU). 200–204.
Mika Hämäläinen 2018. UralicNLP (Version v1.0). Zenodo. http://doi.org/10.5281/zenodo.1143638
Trond Trosterud 2006. Grammatically based language technology for minority languages. In: Lesser-known languages of South Asia: Status and policies, case studies and applications of information technology, ed. by Anju Saxena & Lars Borin. Berlin: Mouton de Gruyter, 293–316.
Raphael Winkelmann, Klaus Jaensch, Steve Cassidy and Jonathan Harrington 2019. emuR: Main Package of the EMU Speech Database Management System. R package version 1.1.2.
Original language | Finnish |
---|---|
Publication status | Published - 16 Aug 2019 |
MoE publication type | Not Eligible |
Event | Research Data and Humanities 2019 - University of Oulu, Oulu, Finland Duration: 14 Aug 2019 → 16 Aug 2019 https://www.oulu.fi/suomenkieli/node/55261 |
Conference
Conference | Research Data and Humanities 2019 |
---|---|
Abbreviated title | RDHum 2019 |
Country/Territory | Finland |
City | Oulu |
Period | 14/08/2019 → 16/08/2019 |
Internet address |
Fields of Science
- 6121 Languages
Projects
- 1 Active
- IKDP-2: Language Documentation meets Language Technology: The Next Step in the Description of Komi
  Blokland, R., Rießler, M. & Partanen, N.
  01/03/2017 → …
  Project: Research project