Automatic validation and processing of ELAN corpora for spoken language data

Research output: Conference materials › Paper › peer-review


This presentation demonstrates and discusses workflows developed during the authors’ research with spoken corpora of various endangered languages of Northern Eurasia. In our projects, the spoken language material has been transcribed and translated in ELAN, resulting in parallel corpora with aligned audio (and in some cases video). Deeper linguistic annotations, such as morphosyntactic tagging, are also stored in the same files. ELAN is an open source tool that has become a quasi-standard for annotating spoken data in fieldwork-based documentation of endangered languages. For automatic parsing and tagging we have programmed rule-based analysers using the open source infrastructure of Giellatekno (Trosterud 2006), for which we have successfully developed interaction with ELAN.

This language technology-oriented practice has enabled our projects to focus on the systematization of primary data resources and their transcriptions and translations. Once the transcriptions are edited, the existing morphosyntactic annotations are overwritten with a new analysis matching the current state of our parser. This solves, to a certain degree, one common problem in spoken language corpora of similar kinds: the annotations at the sentence and word level are hierarchically connected to each other, and editing transcriptions at higher levels will inevitably lead to changes at other levels. Maintaining this consistency manually is very challenging and easily leads to inconsistencies in annotation.
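The overwrite step can be illustrated with a simplified in-memory representation of one utterance. The tier names and the toy analyser below are hypothetical stand-ins, not our actual pipeline:

```python
# Re-derive a dependent analysis tier from the word tier, discarding any
# stale analyses. A simplified sketch; tier names and the analyser are
# toy examples standing in for a rule-based morphological analyser.

def toy_analyser(word):
    # Stand-in for a real analyser (e.g. an FST lookup in the
    # Giellatekno infrastructure).
    return f"{word}+Unknown"

def refresh_analysis(corpus):
    """Overwrite the 'morph' tier so it always matches the 'word' tier."""
    for utterance in corpus:
        utterance["morph"] = [toy_analyser(w) for w in utterance["word"]]
    return corpus

corpus = [
    # The stale 'morph' tier no longer matches the edited 'word' tier.
    {"word": ["mene", "kotiin"], "morph": ["STALE", "STALE", "EXTRA"]},
]
refresh_analysis(corpus)
print(corpus[0]["morph"])  # stale analyses replaced, lengths realigned
```

Because the analysis tier is always regenerated rather than patched, edits to the transcription can never leave orphaned or mismatched annotations behind.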

Furthermore, ELAN allows a relatively lax and manually editable structure, which easily leads to very differently structured files. The software does not report on inconsistencies, which may result in difficulties when attempting to apply corpus queries across multiple ELAN files. The solution we have adopted has been to develop systematic testing scripts that verify that all files in the corpus actually conform to the same set of principles. In the case of ELAN, these are related to tier types, tier names and their combinations. This way, errors in the corpus structure can be detected immediately. Eventually, the testing should be conducted with the tools of continuous integration with automatic error reporting.
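Since ELAN's EAF files are XML, such a consistency check needs nothing beyond an XML parser. A minimal sketch, assuming a project convention in which tier names and their linguistic types are fixed (the schema and tier names here are hypothetical examples, not our actual conventions):

```python
# Check that an EAF document uses exactly the expected tier names and
# linguistic types. The expected schema is a hypothetical convention.
import xml.etree.ElementTree as ET

EXPECTED_TIERS = {
    "ref": "refT",
    "orth": "orthT",
    "word": "wordT",
}

def validate_eaf(xml_text):
    """Return a list of problems found in one EAF document."""
    root = ET.fromstring(xml_text)
    problems = []
    tiers = {t.get("TIER_ID"): t.get("LINGUISTIC_TYPE_REF")
             for t in root.iter("TIER")}
    for tier_id, type_ref in EXPECTED_TIERS.items():
        if tier_id not in tiers:
            problems.append(f"missing tier: {tier_id}")
        elif tiers[tier_id] != type_ref:
            problems.append(
                f"tier {tier_id} has type {tiers[tier_id]}, "
                f"expected {type_ref}")
    for tier_id in tiers:
        if tier_id not in EXPECTED_TIERS:
            problems.append(f"unexpected tier: {tier_id}")
    return problems

sample = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="ref" LINGUISTIC_TYPE_REF="refT"/>
  <TIER TIER_ID="orth" LINGUISTIC_TYPE_REF="orthT"/>
  <TIER TIER_ID="note" LINGUISTIC_TYPE_REF="noteT"/>
</ANNOTATION_DOCUMENT>"""

print(validate_eaf(sample))
# → ['missing tier: word', 'unexpected tier: note']
```

Run over every file in the corpus, a check like this turns structural drift into an immediate, file-by-file error report, which is exactly what a continuous integration job needs.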

Besides validation, similar approaches can be applied to corpus parsing. Automatically parsing the corpus annotations and merging them with available metadata allows very rapid analysis of corpus content and structure, which also has numerous benefits when it comes to error detection. Presenting the corpus in a logical data structure within programming languages such as R and Python makes it very easy to convert the corpus into other formats, while also forcing the researchers to be aware of the machine readability of the annotation schemes used.
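One way to sketch this is to flatten the annotations of an EAF document into one row per annotation, merging in session-level metadata as extra columns. The tier layout, metadata fields, and example content below are hypothetical:

```python
# Flatten ELAN annotations into a list of dicts, one row per annotation,
# merged with file-level metadata. A sketch with a hypothetical tier
# layout and invented example content.
import xml.etree.ElementTree as ET

def eaf_to_rows(xml_text, metadata):
    """Yield one flat row per annotation value, tagged with its tier."""
    root = ET.fromstring(xml_text)
    rows = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ANNOTATION_VALUE"):
            rows.append({"tier": tier_id, "value": ann.text, **metadata})
    return rows

sample = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="orth">
    <ANNOTATION><ALIGNABLE_ANNOTATION ANNOTATION_ID="a1">
      <ANNOTATION_VALUE>mene-n kotiin</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

rows = eaf_to_rows(sample, {"speaker": "S1", "session": "rec01"})
print(rows)
```

From a flat structure like this, conversion to other formats is a one-liner: the rows drop directly into, e.g., `csv.writer` or a pandas DataFrame for querying across the whole corpus.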

The existence of a variety of annotation schemes and ELAN tier structures makes it very difficult to reuse existing tools in new projects. Creating tools for generalized use beyond the concrete conventions of one project poses a challenge, and often the only feasible approach seems to be to rewrite everything. However, we not only believe that common solutions can be found, but that they would benefit the field at large. Nevertheless, the discussion on how to achieve this is only just getting started. Our presentation is an attempt to advance this development.
Original language: English
Publication status: Published - 16 Aug 2019
MoE publication type: Not Eligible
Event: Research Data and Humanities 2019 - University of Oulu, Oulu, Finland
Duration: 14 Aug 2019 – 16 Aug 2019


Conference: Research Data and Humanities 2019
Abbreviated title: RDHum 2019

Fields of Science

  • 6121 Languages
