Abstract
Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the "Finno-Ugric Languages and the Internet" (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project.
Original language | English |
---|---|
Title of host publication | Proceedings of the 12th Web as Corpus Workshop |
Editors | Adrien Barbaresi, Felix Bildhauer, Roland Schäfer, Egon Stemle |
Number of pages | 10 |
Place of Publication | Stroudsburg |
Publisher | The Association for Computational Linguistics |
Publication date | 2020 |
Pages | 23-32 |
ISBN (Electronic) | 979-10-95546-68-9 |
Publication status | Published - 2020 |
MoE publication type | A4 Article in conference proceedings |
Event | Language Resources and Evaluation Conference - [LREC 2020 was cancelled]; Duration: 11 May 2020 → 16 May 2020; Conference number: 12; https://lrec2020.lrec-conf.org/ |
Fields of Science
- 6121 Languages
Projects
- Finno-Ugric Languages and the Internet
Linden, K. (Other), Jauhiainen, H. (Participant) & Jauhiainen, T. (Participant)
01/01/2013 → 31/12/2018
Project: Research project
Datasets
- Wanca 2016, Korp-versio
Jauhiainen, H. (Creator) & Jauhiainen, T. (Creator), Language Bank of Finland, Aug 2019
http://urn.fi/urn:nbn:fi:lb-2019052401
Dataset
Equipment
- CLARIN Common Language Resource and Technology Infrastructure in Finland
Linden, K. (Manager)
Department of Digital Humanities
Facility/equipment: Coordination office