TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences "meaning the same thing". This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250 000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists. The dataset is available at https://doi.org/10.5281/zenodo.3707949.
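
The graph construction and traversal described in the abstract can be illustrated with a minimal sketch. It assumes that "traversing" means collecting connected components of the equivalence graph and then grouping each component's sentences by language. This is a simplification for illustration, not necessarily the paper's exact procedure. The function name `paraphrase_sets`, the toy data, and the input layout are hypothetical, and the filtering and pruning steps mentioned in the abstract are not reproduced.

```python
from collections import defaultdict, deque


def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected by equivalence links.

    sentences: dict mapping sentence id -> (language code, text)
    links: iterable of (id_a, id_b) pairs meaning "these mean the same thing"
    Returns a list of (language, sorted list of texts) with >= 2 texts each.
    """
    # Build an undirected adjacency list from the equivalence links.
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    result = []
    for start in sentences:
        if start in seen:
            continue
        # Breadth-first traversal collects one connected component.
        seen.add(start)
        queue = deque([start])
        by_lang = defaultdict(set)
        while queue:
            node = queue.popleft()
            lang, text = sentences[node]
            by_lang[lang].add(text)
            for neighbour in graph[node]:
                if neighbour in sentences and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        # A paraphrase set needs at least two distinct sentences in one language.
        for lang, texts in by_lang.items():
            if len(texts) >= 2:
                result.append((lang, sorted(texts)))
    return result


# Toy data (invented for illustration, not taken from Tatoeba).
toy_sentences = {
    1: ("en", "I am hungry."),
    2: ("fr", "J'ai faim."),
    3: ("en", "I'm hungry."),
    4: ("en", "It is raining."),
    5: ("de", "Es regnet."),
}
toy_links = [(1, 2), (2, 3), (4, 5)]

print(paraphrase_sets(toy_sentences, toy_links))
```

In the toy data, the two English sentences are never linked directly. They end up in the same set only because both are linked to the same French sentence, which is how links across languages can yield paraphrases within one language.
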
Original language: English
Title of host publication: Proceedings of The 12th Language Resources and Evaluation Conference
Number of pages: 6
Place of Publication: Marseille, France
Publisher: European Language Resources Association (ELRA)
Publication date: 1 May 2020
Pages: 6868-6873
Publication status: Published - 1 May 2020
MoE publication type: A4 Article in conference proceedings

Cite this

Scherrer, Y. (2020). TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 6868-6873). Marseille, France: European Language Resources Association (ELRA).