TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences "meaning the same thing". This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250 000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists. The dataset is available at https://doi.org/10.5281/zenodo.3707949.
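The graph-based extraction described in the abstract can be illustrated with a minimal sketch. It assumes that sentences are nodes, that equivalence links are undirected edges, and that a paraphrase set is the group of same-language sentences within one connected component. The function name `paraphrase_sets` and the input layout are hypothetical, and the filters and pruning steps mentioned in the abstract are not reproduced here.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected by equivalence links.

    sentences: dict mapping sentence id -> (language code, text)
    links:     iterable of (id_a, id_b) pairs, treated as undirected edges
    Returns a sorted list of (language, sorted list of distinct texts),
    keeping only groups with at least two distinct sentences.
    """
    # Union-find over sentence ids (an assumed stand-in for the graph traversal).
    parent = {sid: sid for sid in sentences}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        if a in parent and b in parent:  # ignore links to unknown sentences
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    # Within each connected component, collect distinct texts per language.
    groups = defaultdict(set)
    for sid, (lang, text) in sentences.items():
        groups[(find(sid), lang)].add(text)

    result = [(lang, sorted(texts))
              for (_, lang), texts in groups.items()
              if len(texts) > 1]
    return sorted(result)


# Toy data, invented for illustration only.
example_sentences = {
    1: ("en", "I'm tired."),
    2: ("en", "I am tired."),
    3: ("fr", "Je suis fatigué."),
    4: ("en", "Hello."),
    5: ("fr", "Bonjour."),
}
example_links = [(1, 3), (2, 3), (4, 5)]
print(paraphrase_sets(example_sentences, example_links))
```

In this toy input the two English sentences are both linked to the same French sentence, so they land in one component and form a paraphrase set, while components with a single sentence per language yield nothing.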
Original language: English
Title of host publication: Proceedings of The 12th Language Resources and Evaluation Conference
Editors: Nicoletta Calzolari [et al.]
Number of pages: 6
Place of Publication: Paris
Publisher: European Language Resources Association (ELRA)
Publication date: 1 May 2020
Pages: 6868-6873
ISBN (Electronic): 979-10-95546-34-4
Publication status: Published - 1 May 2020
MoE publication type: A4 Article in conference proceedings
Event: Language Resources and Evaluation Conference - [LREC 2020 was cancelled]
Duration: 11 May 2020 to 16 May 2020
Conference number: 12
https://lrec2020.lrec-conf.org/

Fields of Science

  • 6121 Languages
