Open Subtitles Paraphrase Corpus for Six Languages

Research output: Chapter in book/report/conference proceeding; Conference contribution; Scientific; Peer-reviewed

Abstract

This paper accompanies the release of Opusparcus, a new paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The corpus consists of paraphrases, that is, pairs of sentences in the same language that mean approximately the same thing. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows. The informal and colloquial genre that occurs in subtitles makes such data a very interesting language resource, for instance, from the perspective of computer assisted language learning. For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been checked manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.
Original language: English
Title of host publication: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Editors: Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga
Number of pages: 6
Place of publication: Paris
Publisher: European Language Resources Association (ELRA)
Publication date: 10 May 2018
Pages: 1364-1369
ISBN (electronic): 979-10-95546-00-9
Status: Published - 10 May 2018
MoE publication type: A4 Article in conference proceedings
Event: The International Conference on Language Resources and Evaluation - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018
Conference number: 11
http://lrec2018.lrec-conf.org/en/

Fields of science

  • 6121 Languages

Cite this

Creutz, M. J. P. (2018). Open Subtitles Paraphrase Corpus for Six Languages. In N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, ... T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 1364-1369). Paris: European Language Resources Association (ELRA).
Creutz, Mathias Johan Philip. / Open Subtitles Paraphrase Corpus for Six Languages. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). editor / Nicoletta Calzolari ; Khalid Choukri ; Christopher Cieri ; Thierry Declerck ; Sara Goggi ; Koiti Hasida ; Hitoshi Isahara ; Bente Maegaard ; Joseph Mariani ; Hélène Mazo ; Asuncion Moreno ; Jan Odijk ; Stelios Piperidis ; Takenobu Tokunaga. Paris : European Language Resources Association (ELRA), 2018. pp. 1364-1369
@inproceedings{061fcd9be37049bdbaf079343d207a97,
title = "Open Subtitles Paraphrase Corpus for Six Languages",
abstract = "This paper accompanies the release of Opusparcus, a new paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The corpus consists of paraphrases, that is, pairs of sentences in the same language that mean approximately the same thing. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows. The informal and colloquial genre that occurs in subtitles makes such data a very interesting language resource, for instance, from the perspective of computer assisted language learning. For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been checked manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.",
keywords = "6121 Languages",
author = "Creutz, {Mathias Johan Philip}",
year = "2018",
month = "5",
day = "10",
language = "English",
pages = "1364--1369",
editor = "Calzolari, Nicoletta and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Hasida, Koiti and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios and Tokunaga, Takenobu",
booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)",
publisher = "European Language Resources Association (ELRA)",
address = "Paris",

}

Creutz, MJP 2018, Open Subtitles Paraphrase Corpus for Six Languages. in N Calzolari, K Choukri, C Cieri, T Declerck, S Goggi, K Hasida, H Isahara, B Maegaard, J Mariani, H Mazo, A Moreno, J Odijk, S Piperidis & T Tokunaga (eds), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Paris, pp. 1364-1369, The International Conference on Language Resources and Evaluation, Miyazaki, Japan, 07/05/2018.

Open Subtitles Paraphrase Corpus for Six Languages. / Creutz, Mathias Johan Philip.

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). ed. / Nicoletta Calzolari; Khalid Choukri; Christopher Cieri; Thierry Declerck; Sara Goggi; Koiti Hasida; Hitoshi Isahara; Bente Maegaard; Joseph Mariani; Hélène Mazo; Asuncion Moreno; Jan Odijk; Stelios Piperidis; Takenobu Tokunaga. Paris : European Language Resources Association (ELRA), 2018. pp. 1364-1369.

TY - GEN

T1 - Open Subtitles Paraphrase Corpus for Six Languages

AU - Creutz, Mathias Johan Philip

PY - 2018/5/10

Y1 - 2018/5/10

AB - This paper accompanies the release of Opusparcus, a new paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The corpus consists of paraphrases, that is, pairs of sentences in the same language that mean approximately the same thing. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows. The informal and colloquial genre that occurs in subtitles makes such data a very interesting language resource, for instance, from the perspective of computer assisted language learning. For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been checked manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.

KW - 6121 Languages

M3 - Conference contribution

SP - 1364

EP - 1369

BT - Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A2 - Calzolari, Nicoletta

A2 - Choukri, Khalid

A2 - Cieri, Christopher

A2 - Declerck, Thierry

A2 - Goggi, Sara

A2 - Hasida, Koiti

A2 - Isahara, Hitoshi

A2 - Maegaard, Bente

A2 - Mariani, Joseph

A2 - Mazo, Hélène

A2 - Moreno, Asuncion

A2 - Odijk, Jan

A2 - Piperidis, Stelios

A2 - Tokunaga, Takenobu

PB - European Language Resources Association (ELRA)

CY - Paris

ER -

Creutz MJP. Open Subtitles Paraphrase Corpus for Six Languages. In Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paris: European Language Resources Association (ELRA). 2018. p. 1364-1369