Open Subtitles Paraphrase Corpus for Six Languages

Research output: Chapter in book/report/conference proceedings › Conference article › Scientific › Peer-reviewed

Description

This paper accompanies the release of Opusparcus, a new paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The corpus consists of paraphrases, that is, pairs of sentences in the same language that mean approximately the same thing. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows. The informal and colloquial genre that occurs in subtitles makes such data a very interesting language resource, for instance, from the perspective of computer assisted language learning. For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been checked manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.
Original language: English
Title of host publication: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Editors: Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga
Number of pages: 6
Place of publication: Paris
Publisher: European Language Resources Association (ELRA)
Publication date: 10 May 2018
Pages: 1364-1369
ISBN (electronic): 979-10-95546-00-9
Status: Published - 10 May 2018
Ministry of Education and Culture (OKM) publication type: A4 Article in conference proceedings
Event: The International Conference on Language Resources and Evaluation - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018
Conference number: 11
http://lrec2018.lrec-conf.org/en/

Fields of science

  • 6121 Languages

Cite this

Creutz, M. J. P. (2018). Open Subtitles Paraphrase Corpus for Six Languages. In N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, ... T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 1364-1369). Paris: European Language Resources Association (ELRA).
Creutz, Mathias Johan Philip. / Open Subtitles Paraphrase Corpus for Six Languages. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Editors / Nicoletta Calzolari ; Khalid Choukri ; Christopher Cieri ; Thierry Declerck ; Sara Goggi ; Koiti Hasida ; Hitoshi Isahara ; Bente Maegaard ; Joseph Mariani ; Hélène Mazo ; Asuncion Moreno ; Jan Odijk ; Stelios Piperidis ; Takenobu Tokunaga. Paris : European Language Resources Association (ELRA), 2018. pp. 1364-1369
@inproceedings{061fcd9be37049bdbaf079343d207a97,
title = "Open Subtitles Paraphrase Corpus for Six Languages",
abstract = "This paper accompanies the release of Opusparcus, a new paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The corpus consists of paraphrases, that is, pairs of sentences in the same language that mean approximately the same thing. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows. The informal and colloquial genre that occurs in subtitles makes such data a very interesting language resource, for instance, from the perspective of computer assisted language learning. For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been checked manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.",
keywords = "6121 Languages",
author = "Creutz, {Mathias Johan Philip}",
year = "2018",
month = "5",
day = "10",
language = "English",
pages = "1364--1369",
editor = "Calzolari, Nicoletta and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Hasida, Koiti and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios and Tokunaga, Takenobu",
booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)",
publisher = "European Language Resources Association (ELRA)",
address = "Paris",

}

Creutz, MJP 2018, Open Subtitles Paraphrase Corpus for Six Languages. in N Calzolari, K Choukri, C Cieri, T Declerck, S Goggi, K Hasida, H Isahara, B Maegaard, J Mariani, H Mazo, A Moreno, J Odijk, S Piperidis & T Tokunaga (eds), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Paris, pp. 1364-1369, The International Conference on Language Resources and Evaluation, Miyazaki, Japan, 07/05/2018.

Open Subtitles Paraphrase Corpus for Six Languages. / Creutz, Mathias Johan Philip.

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). eds. / Nicoletta Calzolari; Khalid Choukri; Christopher Cieri; Thierry Declerck; Sara Goggi; Koiti Hasida; Hitoshi Isahara; Bente Maegaard; Joseph Mariani; Hélène Mazo; Asuncion Moreno; Jan Odijk; Stelios Piperidis; Takenobu Tokunaga. Paris : European Language Resources Association (ELRA), 2018. pp. 1364-1369.

TY - GEN

T1 - Open Subtitles Paraphrase Corpus for Six Languages

AU - Creutz, Mathias Johan Philip

PY - 2018/5/10

Y1 - 2018/5/10

AB - This paper accompanies the release of Opusparcus, a new paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The corpus consists of paraphrases, that is, pairs of sentences in the same language that mean approximately the same thing. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows. The informal and colloquial genre that occurs in subtitles makes such data a very interesting language resource, for instance, from the perspective of computer assisted language learning. For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been checked manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.

KW - 6121 Languages

M3 - Conference contribution

SP - 1364

EP - 1369

BT - Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A2 - Calzolari, Nicoletta

A2 - Choukri, Khalid

A2 - Cieri, Christopher

A2 - Declerck, Thierry

A2 - Goggi, Sara

A2 - Hasida, Koiti

A2 - Isahara, Hitoshi

A2 - Maegaard, Bente

A2 - Mariani, Joseph

A2 - Mazo, Hélène

A2 - Moreno, Asuncion

A2 - Odijk, Jan

A2 - Piperidis, Stelios

A2 - Tokunaga, Takenobu

PB - European Language Resources Association (ELRA)

CY - Paris

ER -

Creutz MJP. Open Subtitles Paraphrase Corpus for Six Languages. In Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paris: European Language Resources Association (ELRA). 2018. p. 1364-1369