Annotation of subtitle paraphrases using a new web tool

Research output: Article in book/report/conference proceedings › Conference article › Scientific › Peer-reviewed

Description

This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators needed to decide to what extent two sentences approximately meant the same thing. The sentences originate from subtitles from movies and TV shows, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with that of a well-known previous paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication as well as open crowdsourced projects, in which anyone can participate and user identification takes place based on IP addresses.
Original language: English
Title of host publication: Proceedings of the Digital Humanities in the Nordic Countries 4th Conference
Editors: Costanza Navarretta, Manex Agirrezabal, Bente Maegaard
Number of pages: 16
Place of publication: Aachen
Publisher: CEUR-WS.org
Publication date: 17 May 2019
Pages: 33-48
Status: Published - 17 May 2019
Ministry of Education and Culture (OKM) publication type: A4 Article in conference proceedings
Event: Digital Humanities in the Nordic Countries - Copenhagen, Denmark
Duration: 5 March 2019 to 8 March 2019
Conference number: 4
https://cst.dk/DHN2019/DHN2019.html

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR-WS.org
Number: 2364
ISSN (electronic): 1613-0073

Fields of science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Aulamo, M. J., Creutz, M. J. P., & Sjöblom, E. I. (2019). Annotation of subtitle paraphrases using a new web tool. In C. Navarretta, M. Agirrezabal, & B. Maegaard (Eds.), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference (pp. 33-48). (CEUR Workshop Proceedings; No. 2364). Aachen: CEUR-WS.org.
Aulamo, Mikko Juhani ; Creutz, Mathias Johan Philip ; Sjöblom, Eetu Ilari. / Annotation of subtitle paraphrases using a new web tool. Proceedings of the Digital Humanities in the Nordic Countries 4th Conference. Editor / Costanza Navarretta ; Manex Agirrezabal ; Bente Maegaard. Aachen : CEUR-WS.org, 2019. pp. 33-48 (CEUR Workshop Proceedings ; 2364).
@inproceedings{10712c916ec441fc8f338a192792c125,
title = "Annotation of subtitle paraphrases using a new web tool",
abstract = "This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators needed to decide to what extent two sentences approximately meant the same thing. The sentences originate from subtitles from movies and TV shows, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with that of a well-known previous paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication as well as open crowdsourced projects, in which anyone can participate and user identification takes place based on IP addresses.",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Aulamo, {Mikko Juhani} and Creutz, {Mathias Johan Philip} and Sj{\"o}blom, {Eetu Ilari}",
year = "2019",
month = "5",
day = "17",
language = "English",
series = "CEUR Workshop Proceedings",
publisher = "CEUR-WS.org",
number = "2364",
pages = "33--48",
editor = "Costanza Navarretta and Manex Agirrezabal and Bente Maegaard",
booktitle = "Proceedings of the Digital Humanities in the Nordic Countries 4th Conference",
address = "Germany",

}

Aulamo, MJ, Creutz, MJP & Sjöblom, EI 2019, Annotation of subtitle paraphrases using a new web tool. in C Navarretta, M Agirrezabal & B Maegaard (eds), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference. CEUR Workshop Proceedings, no. 2364, CEUR-WS.org, Aachen, pp. 33-48, Digital Humanities in the Nordic Countries, Copenhagen, Denmark, 05/03/2019.

Annotation of subtitle paraphrases using a new web tool. / Aulamo, Mikko Juhani; Creutz, Mathias Johan Philip; Sjöblom, Eetu Ilari.

Proceedings of the Digital Humanities in the Nordic Countries 4th Conference. ed. / Costanza Navarretta; Manex Agirrezabal; Bente Maegaard. Aachen : CEUR-WS.org, 2019. p. 33-48 (CEUR Workshop Proceedings ; No. 2364).

TY - GEN

T1 - Annotation of subtitle paraphrases using a new web tool

AU - Aulamo, Mikko Juhani

AU - Creutz, Mathias Johan Philip

AU - Sjöblom, Eetu Ilari

PY - 2019/5/17

Y1 - 2019/5/17

N2 - This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators needed to decide to what extent two sentences approximately meant the same thing. The sentences originate from subtitles from movies and TV shows, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with that of a well-known previous paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication as well as open crowdsourced projects, in which anyone can participate and user identification takes place based on IP addresses.

KW - 113 Computer and information sciences

KW - 6121 Languages

M3 - Conference contribution

T3 - CEUR Workshop Proceedings

SP - 33

EP - 48

BT - Proceedings of the Digital Humanities in the Nordic Countries 4th Conference

A2 - Navarretta, Costanza

A2 - Agirrezabal, Manex

A2 - Maegaard, Bente

PB - CEUR-WS.org

CY - Aachen

ER -

Aulamo MJ, Creutz MJP, Sjöblom EI. Annotation of subtitle paraphrases using a new web tool. In Navarretta C, Agirrezabal M, Maegaard B, editors, Proceedings of the Digital Humanities in the Nordic Countries 4th Conference. Aachen: CEUR-WS.org. 2019. p. 33-48. (CEUR Workshop Proceedings; 2364).