Annotation of subtitle paraphrases using a new web tool

Research output: Chapter in book/report/conference proceeding › Conference contribution › Scientific › Peer-reviewed

Abstract

This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators needed to decide to what extent two sentences approximately meant the same thing. The sentences originate from subtitles from movies and TV shows, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with a well-known previous paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication as well as open crowdsourced projects, in which anyone can participate and user identification takes place based on IP addresses.
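As an illustration of how agreement between two independent annotators can be quantified, the following is a minimal sketch of Cohen's kappa. The abstract does not state which agreement measure the authors used, so the choice of kappa is an assumption here, and the function name and the label values ("good", "bad") are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa).

    labels_a and labels_b hold the category each annotator assigned
    to the same sequence of sentence pairs.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both annotators labelled at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgments for four sentence pairs.
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
print(cohen_kappa(annotator_1, annotator_2))
```

A value of 1 indicates perfect agreement, while 0 indicates agreement no better than chance.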
Original language: English
Title of host publication: Proceedings of the Digital Humanities in the Nordic Countries 4th Conference
Editors: Costanza Navarretta, Manex Agirrezabal, Bente Maegaard
Number of pages: 16
Place of publication: Aachen
Publisher: CEUR-WS.org
Publication date: 17 May 2019
Pages: 33-48
Status: Published - 17 May 2019
MoE publication type: A4 Article in conference proceedings
Event: Digital Humanities in the Nordic Countries - Copenhagen, Denmark
Duration: 5 Mar 2019 to 8 Mar 2019
Conference number: 4
https://cst.dk/DHN2019/DHN2019.html

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR-WS.org
Number: 2364
ISSN (electronic): 1613-0073

Fields of science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Aulamo, M. J., Creutz, M. J. P., & Sjöblom, E. I. (2019). Annotation of subtitle paraphrases using a new web tool. In C. Navarretta, M. Agirrezabal, & B. Maegaard (Eds.), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference (pp. 33-48). (CEUR Workshop Proceedings; No. 2364). Aachen: CEUR-WS.org.
@inproceedings{10712c916ec441fc8f338a192792c125,
title = "Annotation of subtitle paraphrases using a new web tool",
abstract = "This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators needed to decide to what extent two sentences approximately meant the same thing. The sentences originate from subtitles from movies and TV shows, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with a well-known previous paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication as well as open crowdsourced projects, in which anyone can participate and user identification takes place based on IP addresses.",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Aulamo, {Mikko Juhani} and Creutz, {Mathias Johan Philip} and Sj{\"o}blom, {Eetu Ilari}",
year = "2019",
month = "5",
day = "17",
language = "English",
series = "CEUR Workshop Proceedings",
publisher = "CEUR-WS.org",
number = "2364",
pages = "33--48",
editor = "Costanza Navarretta and Manex Agirrezabal and Bente Maegaard",
booktitle = "Proceedings of the Digital Humanities in the Nordic Countries 4th Conference",
address = "Germany",

}


