Annotation of subtitle paraphrases using a new web tool

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Scientific > peer-review

Abstract

This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators decided to what extent two sentences meant approximately the same thing. The sentences originate from movie and TV show subtitles, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with that of a well-known earlier paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. It can be used for closed projects with restricted access and controlled user authentication as well as for open crowdsourced projects, in which anyone can participate and users are identified by IP address.
Original language: English
Title of host publication: Proceedings of the Digital Humanities in the Nordic Countries 4th Conference
Editors: Costanza Navarretta, Manex Agirrezabal, Bente Maegaard
Number of pages: 16
Place of Publication: Aachen
Publisher: CEUR-WS.org
Publication date: 17 May 2019
Pages: 33-48
Publication status: Published - 17 May 2019
MoE publication type: A4 Article in conference proceedings
Event: Digital Humanities in the Nordic Countries - Copenhagen, Denmark
Duration: 5 Mar 2019 to 8 Mar 2019
Conference number: 4
https://cst.dk/DHN2019/DHN2019.html

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR-WS.org
Number: 2364
ISSN (Electronic): 1613-0073

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Aulamo, M. J., Creutz, M. J. P., & Sjöblom, E. I. (2019). Annotation of subtitle paraphrases using a new web tool. In C. Navarretta, M. Agirrezabal, & B. Maegaard (Eds.), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference (pp. 33-48). (CEUR Workshop Proceedings; No. 2364). Aachen: CEUR-WS.org.
@inproceedings{10712c916ec441fc8f338a192792c125,
title = "Annotation of subtitle paraphrases using a new web tool",
abstract = "This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators decided to what extent two sentences meant approximately the same thing. The sentences originate from movie and TV show subtitles, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with that of a well-known earlier paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. It can be used for closed projects with restricted access and controlled user authentication as well as for open crowdsourced projects, in which anyone can participate and users are identified by IP address.",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Aulamo, {Mikko Juhani} and Creutz, {Mathias Johan Philip} and Sj{\"o}blom, {Eetu Ilari}",
year = "2019",
month = "5",
day = "17",
language = "English",
series = "CEUR Workshop Proceedings",
publisher = "CEUR-WS.org",
number = "2364",
pages = "33--48",
editor = "Costanza Navarretta and Manex Agirrezabal and Bente Maegaard",
booktitle = "Proceedings of the Digital Humanities in the Nordic Countries 4th Conference",
address = "Aachen, Germany",
}
