Improving corpus annotation productivity: a method and experiment with interactive tagging

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Scientific; peer-review

Abstract

Corpus linguistic and language technological research needs empirical corpus data with nearly correct annotation and high volume to enable advances in language modelling and theorising. Recent work on improving corpus annotation accuracy presents semiautomatic methods to correct some of the analysis errors in available annotated corpora, while leaving the remaining errors undetected in the annotated corpus. We review recent advances in linguistics-based partial tagging and parsing, and regard the achieved analysis performance as sufficient for reconsidering a previously proposed method: combining nearly correct but partial automatic analysis with a minimal amount of human postediting (disambiguation) to achieve nearly correct corpus annotation accuracy at a competitive annotation speed. We report a pilot experiment with morphological (part-of-speech) annotation using a partial linguistic tagger of a kind previously reported with a very attractive precision-recall ratio, and observe that a desired level of annotation accuracy can be reached by using human disambiguation for less than 10% of the words in the corpus.
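
The workflow summarised in the abstract can be illustrated with a minimal sketch. It is not the tagger or the tool used in the experiment: the tag names, the toy sentence, and the helper functions `ambiguous_positions`, `human_workload`, and `postedit` are all hypothetical. The sketch assumes that a partial tagger returns, for each word, the set of tags it could not rule out, so that only words with more than one remaining tag are passed to a human annotator.

```python
from typing import Callable, List, Set, Tuple

Analysis = List[Tuple[str, Set[str]]]

# Hypothetical partial-tagger output (toy data, not from the paper):
# each word keeps every tag the tagger could not rule out.
PARTIAL_ANALYSIS: Analysis = [
    ("the", {"DET"}),
    ("old", {"ADJ"}),
    ("man", {"NOUN"}),
    ("can", {"AUX", "NOUN"}),  # left ambiguous by the tagger
    ("fish", {"VERB"}),
]


def ambiguous_positions(analysis: Analysis) -> List[int]:
    """Return the indices of words that still carry more than one tag."""
    return [i for i, (_, tags) in enumerate(analysis) if len(tags) > 1]


def human_workload(analysis: Analysis) -> float:
    """Return the share of words that need human disambiguation."""
    if not analysis:
        return 0.0
    return len(ambiguous_positions(analysis)) / len(analysis)


def postedit(analysis: Analysis,
             choose: Callable[[str, Set[str]], str]) -> List[Tuple[str, str]]:
    """Accept unambiguous tags as they are; ask `choose` for the rest."""
    result = []
    for word, tags in analysis:
        if len(tags) == 1:
            tag = next(iter(tags))
        else:
            tag = choose(word, tags)
            # The annotator may only pick among the tags the tagger left open.
            if tag not in tags:
                raise ValueError(f"{tag!r} is not a candidate for {word!r}")
        result.append((word, tag))
    return result


# Stand-in for the human annotator: always picks "AUX" in this toy example.
annotated = postedit(PARTIAL_ANALYSIS, lambda word, tags: "AUX")
```

In this toy sentence one word in five is left for the annotator; the figure has no relation to the proportion reported in the abstract and only shows how such a share would be computed.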
Original language: English
Title of host publication: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Publisher: European Language Resources Association (ELRA)
Publication date: 2012
ISBN (Electronic): 978-2-9517408-7-7
Publication status: Published - 2012
MoE publication type: A4 Article in conference proceedings
Event: LREC 2012 - Istanbul, Turkey
Duration: 23 May 2012 to 25 May 2012
Conference number: 8

Fields of Science

  • 6121 Languages
  • treebanks

Cite this

Voutilainen, A. (2012). Improving corpus annotation productivity: a method and experiment with interactive tagging. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA).
@inproceedings{046494fdcd8f4069a2ab3f60ad50ff9c,
title = "Improving corpus annotation productivity: a method and experiment with interactive tagging",
abstract = "Corpus linguistic and language technological research needs empirical corpus data with nearly correct annotation and high volume to enable advances in language modelling and theorising. Recent work on improving corpus annotation accuracy presents semiautomatic methods to correct some of the analysis errors in available annotated corpora, while leaving the remaining errors undetected in the annotated corpus. We review recent advances in linguistics-based partial tagging and parsing, and regard the achieved analysis performance as sufficient for reconsidering a previously proposed method: combining nearly correct but partial automatic analysis with a minimal amount of human postediting (disambiguation) to achieve nearly correct corpus annotation accuracy at a competitive annotation speed. We report a pilot experiment with morphological (part-of-speech) annotation using a partial linguistic tagger of a kind previously reported with a very attractive precision-recall ratio, and observe that a desired level of annotation accuracy can be reached by using human disambiguation for less than 10\% of the words in the corpus.",
keywords = "6121 Languages, treebanks",
author = "Atro Voutilainen",
year = "2012",
language = "English",
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)",
publisher = "European Language Resources Association (ELRA)",
address = "International",

}
