The burden of legacy: Producing the Tagged Corpus of Early English Correspondence Extension (TCEECE)

Research output: Contribution to journal, Article, Scientific, Peer-reviewed

Abstract

This paper discusses the process of part-of-speech tagging the Corpus of Early English Correspondence Extension (CEECE), as well as the end result. The process involved normalisation of historical spelling variation, conversion from a legacy format into TEI-XML, and finally, tokenisation and tagging by the CLAWS software. At each stage, we had to face and work around problems such as whether to retain original spelling variants in corpus markup, how to implement overlapping hierarchies in XML, and how to calculate the accuracy of tagging in a way that acknowledges errors in tokenisation. The final tagged corpus is estimated to have an accuracy of 94.5 per cent (in the C7 tagset), which is circa two percentage points (pp) lower than that of present-day corpora but respectable for Late Modern English. The most accurate tag groups include pronouns and numerals, whereas adjectives and adverbs are among the least accurate. Normalisation increased the overall accuracy of tagging by circa 3.7pp. The combination of POS tagging and social metadata will make the corpus attractive to linguists interested in the interplay between language-internal and -external factors affecting variation and change.
Original language: English
Journal: Research in Corpus Linguistics
Volume: 9
Issue: 1
Pages (from-to): 104-131
Number of pages: 28
ISSN: 2243-4712
DOI
Status: Published - 2021
MoE publication type: A1 Journal article, refereed

Bibliographic information

Special issue, Challenges of Combining Structured and Unstructured Data in Corpus Development, ed. by Tanja Säily & Jukka Tyrkkö.

Fields of science

  • 6121 Languages
