English historical corpora in transition: from new tools to legacy corpora?

Research output: Chapter in Book/Report/Conference proceeding, Chapter, Scientific, peer-review

Abstract

The first multigenre historical corpora of the English language were published in the early 1990s, almost thirty years after the first Present-Day English corpus was released in 1964. The Helsinki Corpus of English Texts (HC) came out in 1991, and the Helsinki Corpus of Older Scots (HCOS) in 1995. The introduction to the latter justifiably called it a ‘new tool’ (Meurman-Solin 1995). These tools were new in several respects. They provided systematically selected data on historical varieties of English, comprising closely matching genres from consecutive periods of time. They also made it possible to search texts using an extensive set of metadata, including period-, variety-, and writer-specific information.

However, twenty years is a long time in the life of electronic data sources – long enough in fact to make the first Present-Day English corpora in the Brown Corpus family ‘historical’. Like these first synchronic corpora, the diachronic corpora of the 1990s were carefully designed but small. The Helsinki Corpus, for example, amounts to c. 1.5 million running words. Twenty years on, corpora of this kind are sometimes called ‘bijou’ corpora, in contrast to the hundreds of millions of words contained, for example, in COHA, a monitor corpus of historical American English (for more details on English historical corpora, see CoRD).

This paper considers the various material and methodological issues in English historical corpus linguistics that have changed since the pioneering days twenty years ago. I will suggest a division of labour between ‘legacy’ corpora and their mega-sized successors, and discuss the trade-off between corpus annotation and corpus size.

Original language: English
Title of host publication: New Methods in Historical Corpora
Editors: Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt
Number of pages: 17
Place of Publication: Tübingen
Publisher: Gunter Narr Verlag
Publication date: 2013
Pages: 37-53
ISBN (Print): 978-3-8233-6760-4
Publication status: Published - 2013
MoE publication type: A3 Book chapter

Publication series

Name: Corpus Linguistics and Interdisciplinary Perspectives on Language
Publisher: Gunter Narr Verlag
Volume: 3
ISSN (Print): 2191-9577

Fields of Science

  • 6121 Languages
  • corpora
  • historical linguistics

Cite this

Nevalainen, T. (2013). English historical corpora in transition: from new tools to legacy corpora? In P. Bennett, M. Durrell, S. Scheible, & R. J. Whitt (Eds.), New Methods in Historical Corpora (pp. 37-53). (Corpus Linguistics and Interdisciplinary Perspectives on Language; Vol. 3). Tübingen: Gunter Narr Verlag.