A simple model for recognizing core genres in the BNC

    Research output: Chapter in book/report/conference proceeding › Chapter › Scientific › Peer-reviewed

    Abstract

    Human communicative practices are organized in terms of genres, and people are highly skilled at recognizing genre differences. In text corpora, genres are typically defined on the basis of text-external features, such as medium, function and format. We show that the core genres of face-to-face conversation, prose fiction, broadsheet newspapers, and academic prose can also be reliably recognized based on a small set of text-internal (linguistic) surface features. Using a 40-million-word subset of the British National Corpus (BNC), we study select text-internal surface features that capture language complexity. We show that externally defined genres differ substantially from each other, and that pairs of surface features, such as counts of nouns and pronouns, or average word lengths and type/token ratios, make it possible to recognize those highly productive genres with high accuracy (> 90%). Furthermore, our model can be used to get a quick overview of the structure of a corpus, which is very useful when exploring big and diverse corpora. The model can also be used to detect errors in the genre annotation of the BNC and to develop software for detecting genre differences. By applying the model to the Lancaster–Oslo/Bergen Corpus of British English, we also demonstrate that it generalizes well across corpora of different sizes. Not unexpectedly, native speakers are still found to outperform the model, especially when very short text samples are analysed.
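As a rough illustration of one feature pair named in the abstract (average word length and type/token ratio), the following sketch computes both values for a text sample. It is not the authors' implementation: the function name `surface_features` and the simple regex tokenizer are assumptions made here for illustration, and the abstract does not specify how tokens were counted or how the features were combined into a classifier.

```python
import re


def surface_features(text):
    """Return (average word length, type/token ratio) for a text sample.

    Assumption: tokens are lowercase alphabetic strings, optionally with
    one internal apostrophe (e.g. "don't"). The tokenization used in the
    chapter itself may differ.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    # Mean number of characters per token.
    avg_word_length = sum(len(t) for t in tokens) / len(tokens)
    # Number of distinct word forms divided by number of tokens.
    type_token_ratio = len(set(tokens)) / len(tokens)
    return avg_word_length, type_token_ratio


if __name__ == "__main__":
    sample = "The cat saw the dog"
    print(surface_features(sample))
```

Note that the type/token ratio depends on sample length, so values are only comparable between samples of similar size.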
    Original language: English
    Title of host publication: Big and Rich Data in English Corpus Linguistics: Methods and Explorations
    Editors: Turo Hiltunen, Joseph McVeigh, Tanja Säily
    Volume: 19
    Place of publication: Helsinki
    Publisher: Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki
    Publication date: 2017
    Status: Published - 2017
    MoE publication type: A3 Part of a book or another research book

    Publication series

    Name: Studies in Variation, Contacts and Change in English
    Publisher: VARIENG
    Volume: 19
    ISSN (electronic): 1797-4453

    Fields of science

    • 6121 Languages
