A simple model for recognizing core genres in the BNC

    Research output: Article in book/report/conference proceedings; Book chapter or article; Scientific; peer-reviewed

    Abstract

    Human communicative practices are organized in terms of genres, and people are highly skilled at recognizing genre differences. In text corpora, genres are typically defined on the basis of text-external features, such as medium, function and format. We show that the core genres of face-to-face conversation, prose fiction, broadsheet newspapers, and academic prose can also be reliably recognized based on a small set of text-internal (linguistic) surface features. Using a 40-million-word subset of the British National Corpus, we study select text-internal surface features that capture language complexity. It is shown that externally defined genres differ substantially from each other, and that, using pairs of surface features, such as counts of nouns and pronouns, or of average word lengths and type/token ratios, it is possible to recognize those highly productive genres with a high degree (> 90%) of accuracy. Furthermore, our model can be used to get a quick overview of the structure of a corpus, which is very useful when exploring big and diverse corpora. It is also possible to detect errors in the genre annotation of the BNC and develop software for detecting genre differences. By applying it to the Lancaster–Oslo/Bergen Corpus of British English, we also demonstrate that the model generalizes well across corpora of different sizes. Not unexpectedly, native speakers are still found to outperform the model, especially when very short text samples are analysed.
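
To make one of the feature pairs named in the abstract concrete, the following is a minimal sketch of how average word length and type/token ratio might be computed for a text sample. It is not the authors' implementation: the function name `surface_features` and the regex-based tokenization are assumptions made purely for illustration, and the abstract does not specify how tokens were identified.

```python
import re


def surface_features(text: str) -> dict[str, float]:
    """Return average word length and type/token ratio for a text sample.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # Assumption: a token is a lowercased run of letters, optionally with
    # internal apostrophes (e.g. "don't"). Real corpus tokenization differs.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    if not tokens:
        # Avoid division by zero for empty or non-alphabetic samples.
        return {"avg_word_length": 0.0, "type_token_ratio": 0.0}

    # Mean number of characters per token.
    avg_word_length = sum(len(t) for t in tokens) / len(tokens)
    # Number of distinct word forms (types) divided by number of tokens.
    type_token_ratio = len(set(tokens)) / len(tokens)

    return {
        "avg_word_length": avg_word_length,
        "type_token_ratio": type_token_ratio,
    }


if __name__ == "__main__":
    print(surface_features("the cat saw the dog"))
```

Because the type/token ratio tends to fall as a sample grows longer, values computed this way are best compared across samples of similar size.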
    Original language: English
    Title of host publication: Big and Rich Data in English Corpus Linguistics: Methods and Explorations
    Editors: Turo Hiltunen, Joseph McVeigh, Tanja Säily
    Volume: 19
    Place of publication: Helsinki
    Publisher: Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki
    Publication date: 2017
    Status: Published - 2017
    OKM publication type: A3 Part of a book or other edited volume

    Publication series

    Name: Studies in Variation, Contacts and Change in English
    Publisher: VARIENG
    Volume: 19
    ISSN (electronic): 1797-4453

    Fields of science

    • 6121 Languages
