Significance testing of word frequencies in corpora

Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, Heikki Mannila

Research output: Contribution to journal, Article, Scientific, Peer-reviewed

Abstract

Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76.), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank sum test, or bootstrap test for comparing word frequencies across corpora.
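
The contrast drawn in the abstract, between treating every word token as independent and treating each text as the unit of sampling, can be illustrated with a minimal Python sketch. The toy counts, the function names, and the particular bootstrap variant are assumptions made for illustration; they are not taken from the article and need not match the procedure its authors used.

```python
import math
import random

# Hypothetical toy data: one (word_count, text_length) pair per text.
# Corpus A contains a single text in which the word is bursty (poorly dispersed).
corpus_a = [(0, 1000), (0, 1000), (1, 1000), (0, 1000), (0, 1000),
            (60, 1000), (0, 1000), (1, 1000), (0, 1000), (0, 1000)]
corpus_b = [(1, 1000), (0, 1000), (1, 1000), (2, 1000), (0, 1000),
            (1, 1000), (1, 1000), (0, 1000), (2, 1000), (1, 1000)]


def pooled_chi2_p(texts_a, texts_b):
    """Pearson chi-square test (1 df, no continuity correction) on pooled counts.

    Every word token is treated as an independent sample.
    """
    count_a = sum(c for c, _ in texts_a)
    size_a = sum(n for _, n in texts_a)
    count_b = sum(c for c, _ in texts_b)
    size_b = sum(n for _, n in texts_b)
    total = size_a + size_b
    observed = [[count_a, size_a - count_a], [count_b, size_b - count_b]]
    rows = [size_a, size_b]
    cols = [count_a + count_b, total - count_a - count_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2))


def bootstrap_p(texts_a, texts_b, n_resamples=5000, seed=0):
    """Two-sided bootstrap test that resamples whole texts with replacement.

    Only texts, not individual words, are assumed to be independent.
    """
    rng = random.Random(seed)

    def resampled_freq(texts):
        sample = rng.choices(texts, k=len(texts))
        return sum(c for c, _ in sample) / sum(n for _, n in sample)

    a_le_b = 0  # resamples where the frequency in A is <= the frequency in B
    a_ge_b = 0  # resamples where the frequency in A is >= the frequency in B
    for _ in range(n_resamples):
        freq_a = resampled_freq(texts_a)
        freq_b = resampled_freq(texts_b)
        a_le_b += freq_a <= freq_b
        a_ge_b += freq_a >= freq_b
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return min(1.0, 2 * min(1 + a_le_b, 1 + a_ge_b) / (1 + n_resamples))


p_chi2 = pooled_chi2_p(corpus_a, corpus_b)
p_boot = bootstrap_p(corpus_a, corpus_b)
print(f"pooled chi-square p = {p_chi2:.2e}")
print(f"text-level bootstrap p = {p_boot:.3f}")
```

In this toy setting almost all occurrences in corpus A come from one text, so the pooled test reports a very small p-value while the text-level bootstrap does not. This is the kind of overestimated significance for poorly dispersed words that the abstract warns about.
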
Original language: English
Journal: Digital Scholarship in the Humanities : DSH
Volume: 31
Issue number: 2
Pages: 374-397
Number of pages: 24
ISSN: 2055-7671
DOI: 10.1093/llc/fqu064
Status: Published - 2016
Ministry of Education and Culture (OKM) publication type: A1 Original article in a scientific journal, peer-reviewed

Fields of science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Lijffijt, Jefrey ; Nevalainen, Terttu ; Säily, Tanja ; Papapetrou, Panagiotis ; Puolamäki, Kai ; Mannila, Heikki. / Significance testing of word frequencies in corpora. In: Digital Scholarship in the Humanities : DSH. 2016 ; Vol. 31, No. 2. pp. 374-397.
@article{dbfc0409e16a43e28f2d4fcc890b4ff4,
title = "Significance testing of word frequencies in corpora",
abstract = "Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76.), the use of the $\chi^2$ and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank sum test, or bootstrap test for comparing word frequencies across corpora.",
keywords = "113 Computer and information sciences, significance testing, bootstrap, chi-square test, log-likelihood ratio test, keywords, 6121 Languages, corpus linguistics, text corpora, British National Corpus",
author = "Jefrey Lijffijt and Terttu Nevalainen and Tanja S{\"a}ily and Panagiotis Papapetrou and Kai Puolam{\"a}ki and Heikki Mannila",
year = "2016",
doi = "10.1093/llc/fqu064",
language = "English",
volume = "31",
pages = "374--397",
journal = "Digital Scholarship in the Humanities : DSH",
issn = "2055-7671",
publisher = "Oxford University Press",
number = "2",

}

TY  - JOUR
T1  - Significance testing of word frequencies in corpora
AU  - Lijffijt, Jefrey
AU  - Nevalainen, Terttu
AU  - Säily, Tanja
AU  - Papapetrou, Panagiotis
AU  - Puolamäki, Kai
AU  - Mannila, Heikki
PY  - 2016
Y1  - 2016
AB  - Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76.), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank sum test, or bootstrap test for comparing word frequencies across corpora.
KW  - 113 Computer and information sciences
KW  - significance testing
KW  - bootstrap
KW  - chi-square test
KW  - log-likelihood ratio test
KW  - keywords
KW  - 6121 Languages
KW  - corpus linguistics
KW  - text corpora
KW  - British National Corpus
U2  - 10.1093/llc/fqu064
DO  - 10.1093/llc/fqu064
M3  - Article
VL  - 31
SP  - 374
EP  - 397
JO  - Digital Scholarship in the Humanities : DSH
JF  - Digital Scholarship in the Humanities : DSH
SN  - 2055-7671
IS  - 2
ER  -