Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?

Research output: Journal contribution, article, scientific, peer-reviewed

Description

Type-token ratio (TTR), also known as vocabulary size divided by text length (V/N), is a simple measure of lexical diversity. It has been used in literary studies, studies in child language and even psychiatry. The basic problem of TTR is that it is affected by the length of the text sample. Several suggestions for remedying this fault have been given, including standardizing the length of text samples and using logarithms in the basic formula. We show in this paper that simple TTR and its more elaborate variant, the moving-average type-token ratio (MATTR), can be used to approximate the morphological complexity of languages. This usage of TTR has been noted by Juola in an analysis of six languages. We analyse text material with TTR and MATTR from two different sources: first, the text of the EU constitution in 21 languages, and second, non-parallel random data from the Leipzig corpus available for 16 of the same languages. We compare the automatic analysis results to two independent linguistic measures of morphological complexity. First, we use the number of non-homographic noun forms in a language’s inflectional paradigms, the paradigm size. Second, we use the available inflectional synthesis figures of verbs produced by the AUTOTYP project. We enrich our corpus findings with data from information retrieval (IR) results. It has been suggested that the improvements in IR effectiveness achieved by managing word form variation depend on the morphological complexity of the languages. Thus this IR gain data can be used as independent evidence in the evaluation of morphological complexity. Our results show that the earlier Juola complexity figures and the TTR and MATTR calculations correlate moderately in the EU constitution data. Figures given by TTR and MATTR correlate highly with each other in both corpora, and they also correlate highly with the number of non-homographic noun forms in a language. Correlation to the inflectional synthesis of verbs was found to be weakly positive in most cases, but the data was scarce. All three computed measures are able to order the languages quite meaningfully by morphological complexity: the ordering at least groups most of the languages with similar languages, and the most and least complex languages are clearly separated. It also seems that TTR and MATTR order the languages quite consistently across both corpora. In the conclusion we discuss how the complexity figures can be utilized.
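
The two corpus measures named in the abstract can be sketched as follows. This is a minimal illustration, not the article's implementation: the function names `ttr` and `mattr`, the whitespace tokenization in the usage note, and the default window of 500 tokens are all assumptions made here, since the abstract does not state how the texts were tokenized or which window length was used.

```python
from collections import Counter


def ttr(tokens):
    """Type-token ratio: distinct word forms (V) divided by text length (N)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=500):
    """Moving-average TTR: mean TTR over every run of `window` consecutive tokens.

    The window length is a hypothetical default, not a value from the article.
    Texts shorter than the window fall back to plain TTR.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(tokens)
    if n <= window:
        return ttr(tokens)

    # Count the types in the first window, then slide one token at a time.
    counts = Counter(tokens[:window])
    total = len(counts) / window
    for i in range(window, n):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]  # the type has left the window entirely
        counts[tokens[i]] += 1
        total += len(counts) / window

    # There are n - window + 1 windows in total.
    return total / (n - window + 1)
```

For example, `ttr("the cat sat on the mat".split())` divides 5 types by 6 tokens. Because every window in `mattr` has the same length, the averaged value does not shrink simply because a text is longer, which is the length dependence of plain TTR that the abstract describes.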
Original language: English
Journal: Journal of Quantitative Linguistics
Volume: 21
Issue number: 3
Pages: 223–245
Number of pages: 22
ISSN: 0929-6174
DOI: 10.1080/09296174.2014.911506
Status: Published - 2014
Ministry of Education and Culture (OKM) publication type: A1 Original article in a scientific journal, peer-reviewed

Fields of science

  • 6121 Languages
  • morphological complexity
  • EU languages

Cite this

@article{4ce4d03c72234ce9834b474683fc340b,
title = "Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?",
abstract = "Type-token ratio (TTR), also known as vocabulary size divided by text length (V/N), is a simple measure of lexical diversity. It has been used in literary studies, studies in child language and even psychiatry. The basic problem of TTR is that it is affected by the length of the text sample. Several suggestions for remedying this fault have been given, including standardizing the length of text samples and using logarithms in the basic formula. We show in this paper that simple TTR and its more elaborate variant, the moving-average type-token ratio (MATTR), can be used to approximate the morphological complexity of languages. This usage of TTR has been noted by Juola in an analysis of six languages. We analyse text material with TTR and MATTR from two different sources: first, the text of the EU constitution in 21 languages, and second, non-parallel random data from the Leipzig corpus available for 16 of the same languages. We compare the automatic analysis results to two independent linguistic measures of morphological complexity. First, we use the number of non-homographic noun forms in a language’s inflectional paradigms, the paradigm size. Second, we use the available inflectional synthesis figures of verbs produced by the AUTOTYP project. We enrich our corpus findings with data from information retrieval (IR) results. It has been suggested that the improvements in IR effectiveness achieved by managing word form variation depend on the morphological complexity of the languages. Thus this IR gain data can be used as independent evidence in the evaluation of morphological complexity. Our results show that the earlier Juola complexity figures and the TTR and MATTR calculations correlate moderately in the EU constitution data. Figures given by TTR and MATTR correlate highly with each other in both corpora, and they also correlate highly with the number of non-homographic noun forms in a language. Correlation to the inflectional synthesis of verbs was found to be weakly positive in most cases, but the data was scarce. All three computed measures are able to order the languages quite meaningfully by morphological complexity: the ordering at least groups most of the languages with similar languages, and the most and least complex languages are clearly separated. It also seems that TTR and MATTR order the languages quite consistently across both corpora. In the conclusion we discuss how the complexity figures can be utilized.",
keywords = "6121 Languages, morphological complexity, EU languages",
author = "Kimmo Kettunen",
year = "2014",
doi = "10.1080/09296174.2014.911506",
language = "English",
volume = "21",
pages = "223--245",
journal = "Journal of Quantitative Linguistics",
issn = "0929-6174",
publisher = "Routledge",
number = "3",

}

Can Type-Token Ratio be Used to Show Morphological Complexity of Languages? / Kettunen, Kimmo.

In: Journal of Quantitative Linguistics, Vol. 21, No. 3, 2014, p. 223–245.

TY - JOUR

T1 - Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?

AU - Kettunen, Kimmo

PY - 2014

Y1 - 2014

N2 - Type-token ratio (TTR), also known as vocabulary size divided by text length (V/N), is a simple measure of lexical diversity. It has been used in literary studies, studies in child language and even psychiatry. The basic problem of TTR is that it is affected by the length of the text sample. Several suggestions for remedying this fault have been given, including standardizing the length of text samples and using logarithms in the basic formula. We show in this paper that simple TTR and its more elaborate variant, the moving-average type-token ratio (MATTR), can be used to approximate the morphological complexity of languages. This usage of TTR has been noted by Juola in an analysis of six languages. We analyse text material with TTR and MATTR from two different sources: first, the text of the EU constitution in 21 languages, and second, non-parallel random data from the Leipzig corpus available for 16 of the same languages. We compare the automatic analysis results to two independent linguistic measures of morphological complexity. First, we use the number of non-homographic noun forms in a language’s inflectional paradigms, the paradigm size. Second, we use the available inflectional synthesis figures of verbs produced by the AUTOTYP project. We enrich our corpus findings with data from information retrieval (IR) results. It has been suggested that the improvements in IR effectiveness achieved by managing word form variation depend on the morphological complexity of the languages. Thus this IR gain data can be used as independent evidence in the evaluation of morphological complexity. Our results show that the earlier Juola complexity figures and the TTR and MATTR calculations correlate moderately in the EU constitution data. Figures given by TTR and MATTR correlate highly with each other in both corpora, and they also correlate highly with the number of non-homographic noun forms in a language. Correlation to the inflectional synthesis of verbs was found to be weakly positive in most cases, but the data was scarce. All three computed measures are able to order the languages quite meaningfully by morphological complexity: the ordering at least groups most of the languages with similar languages, and the most and least complex languages are clearly separated. It also seems that TTR and MATTR order the languages quite consistently across both corpora. In the conclusion we discuss how the complexity figures can be utilized.

KW - 6121 Languages

KW - morphological complexity

KW - EU languages

U2 - 10.1080/09296174.2014.911506

DO - 10.1080/09296174.2014.911506

M3 - Article

VL - 21

SP - 223

EP - 245

JO - Journal of Quantitative Linguistics

JF - Journal of Quantitative Linguistics

SN - 0929-6174

IS - 3

ER -