Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?

Kimmo Kettunen

Research output: Contribution to journal › Article › Scientific › peer-review


Type-token ratio (TTR), also known as vocabulary size divided by text length (V/N), is a simple measure of lexical diversity. It has been used in literary studies, studies of child language and even psychiatry. The basic problem with TTR is that it is affected by the length of the text sample. Several remedies for this flaw have been suggested, including standardizing the length of text samples and using logarithms in the basic formula. We show in this paper that simple TTR and its more elaborate variant, the moving-average type-token ratio (MATTR), can be used to approximate the morphological complexity of languages. This use of TTR was noted by Juola in an analysis of six languages. We analyse text material with TTR and MATTR from two different sources: first, the text of the EU constitution in 21 languages, and second, non-parallel random data from the Leipzig corpus, available for 16 of the same languages. We compare the automatic analysis results with two independent linguistic measures of morphological complexity. First, we use the number of non-homographic noun forms in a language’s inflectional paradigms, the paradigm size. Second, we use the available inflectional synthesis figures for verbs produced by the AUTOTYP project. We enrich our corpus findings with data from information retrieval (IR) results. It has been suggested that the improvement in IR effectiveness gained by managing word form variation depends on the morphological complexity of the language. This IR gain data can thus provide independent evidence for evaluating morphological complexity. Our results show that the earlier Juola complexity figures correlate moderately with the TTR and MATTR calculations in the EU constitution data. Figures given by TTR and MATTR correlate highly with each other in both corpora, and they also correlate highly with the number of non-homographic noun forms in a language. Correlation with the inflectional synthesis of verbs was weakly positive in most cases, but the data were scarce. All three computed measures order the languages quite meaningfully by morphological complexity: the ordering at least groups most languages with similar languages, and the most and least complex languages are clearly separated. TTR and MATTR also appear to order the languages quite consistently across both corpora. In the conclusion we discuss how the complexity figures can be utilized.
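The two measures named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the window size of 500 tokens, and the fallback to plain TTR for texts shorter than the window are all assumptions made here for illustration, and tokenization is left to the caller.

```python
from collections import Counter


def ttr(tokens):
    """Type-token ratio V/N: distinct word forms divided by total tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=500):
    """Moving-average TTR: mean TTR over every run of `window` consecutive tokens.

    The default window of 500 is an illustrative assumption, not a value
    taken from the paper.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    if window <= 0:
        raise ValueError("window must be positive")
    if n < window:
        # Assumption: texts shorter than the window fall back to plain TTR.
        return ttr(tokens)

    # Type counts for the first window; then slide one token at a time.
    counts = Counter(tokens[:window])
    type_total = len(counts)
    for i in range(window, n):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        type_total += len(counts)

    n_windows = n - window + 1
    return type_total / (window * n_windows)


if __name__ == "__main__":
    sample = "the cat sat on the mat the cat".split()
    print(ttr(sample))
    print(mattr(sample, window=4))
```

Because every window has the same length, MATTR does not shrink as the text grows the way plain TTR does, which is the text-length sensitivity the abstract describes.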
Original language: English
Journal: Journal of Quantitative Linguistics
Issue number: 3
Pages (from-to): 223–245
Number of pages: 22
Publication status: Published - 2014
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 6121 Languages
  • morphological complexity
  • EU languages
