Measuring Language Closeness by Modeling Regularity

Javad Nouri, Roman Yangarber

Research output: Chapter in book/report/conference proceedings › Conference contribution › Scientific › Peer-reviewed


This paper addresses the problem of measuring similarity between languages, where the term language covers any of the senses denoted by language, dialect, or linguistic variety, as defined by any theory. We argue that an effective way to measure the similarity between languages is to build a probabilistic model that captures as much of the regular correspondence between the languages as possible. This approach yields two benefits. First, given a set of language data, it provides a way of objectively determining, for any two models, which model is better, i.e., more likely to be accurate and informative. Second, given a model, we can determine, in a principled way, how close any two languages are; the better the model, the better its judgments of similarity. We present experiments on data from three language families to support these ideas. In particular, our results demonstrate the arbitrary nature of terms such as language vs. dialect when applied to related languages.
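The model-comparison idea in the abstract can be illustrated with a minimal sketch. All names, correspondence pairs, inventory sizes, and probabilities below are invented for illustration and are not taken from the paper: the point is only that a model capturing more regular correspondence assigns the observed data a higher likelihood, and is therefore objectively judged the better model.

```python
import math

# Hypothetical observed sound correspondences between cognate words
# of two related languages (invented toy data).
observed = ["p:p", "t:t", "k:k", "p:b", "t:d"]

# Assume a toy inventory of 10 possible correspondence types,
# 3 of which are "identity" correspondences (p:p, t:t, k:k).

def model_a(pair):
    # Baseline: uniform over the 10 types; captures no regularity.
    return 1.0 / 10

def model_b(pair):
    # Captures the regularity that identity correspondences dominate:
    # probability 0.2 for each of the 3 identity types, with the
    # remaining 0.4 split evenly over the 7 other types.
    left, right = pair.split(":")
    return 0.2 if left == right else 0.4 / 7

def log_likelihood(model, data):
    # Log-likelihood of the data under a model of correspondences.
    return sum(math.log(model(x)) for x in data)

ll_a = log_likelihood(model_a, observed)
ll_b = log_likelihood(model_b, observed)

# Model B captures more regular correspondence, so it assigns the
# data a higher log-likelihood than the uniform baseline.
print(ll_a, ll_b)
```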
Host publication title: Proceedings of the EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants
Number of pages: 10
Publication date: Oct 2014
ISBN (print): 978-1-937284-96-1
Status: Published - Oct 2014
MoE publication type: A4 Article in a conference publication
Event: Language Technology for Closely Related Languages and Language Variants - Doha, Qatar
Duration: 29 Oct 2014 - 29 Oct 2014
Conference number: EMNLP 2014 (LT4CloseLang)


  • 113 Computer and information sciences
