What do Language Representations Really Represent?

Johannes Bjerva, Robert Mikael Östling, Maria Han Veiga, Jörg Tiedemann, Isabelle Augenstein

Research output: Journal contribution › Article › Scientific › Peer review

Abstract

A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, while genetic relationships---a convenient benchmark used for evaluation in previous work---appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.
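The comparison described in the abstract---similarity in language-embedding space versus structural (typological) similarity---can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the embedding matrix and the binary feature vectors below are random placeholders standing in for language representations induced by a multilingual language model and for WALS-style structural features, and a rank correlation between the two pairwise-distance structures is just one simple way to carry out such a comparison (the paper also considers genetic and geographical similarity and causal analysis).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Hypothetical inputs: one embedding per language (rows), e.g. induced by a
# multilingual language model, plus binary structural feature vectors for the
# same languages (WALS-style). Random placeholders are used here.
rng = np.random.default_rng(0)
lang_embeddings = rng.random((10, 64))            # 10 languages, 64-dim embeddings
struct_features = rng.integers(0, 2, (10, 30))    # 30 binary typological features

# Pairwise distances within each space (condensed upper-triangle form).
emb_dist = pdist(lang_embeddings, metric="cosine")
struct_dist = pdist(struct_features, metric="hamming")

# Rank correlation between the two distance structures: a high value suggests
# that languages close in embedding space also tend to share structure.
rho, p_value = spearmanr(emb_dist, struct_dist)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```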
Original language: English
Journal: Computational Linguistics
ISSN: 0891-2017
Status: Accepted/In press - 2019
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 6121 Languages
  • 113 Computer and information sciences

Cite this

Bjerva, J., Östling, R. M., Han Veiga, M., Tiedemann, J., & Augenstein, I. (Accepted/In press). What do Language Representations Really Represent? Computational Linguistics.
@article{4a007396b8094ef0b3de48306d2e2795,
title = "What do Language Representations Really Represent?",
abstract = "A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just like it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, while genetic relationships---a convenient benchmark used for evaluation in previous work---appears to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.",
keywords = "6121 Languages, language technology, computational linguistics, 113 Computer and information sciences, natural language processing",
author = "Johannes Bjerva and {\"O}stling, {Robert Mikael} and {Han Veiga}, Maria and J{\"o}rg Tiedemann and Isabelle Augenstein",
year = "2019",
language = "English",
journal = "Computational Linguistics",
issn = "0891-2017",
publisher = "MIT Press",

}


TY - JOUR

T1 - What do Language Representations Really Represent?

AU - Bjerva, Johannes

AU - Östling, Robert Mikael

AU - Han Veiga, Maria

AU - Tiedemann, Jörg

AU - Augenstein, Isabelle

PY - 2019

Y1 - 2019

N2 - A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just like it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, while genetic relationships---a convenient benchmark used for evaluation in previous work---appears to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.

AB - A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just like it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, while genetic relationships---a convenient benchmark used for evaluation in previous work---appears to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.

KW - 6121 Languages

KW - language technology

KW - computational linguistics

KW - 113 Computer and information sciences

KW - natural language processing

UR - https://arxiv.org/abs/1901.02646

M3 - Article

JO - Computational Linguistics

JF - Computational Linguistics

SN - 0891-2017

ER -