Stranger than Paradigms: Word Embedding Benchmarks Don't Align With Morphology

Timothee Mickus, Maria Copot

Research output: Contribution to journal › Article › Scientific › Peer-reviewed

Abstract

Word embeddings have proven a boon in NLP in general, and in computational approaches to morphology in particular. However, methods for assessing the quality of a word embedding model only tangentially target morphological knowledge, which may lead to suboptimal model selection and biased conclusions in research that employs word embeddings to investigate morphology. In this paper, we empirically test this hypothesis by exhaustively evaluating 1,200 French models with varying hyperparameters on 14 different tasks. Models that perform well on morphology tasks tend to differ from those that succeed on more traditional benchmarks. An especially critical hyperparameter appears to be the negative sampling distribution smoothing exponent: our study suggests that the common practice of setting it to 0.75 is not appropriate, as its optimal value depends on the type of linguistic knowledge being tested.
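For context, the smoothing exponent discussed above is the exponent applied to corpus unigram frequencies when drawing negative samples in skip-gram with negative sampling. A minimal statement of that distribution, assuming the standard word2vec formulation (not the paper's own notation), is

% Negative sampling distribution with smoothing exponent \alpha:
% f(w) is the corpus frequency of word w and V is the vocabulary.
% \alpha = 1 samples proportionally to raw frequency, \alpha = 0 samples uniformly,
% and \alpha = 0.75 is the conventional default questioned in this paper.
P_n(w) = \frac{f(w)^{\alpha}}{\sum_{w' \in V} f(w')^{\alpha}}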
Original language: English
Journal: Proceedings of the Society for Computation in Linguistics
Volume: 7
Pages (from-to): 173–189
Number of pages: 17
ISSN: 2834-1007
DOI
Status: Published - 1 June 2024
MoE publication type: A1 Journal article (refereed)

Fields of science

  • 6121 Languages
  • 113 Computer and information sciences
