Testing the Generalization Power of Neural Network Models Across NLI Benchmarks

Aarne Johannes Talman, Stergios Chatzikyriakidis

Research output: Article in book/report/conference proceedings › Conference article › Scientific › peer-reviewed

Abstract

Neural network models have been very successful in natural language inference, with the best models reaching 90% accuracy in some benchmarks. However, the success of these models turns out to be largely benchmark-specific. We show that models trained on a natural language inference dataset drawn from one benchmark fail to perform well in others, even if the notion of inference assumed in these benchmarks is the same or similar. We train six high-performing neural network models on different datasets and show that each of these has problems generalizing when we replace the original test set with a test set taken from another corpus designed for the same task. In light of these results, we argue that most of the current neural network models are not able to generalize well in the task of natural language inference. We find that using large pre-trained language models helps with transfer learning when the datasets are similar enough. Our results also highlight that the current NLI datasets do not cover the different nuances of inference extensively enough.
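
To make the evaluation protocol concrete, the sketch below illustrates it with off-the-shelf tools; it is not the paper's code. It assumes the HuggingFace datasets library (the snli and multi_nli datasets, which share a 3-way entailment/neutral/contradiction label scheme) and scikit-learn, and substitutes a simple TF-IDF classifier for the paper's six neural models, purely so the train-on-one-benchmark, test-on-another protocol stays short and runnable.

# A minimal sketch of the cross-benchmark protocol: train on one NLI
# benchmark (SNLI) and evaluate on both its own test set and a test set
# drawn from another benchmark (MultiNLI). The classifier is a stand-in
# for the paper's neural models.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def pairs_and_labels(split):
    # Concatenate premise and hypothesis; drop pairs without a gold label (-1).
    texts, labels = [], []
    for ex in split:
        if ex["label"] != -1:
            texts.append(ex["premise"] + " [SEP] " + ex["hypothesis"])
            labels.append(ex["label"])
    return texts, labels

snli = load_dataset("snli")
mnli = load_dataset("multi_nli")

# Subsample SNLI's ~550k training pairs so the sketch runs in minutes.
train_x, train_y = pairs_and_labels(snli["train"].select(range(50000)))
in_x, in_y = pairs_and_labels(snli["test"])                  # in-benchmark test set
out_x, out_y = pairs_and_labels(mnli["validation_matched"])  # cross-benchmark test set

vectorizer = TfidfVectorizer(max_features=50000)
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(train_x), train_y)

print("SNLI -> SNLI accuracy:", accuracy_score(in_y, model.predict(vectorizer.transform(in_x))))
print("SNLI -> MultiNLI accuracy:", accuracy_score(out_y, model.predict(vectorizer.transform(out_x))))

The gap between the two printed accuracies is the paper's object of study: a model that looks strong on its home benchmark can degrade sharply on another corpus built for the same task.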
Original language: English
Title: The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019: Proceedings of the Second Workshop
Editors: Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, Dieuwke Hupkes
Number of pages: 10
Place of publication: Stroudsburg
Publisher: The Association for Computational Linguistics
Publication date: 1 August 2019
Pages: 85-94
ISBN (electronic): 978-1-950737-30-7
Status: Published - 1 August 2019
Publication type (OKM): A4 Article in conference proceedings
Event: 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP - Florence, Italy
Duration: 1 August 2019 - 1 August 2019
Conference number: 2

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Talman, A. J., & Chatzikyriakidis, S. (2019). Testing the Generalization Power of Neural Network Models Across NLI Benchmarks. In T. Linzen, G. Chrupała, Y. Belinkov, & D. Hupkes (Eds.), The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019: Proceedings of the Second Workshop (pp. 85-94). Stroudsburg: The Association for Computational Linguistics.
Talman, Aarne Johannes ; Chatzikyriakidis, Stergios. / Testing the Generalization Power of Neural Network Models Across NLI Benchmarks. The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019: Proceedings of the Second Workshop. Editor / Tal Linzen ; Grzegorz Chrupała ; Yonatan Belinkov ; Dieuwke Hupkes. Stroudsburg : The Association for Computational Linguistics, 2019. pp. 85-94
@inproceedings{bc7912f9a49a4d1e8a83ab3b37164efa,
title = "Testing the Generalization Power of Neural Network Models Across NLI Benchmarks",
abstract = "Neural network models have been very successful in natural language inference, with the best models reaching 90{\%} accuracy in some benchmarks. However, the success of these models turns out to be largely benchmark specific. We show that models trained on a natural language inference dataset drawn from one benchmark fail to perform well in others, even if the notion of inference assumed in these benchmarks is the same or similar. We train six high performing neural network models on different datasets and show that each one of these has problems of generalizing when we replace the original test set with a test set taken from another corpus designed for the same task. In light of these results, we argue that most of the current neural network models are not able to generalize well in the task of natural language inference. We find that using large pre-trained language models helps with transfer learning when the datasets are similar enough. Our results also highlight that the current NLI datasets do not cover the different nuances of inference extensively enough.",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Talman, {Aarne Johannes} and Stergios Chatzikyriakidis",
year = "2019",
month = "8",
day = "1",
language = "English",
pages = "85--94",
editor = "Linzen, {Tal} and Grzegorz Chrupała and Belinkov, {Yonatan} and Hupkes, {Dieuwke}",
booktitle = "The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019",
publisher = "The Association for Computational Linguistics",
address = "United States",

}

Talman, AJ & Chatzikyriakidis, S 2019, Testing the Generalization Power of Neural Network Models Across NLI Benchmarks. in T Linzen, G Chrupała, Y Belinkov & D Hupkes (eds), The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019: Proceedings of the Second Workshop. The Association for Computational Linguistics, Stroudsburg, pp. 85-94, 2019 ACL Workshop BlackboxNLP, Florence, Italy, 01/08/2019.

Testing the Generalization Power of Neural Network Models Across NLI Benchmarks. / Talman, Aarne Johannes; Chatzikyriakidis, Stergios.

The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019: Proceedings of the Second Workshop. ed. / Tal Linzen; Grzegorz Chrupała; Yonatan Belinkov; Dieuwke Hupkes. Stroudsburg : The Association for Computational Linguistics, 2019. pp. 85-94.

Research output: Article in book/report/conference proceedings › Conference article › Scientific › peer-reviewed

TY - GEN

T1 - Testing the Generalization Power of Neural Network Models Across NLI Benchmarks

AU - Talman, Aarne Johannes

AU - Chatzikyriakidis, Stergios

PY - 2019/8/1

Y1 - 2019/8/1

N2 - Neural network models have been very successful in natural language inference, with the best models reaching 90% accuracy in some benchmarks. However, the success of these models turns out to be largely benchmark specific. We show that models trained on a natural language inference dataset drawn from one benchmark fail to perform well in others, even if the notion of inference assumed in these benchmarks is the same or similar. We train six high performing neural network models on different datasets and show that each one of these has problems of generalizing when we replace the original test set with a test set taken from another corpus designed for the same task. In light of these results, we argue that most of the current neural network models are not able to generalize well in the task of natural language inference. We find that using large pre-trained language models helps with transfer learning when the datasets are similar enough. Our results also highlight that the current NLI datasets do not cover the different nuances of inference extensively enough.

AB - Neural network models have been very successful in natural language inference, with the best models reaching 90% accuracy in some benchmarks. However, the success of these models turns out to be largely benchmark specific. We show that models trained on a natural language inference dataset drawn from one benchmark fail to perform well in others, even if the notion of inference assumed in these benchmarks is the same or similar. We train six high performing neural network models on different datasets and show that each one of these has problems of generalizing when we replace the original test set with a test set taken from another corpus designed for the same task. In light of these results, we argue that most of the current neural network models are not able to generalize well in the task of natural language inference. We find that using large pre-trained language models helps with transfer learning when the datasets are similar enough. Our results also highlight that the current NLI datasets do not cover the different nuances of inference extensively enough.

KW - 113 Computer and information sciences

KW - 6121 Languages

M3 - Conference contribution

SP - 85

EP - 94

BT - The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019

A2 - Linzen, Tal

A2 - Chrupała, Grzegorz

A2 - Belinkov, Yonatan

A2 - Hupkes, Dieuwke

PB - The Association for Computational Linguistics

CY - Stroudsburg

ER -

Talman AJ, Chatzikyriakidis S. Testing the Generalization Power of Neural Network Models Across NLI Benchmarks. In Linzen T, Chrupała G, Belinkov Y, Hupkes D, editors, The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019: Proceedings of the Second Workshop. Stroudsburg: The Association for Computational Linguistics. 2019. pp. 85-94