Testing the Generalization Power of Neural Network Models Across NLI Benchmarks

Aarne Johannes Talman, Stergios Chatzikyriakidis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

Neural network models have been very successful in natural language inference, with the best models reaching 90% accuracy on some benchmarks. However, the success of these models turns out to be largely benchmark-specific. We show that models trained on a natural language inference dataset drawn from one benchmark fail to perform well on others, even if the notion of inference assumed in these benchmarks is the same or similar. We train six high-performing neural network models on different datasets and show that each of them has problems generalizing when we replace the original test set with a test set taken from another corpus designed for the same task. In light of these results, we argue that most current neural network models are not able to generalize well in the task of natural language inference. We find that using large pre-trained language models helps with transfer learning when the datasets are similar enough. Our results also highlight that current NLI datasets do not cover the different nuances of inference extensively enough.
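The evaluation protocol behind these results is easy to sketch: train a model on one benchmark's training split, then score it on every benchmark's test split, so that the diagonal of the resulting grid measures in-distribution accuracy and the off-diagonal cells measure cross-benchmark generalization. The minimal, runnable Python sketch below illustrates the protocol only; the toy corpora and the word-overlap "classifier" are hypothetical stand-ins, not the paper's six neural models or its actual benchmarks (e.g. SNLI, MultiNLI, SciTail).

# Minimal, self-contained sketch of the train-on-one / test-on-all protocol
# described in the abstract. The toy corpora and the word-overlap "model"
# are illustrative stand-ins, not the authors' models or data.

# Tiny toy corpora keyed by benchmark name: (premise, hypothesis, gold label).
DATASETS = {
    "toy_benchmark_a": {
        "train": [("a man eats food", "a man eats", "entailment"),
                  ("a dog runs outside", "a cat sleeps", "neutral")],
        "test": [("a woman sings a song", "a woman sings", "entailment"),
                 ("a child draws", "a plane lands", "neutral")],
    },
    "toy_benchmark_b": {
        "train": [("plants use sunlight to grow", "plants need light", "entailment"),
                  ("water boils when heated", "ice forms in winter", "neutral")],
        "test": [("metals conduct electricity", "metals carry electricity", "entailment"),
                 ("the sun is a star", "planets orbit stars", "neutral")],
    },
}

def word_overlap(premise, hypothesis):
    """Fraction of hypothesis words that also occur in the premise."""
    p, h = set(premise.split()), set(hypothesis.split())
    return len(p & h) / max(len(h), 1)

def train(examples):
    """'Fit' a classifier by choosing the overlap threshold that best
    separates entailment from neutral on the training split."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(word_overlap(p, h) for p, h, _ in examples):
        acc = sum((word_overlap(p, h) >= t) == (y == "entailment")
                  for p, h, y in examples) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda p, h: "entailment" if word_overlap(p, h) >= best_t else "neutral"

def accuracy(model, examples):
    return sum(model(p, h) == y for p, h, y in examples) / len(examples)

# Train-on-one / test-on-all grid: diagonal cells are in-distribution,
# off-diagonal cells measure cross-benchmark generalization.
for train_name, data in DATASETS.items():
    model = train(data["train"])
    for test_name, other in DATASETS.items():
        print(f"train={train_name}  test={test_name}  "
              f"acc={accuracy(model, other['test']):.2f}")

On these toy corpora the diagonal cells score higher than the off-diagonal cells, which is the shape of the effect the paper reports for real models and benchmarks.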
Original language: English
Title of host publication: The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019: Proceedings of the Second Workshop
Editors: Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, Dieuwke Hupkes
Number of pages: 10
Place of publication: Stroudsburg
Publisher: The Association for Computational Linguistics
Publication date: 1 Aug 2019
Pages: 85-94
ISBN (Electronic): 978-1-950737-30-7
Publication status: Published - 1 Aug 2019
MoE publication type: A4 Article in conference proceedings
Event: 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP - Florence, Italy
Duration: 1 Aug 2019 - 1 Aug 2019
Conference number: 2

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Talman, A. J., & Chatzikyriakidis, S. (2019). Testing the Generalization Power of Neural Network Models Across NLI Benchmarks. In T. Linzen, G. Chrupała, Y. Belinkov, & D. Hupkes (Eds.), The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019: Proceedings of the Second Workshop (pp. 85-94). Stroudsburg: The Association for Computational Linguistics.
@inproceedings{bc7912f9a49a4d1e8a83ab3b37164efa,
title = "Testing the Generalization Power of Neural Network Models Across NLI Benchmarks",
abstract = "Neural network models have been very successful in natural language inference, with the best models reaching 90{\%} accuracy on some benchmarks. However, the success of these models turns out to be largely benchmark-specific. We show that models trained on a natural language inference dataset drawn from one benchmark fail to perform well on others, even if the notion of inference assumed in these benchmarks is the same or similar. We train six high-performing neural network models on different datasets and show that each of them has problems generalizing when we replace the original test set with a test set taken from another corpus designed for the same task. In light of these results, we argue that most current neural network models are not able to generalize well in the task of natural language inference. We find that using large pre-trained language models helps with transfer learning when the datasets are similar enough. Our results also highlight that current NLI datasets do not cover the different nuances of inference extensively enough.",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Talman, {Aarne Johannes} and Stergios Chatzikyriakidis",
editor = "Tal Linzen and Grzegorz Chrupała and Yonatan Belinkov and Dieuwke Hupkes",
year = "2019",
month = aug,
day = "1",
language = "English",
isbn = "978-1-950737-30-7",
pages = "85--94",
booktitle = "The Workshop BlackboxNLP on Analyzing and Interpreting Neural Networks for NLP at ACL 2019: Proceedings of the Second Workshop",
publisher = "The Association for Computational Linguistics",
address = "Stroudsburg",
}
