NorQuAD: Norwegian Question Answering Dataset

Sardana Ivanova, Fredrik Andreassen, Matias Jentoft, Sondre Wold, Lilja Øvrelid

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

In this paper we present NorQuAD: the first Norwegian question answering dataset for machine reading comprehension. The dataset consists of 4,752 manually created question-answer pairs. We here detail the data collection procedure and present statistics of the dataset. We also benchmark several multilingual and Norwegian monolingual language models on the dataset and compare them against human performance. The dataset will be made freely available.
Original language: English
Title of host publication: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Number of pages: 10
Publisher: University of Tartu Library
Publication date: May 2023
Pages: 159-168
ISBN (Electronic): 978-9916-21-999-7
Publication status: Published - May 2023
MoE publication type: A4 Article in conference proceedings
Event: Nordic Conference on Computational Linguistics - Tórshavn, Faroe Islands
Duration: 22 May 2023 to 24 May 2023
Conference number: 24

Publication series

Name: NEALT Proceedings Series
Publisher: University of Tartu Library
Number: 52
ISSN (Electronic): 1736-6305

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages
