Can Language Models Identify Wikipedia Articles with Readability and Style Issues?

Research output: Chapter in book/report/conference proceeding › Conference contribution › Scientific › Peer-reviewed

Abstract

Wikipedia is frequently criticised for poor readability and stylistic issues. In this article, we investigate using GPT-2, a neural language model, to identify poorly written text in Wikipedia by ranking documents by their perplexity. We evaluated the properties of this ranking using human assessments of text quality, including readability, narrativity and language use. We demonstrate that GPT-2 perplexity scores correlate moderately to strongly with narrativity, but only weakly with reading comprehension scores. Importantly, the model reflects even small improvements to a text, of the kind made in typical Wikipedia edits. We conclude by noting that, counter-intuitively, Wikipedia's featured articles contain the text with the highest perplexity scores; these examples illustrate many of the complexities that must be resolved before such an approach can be used in practice.
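As background (not part of the published abstract or the authors' code): perplexity-based ranking of the kind the abstract describes can be sketched as follows, assuming the Hugging Face transformers GPT-2 implementation, where a passage's perplexity is the exponential of its mean next-token cross-entropy. The checkpoint name, truncation length, and sample texts are illustrative assumptions.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean token cross-entropy."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy loss over the sequence.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return math.exp(loss.item())

# Rank documents from highest to lowest perplexity, i.e. from most to
# least "expected" by the model; high perplexity flags candidate
# passages for review.
docs = ["A well-edited encyclopedic passage.", "a pasage with, bad style an errors"]
ranked = sorted(docs, key=perplexity, reverse=True)
print(ranked)
```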
Original language: English
Host publication title: ICTIR '21: Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval
Number of pages: 5
Publisher: Association for Computing Machinery
Publication date: August 2021
Pages: 113–117
ISBN (electronic): 978-1-4503-8611-1
DOI
Status: Published - August 2021
MoE publication type: A4 Article in conference proceedings
Event: International Conference on the Theory of Information Retrieval
Duration: 11 July 2021 – 11 July 2021
Conference number: 11

Fields of science

  • 113 Computer and information sciences
