Can Language Models Identify Wikipedia Articles with Readability and Style Issues?

Research output: Chapter in book/report/conference proceeding › Conference article › Scientific › peer-reviewed

Abstract

Wikipedia is frequently criticised for poor readability and style issues. In this article, we investigate using GPT-2, a neural language model, to identify poorly written text in Wikipedia by ranking documents by their perplexity. We evaluated the properties of this ranking using human assessments of text quality, including readability, narrativity, and language use. We demonstrate that GPT-2 perplexity scores correlate moderately to strongly with narrativity, but only weakly with reading comprehension scores. Importantly, the model is sensitive even to the small improvements to text typical of Wikipedia edits. We conclude by noting that, counter-intuitively, Wikipedia's featured articles contain the text with the highest perplexity scores. These examples illustrate many of the complexities that must be resolved before such an approach can be used in practice.
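The ranking described above relies on perplexity, the exponentiated negative mean log-probability a language model assigns to a text's tokens. The paper's actual pipeline uses GPT-2; the following is only a minimal sketch of the perplexity formula itself, with hypothetical hand-picked log-probability values standing in for real model outputs.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability.

    Lower perplexity means the model found the text more predictable;
    under the paper's hypothesis, higher perplexity flags text that
    may be poorly written.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Illustrative (not model-generated) log-probabilities: a fluent
# sentence receives less negative values than an awkward one, so
# its perplexity is lower and it ranks as better written.
fluent_logprobs = [-1.0, -0.5, -0.8]
awkward_logprobs = [-3.0, -2.5, -2.8]
assert perplexity(fluent_logprobs) < perplexity(awkward_logprobs)
```

In practice the per-token log-probabilities would come from scoring each Wikipedia document with GPT-2, and documents would then be sorted by the resulting perplexity values.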
Original language: English
Title: ICTIR '21: Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval
Number of pages: 5
Publisher: Association for Computing Machinery
Publication date: Aug. 2021
Pages: 113-117
ISBN (electronic): 978-1-4503-8611-1
DOI - permanent links
Status: Published - Aug. 2021
OKM publication type: A4 Article in conference proceedings
Event: International Conference on the Theory of Information Retrieval
Duration: 11 Jul. 2021 - 11 Jul. 2021
Conference number: 11

Fields of science

  • 113 Computer and information sciences
