Abstract
In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We establish the performance of the BERT model on a publication year prediction task against linear baseline models and human judgement, finding the BERT model to be superior to both and able to date the works, on average, with less than 7 years absolute error. We also explore how language change over time affects the model by analyzing the features the model uses for publication year predictions as given by the Integrated Gradients model explanation method.
Original language | English |
---|---|
Title of host publication | PROCEEDINGS OF THE 3RD INTERNATIONAL WORKSHOP ON COMPUTATIONAL APPROACHES TO HISTORICAL LANGUAGE CHANGE 2022 (LCHANGE 2022) |
Editors | Nina Tahmasebi, Syrielle Montariol, Andrey Kutuzov, Simon Hengchen, Haim Dubossarsky, Lars Borin |
Number of pages | 10 |
Place of publication | Stroudsburg |
Publisher | The Association for Computational Linguistics |
Publication date | 2022 |
Pages | 68–77 |
ISBN (electronic) | 978-1-955917-42-1 |
Status | Published - 2022 |
MoEC publication type | A4 Article in conference proceedings |
Event | Workshop on Computational Approaches to Historical Language Change - [Hybrid event], Dublin, Ireland. Duration: 26 May 2022 → 27 May 2022. Conference number: 3 |
Fields of science
- 6121 Languages
- 113 Computer and information sciences
Projects
- 1 Active
- HPC-HD: High Performance Computing for the Detection and Analysis of Historical Discourses
  Tolonen, M., Mäkelä, E., Mathioudakis, M., Ginter, F. & Babbar, R.
  Academy of Finland (Suomen Akatemia Projektilaskutus)
  01/01/2022 → 31/12/2024
  Project: Research project