Abstract
We address the problem of recognizing and attributing quotes in Finnish news media. Solving this task would enable large-scale analysis of media with respect to the presence and presentation styles of different voices and opinions. We describe the annotation of a corpus of around 1500 media articles with quote attribution and coreference information. Further, we compare two methods for automatic quote recognition: a rule-based method operating on dependency trees and a machine learning method built on top of the BERT language model. We conclude that BERT provides more promising results even with little training data, achieving a 95% F-score on direct quote recognition and 84% on indirect quotes. Finally, we discuss open problems and further associated tasks, especially the necessity of resolving speaker mentions to entity references.
Original language | English |
---|---|
Title of host publication | Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) |
Editors | Tanel Alumäe, Mark Fishel |
Number of pages | 8 |
Place of Publication | Tartu |
Publisher | University of Tartu Library |
Publication date | May 2023 |
Pages | 52-59 |
ISBN (Electronic) | 978-99-1621-999-7 |
Publication status | Published - May 2023 |
MoE publication type | A4 Article in conference proceedings |
Event | Nordic Conference on Computational Linguistics - Tórshavn, Faroe Islands; Duration: 22 May 2023 → 24 May 2023; Conference number: 24 |
Publication series
Name | NEALT Proceedings Series |
---|---|
Publisher | University of Tartu Library |
Number | 52 |
ISSN (Print) | 1736-8197 |
ISSN (Electronic) | 1736-6305 |
Fields of Science
- 6121 Languages
- 113 Computer and information sciences