Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger

Kimmo Tapio Kettunen, Laura Löfberg

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

Abstract

Named Entity Recognition (NER), search, classification and tagging of names and name like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent. Also used entity categories vary a lot (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of three part categorization of locations, persons and corporations.

In this paper we report evaluation results of NER with two different data: digitized Finnish historical newspaper collection Digi and modern Finnish technology news, Digitoday. Historical newspaper collection Digi contains 1,960,921 pages of newspaper material from years 1771–1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 70–75%, and its NER evaluation collection consists of 75 931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday’s annotated collection consists of 240 articles in six different sections of the newspaper.

Our new evaluated tool for NER tagging is non-conventional: it is a rule-based semantic tagger of Finnish, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER. The FST achieves up to 55–61 F-score with locations and F-score of 51–52 with persons with the historical newspaper data, and its performance is comparative to FiNER with locations. With the modern Finnish technology news of Digitoday FiNER achieves F-scores of up to 79 with locations at best. Person names show worst performance; their F-score varies from 33 to 66. The FST performs equally well as FiNER with Digitoday’s location names, but is worse with persons. With corporations, FST is at its worst, while FiNER performs reasonably well.

Overall our results show that a general semantic tool like the FST is able to perform in a restricted semantic task of name recognition almost as well as a dedicated NE tagger. As NER is a popular task in information extraction and retrieval, our results show that NE tagging does not need to be only a task of dedicated NE taggers, but it can be performed equally well with more general multipurpose semantic tools.
Original languageEnglish
Title of host publication Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
EditorsJörg Tiedemann
Number of pages8
Place of PublicationLinköping
PublisherLinköping University Electronic Press
Publication dateMay 2017
Pages29-36
ISBN (Electronic)978-91-7685-601-7
Publication statusPublished - May 2017
MoE publication typeA4 Article in conference proceedings
EventNordic Conference on Computational Linguistics - Gothenburg, Sweden
Duration: 22 May 201724 May 2017
Conference number: 21 (NoDaLiDa)

Publication series

NameLinköping Electronic Conference Proceedings
PublisherLinköping University Electronic Press, Linköpings universitet
Volume131
ISSN (Print)1650-3686
ISSN (Electronic)1650-3740
NameNEALT Proceedings Series
Volume29

Fields of Science

  • 113 Computer and information sciences

Cite this

Kettunen, K. T., & Löfberg, L. (2017). Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger. In J. Tiedemann (Ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden (pp. 29-36). (Linköping Electronic Conference Proceedings; Vol. 131), (NEALT Proceedings Series; Vol. 29). Linköping: Linköping University Electronic Press.
Kettunen, Kimmo Tapio ; Löfberg, Laura. / Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger. Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. editor / Jörg Tiedemann. Linköping : Linköping University Electronic Press, 2017. pp. 29-36 (Linköping Electronic Conference Proceedings). (NEALT Proceedings Series).
@inproceedings{97c0d41132d9447f8b8abd88deccbdd4,
title = "Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger",
abstract = "Named Entity Recognition (NER), search, classification and tagging of names and name like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent. Also used entity categories vary a lot (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of three part categorization of locations, persons and corporations. In this paper we report evaluation results of NER with two different data: digitized Finnish historical newspaper collection Digi and modern Finnish technology news, Digitoday. Historical newspaper collection Digi contains 1,960,921 pages of newspaper material from years 1771–1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 70–75{\%}, and its NER evaluation collection consists of 75 931 words (Kettunen and P{\"a}{\"a}kk{\"o}nen, 2016; Kettunen et al., 2016). Digitoday’s annotated collection consists of 240 articles in six different sections of the newspaper. Our new evaluated tool for NER tagging is non-conventional: it is a rule-based semantic tagger of Finnish, the FST (L{\"o}fberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER. The FST achieves up to 55–61 F-score with locations and F-score of 51–52 with persons with the historical newspaper data, and its performance is comparative to FiNER with locations. With the modern Finnish technology news of Digitoday FiNER achieves F-scores of up to 79 with locations at best. Person names show worst performance; their F-score varies from 33 to 66. The FST performs equally well as FiNER with Digitoday’s location names, but is worse with persons. With corporations, FST is at its worst, while FiNER performs reasonably well.Overall our results show that a general semantic tool like the FST is able to perform in a restricted semantic task of name recognition almost as well as a dedicated NE tagger. As NER is a popular task in information extraction and retrieval, our results show that NE tagging does not need to be only a task of dedicated NE taggers, but it can be performed equally well with more general multipurpose semantic tools.",
keywords = "113 Computer and information sciences",
author = "Kettunen, {Kimmo Tapio} and Laura L{\"o}fberg",
note = "Volume: Proceeding volume:",
year = "2017",
month = "5",
language = "English",
series = "Link{\"o}ping Electronic Conference Proceedings",
publisher = "Link{\"o}ping University Electronic Press",
pages = "29--36",
editor = "J{\"o}rg Tiedemann",
booktitle = "Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden",
address = "Sweden",

}

Kettunen, KT & Löfberg, L 2017, Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger. in J Tiedemann (ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Linköping Electronic Conference Proceedings, vol. 131, NEALT Proceedings Series, vol. 29, Linköping University Electronic Press, Linköping, pp. 29-36, Nordic Conference on Computational Linguistics, Gothenburg, Sweden, 22/05/2017.

Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger. / Kettunen, Kimmo Tapio; Löfberg, Laura.

Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. ed. / Jörg Tiedemann. Linköping : Linköping University Electronic Press, 2017. p. 29-36 (Linköping Electronic Conference Proceedings; Vol. 131), (NEALT Proceedings Series; Vol. 29).

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

TY - GEN

T1 - Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger

AU - Kettunen, Kimmo Tapio

AU - Löfberg, Laura

N1 - Volume: Proceeding volume:

PY - 2017/5

Y1 - 2017/5

N2 - Named Entity Recognition (NER), search, classification and tagging of names and name like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent. Also used entity categories vary a lot (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of three part categorization of locations, persons and corporations. In this paper we report evaluation results of NER with two different data: digitized Finnish historical newspaper collection Digi and modern Finnish technology news, Digitoday. Historical newspaper collection Digi contains 1,960,921 pages of newspaper material from years 1771–1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 70–75%, and its NER evaluation collection consists of 75 931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday’s annotated collection consists of 240 articles in six different sections of the newspaper. Our new evaluated tool for NER tagging is non-conventional: it is a rule-based semantic tagger of Finnish, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER. The FST achieves up to 55–61 F-score with locations and F-score of 51–52 with persons with the historical newspaper data, and its performance is comparative to FiNER with locations. With the modern Finnish technology news of Digitoday FiNER achieves F-scores of up to 79 with locations at best. Person names show worst performance; their F-score varies from 33 to 66. The FST performs equally well as FiNER with Digitoday’s location names, but is worse with persons. With corporations, FST is at its worst, while FiNER performs reasonably well.Overall our results show that a general semantic tool like the FST is able to perform in a restricted semantic task of name recognition almost as well as a dedicated NE tagger. As NER is a popular task in information extraction and retrieval, our results show that NE tagging does not need to be only a task of dedicated NE taggers, but it can be performed equally well with more general multipurpose semantic tools.

AB - Named Entity Recognition (NER), search, classification and tagging of names and name like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent. Also used entity categories vary a lot (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of three part categorization of locations, persons and corporations. In this paper we report evaluation results of NER with two different data: digitized Finnish historical newspaper collection Digi and modern Finnish technology news, Digitoday. Historical newspaper collection Digi contains 1,960,921 pages of newspaper material from years 1771–1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 70–75%, and its NER evaluation collection consists of 75 931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday’s annotated collection consists of 240 articles in six different sections of the newspaper. Our new evaluated tool for NER tagging is non-conventional: it is a rule-based semantic tagger of Finnish, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER. The FST achieves up to 55–61 F-score with locations and F-score of 51–52 with persons with the historical newspaper data, and its performance is comparative to FiNER with locations. With the modern Finnish technology news of Digitoday FiNER achieves F-scores of up to 79 with locations at best. Person names show worst performance; their F-score varies from 33 to 66. The FST performs equally well as FiNER with Digitoday’s location names, but is worse with persons. With corporations, FST is at its worst, while FiNER performs reasonably well.Overall our results show that a general semantic tool like the FST is able to perform in a restricted semantic task of name recognition almost as well as a dedicated NE tagger. As NER is a popular task in information extraction and retrieval, our results show that NE tagging does not need to be only a task of dedicated NE taggers, but it can be performed equally well with more general multipurpose semantic tools.

KW - 113 Computer and information sciences

M3 - Conference contribution

T3 - Linköping Electronic Conference Proceedings

SP - 29

EP - 36

BT - Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

A2 - Tiedemann, Jörg

PB - Linköping University Electronic Press

CY - Linköping

ER -

Kettunen KT, Löfberg L. Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger. In Tiedemann J, editor, Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Linköping: Linköping University Electronic Press. 2017. p. 29-36. (Linköping Electronic Conference Proceedings). (NEALT Proceedings Series).