À la recherche du nom perdu – Searching for Named Entities with Stanford NER in a Finnish Historical Newspaper and Journal Collection

Teemu Petteri Ruokolainen, Kimmo Tapio Kettunen

Research output: Conference materials › Conference presentation › peer-reviewed


This paper presents work carried out at the National Library of Finland to detect names of locations and persons in a Finnish historical newspaper and journal collection covering 1771–1920. The work and results reported are based on a 500 000 word ground truth (GT) sample of the Finnish-language part of the collection, with varying Optical Character Recognition (OCR) quality.

Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and many types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. The performance of a NER system is usually heavily genre and domain dependent. The entity categories used in NER may also vary. The most commonly used set of named entity categories is some version of a tripartite categorization into locations, persons and organizations [1].

In our work we use a standard trainable statistical NER engine, Stanford NER. Considering the quality of our data and the complexities of the Finnish language, our NER results can be considered good. With our ground truth data we achieve an F-score of 0.89 for locations and 0.84 for persons. With re-OCRed Tesseract v. 3.04.01 output, the F-scores are 0.79 for locations and 0.72 for persons.
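
The F-scores quoted above combine precision and recall into a single figure per entity category. The abstract does not state how matches were counted, so the following is only a minimal sketch that assumes exact-match, entity-level scoring and the balanced F-score (F1). The `Entity` tuple layout, the `f_score` helper and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_token, end_token, label) for one mention.
Entity = Tuple[int, int, str]


def f_score(gold: Set[Entity], predicted: Set[Entity], label: str) -> float:
    """Balanced F-score (F1) for one category, assuming exact span matches."""
    gold_l = {e for e in gold if e[2] == label}
    pred_l = {e for e in predicted if e[2] == label}
    true_pos = len(gold_l & pred_l)
    if true_pos == 0:
        # No correct predictions (or no entities at all): define F1 as 0.
        return 0.0
    precision = true_pos / len(pred_l)  # share of predictions that are correct
    recall = true_pos / len(gold_l)     # share of gold entities that were found
    return 2 * precision * recall / (precision + recall)


# Toy annotations, invented purely for illustration.
gold = {(0, 1, "PER"), (4, 4, "LOC"), (9, 9, "LOC")}
predicted = {(0, 1, "PER"), (4, 4, "LOC"), (7, 7, "LOC")}

for category in ("LOC", "PER"):
    print(category, f_score(gold, predicted, category))
```

Scoring each category separately, as here, is what makes it possible to report distinct figures for locations and persons.
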
Status: Published - Apr 2018
MinEdu publication type: Not applicable
Event: IAPR International Workshop on Document Analysis System: DAS 2018 - Vienna, Austria
Duration: 24 Apr 2018 to 27 Apr 2018
Conference number: 13


Workshop: IAPR International Workshop on Document Analysis System


  • 518 Media and communications
