À la recherche du nom perdu – Searching for Named Entities with Stanford NER in a Finnish Historical Newspaper and Journal Collection

Teemu Petteri Ruokolainen, Kimmo Tapio Kettunen

Research output: Conference materials › Paper › peer-review


This paper presents work carried out at the National Library of Finland to detect names of locations and persons in a Finnish historical newspaper and journal collection covering 1771–1920. The work and results reported are based on a 500 000 word ground truth (GT) sample of the Finnish-language part of the collection, with varying Optical Character Recognition (OCR) quality.

Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts (newspapers, fiction, historical records) and many types of entities (persons, locations, chemical compounds, protein families, animals, etc.). The performance of a NER system is usually heavily dependent on genre and domain. The entity categories used in NER may also vary. The most commonly used set is some version of a tripartite categorization into locations, persons and organizations [1].

In our work we use a standard trainable statistical NER engine, Stanford NER. Considering the quality of our data and the complexities of the Finnish language, our NER results can be considered good. With our ground truth data we achieve an F-score of 0.89 for locations and 0.84 for persons. With re-OCRed Tesseract v. 3.04.01 output, the F-scores are 0.79 for locations and 0.72 for persons.
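
For readers unfamiliar with the F-score figures quoted above, the following is a minimal sketch of how a per-category F-score could be computed from gold and predicted entity spans. It assumes exact-match scoring on (start, end, label) spans and the balanced F1 variant; the abstract does not state which matching scheme or F-score variant was used, and the function name, span representation and the labels "LOC" and "PER" are hypothetical.

```python
from typing import Set, Tuple

# Assumed representation: (start_token, end_token, label). Not from the paper.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity], label: str):
    """Exact-match precision, recall and F1 for one entity category."""
    gold_l = {e for e in gold if e[2] == label}
    pred_l = {e for e in predicted if e[2] == label}
    true_positives = len(gold_l & pred_l)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_l) if pred_l else 0.0
    recall = true_positives / len(gold_l) if gold_l else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = {(0, 1, "PER"), (4, 4, "LOC"), (9, 9, "LOC")}
predicted = {(0, 1, "PER"), (4, 4, "LOC"), (7, 7, "LOC")}

for category in ("LOC", "PER"):
    p, r, f = precision_recall_f1(gold, predicted, category)
    print(f"{category}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy data one of the two predicted locations is correct and one of the two gold locations is found, so precision, recall and F1 for locations are all 0.5.
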
Original language: English
Number of pages: 2
Publication status: Published - Apr 2018
MoE publication type: Not Eligible
Event: IAPR International Workshop on Document Analysis System: DAS 2018 - Wien, Austria
Duration: 24 Apr 2018 → 27 Apr 2018
Conference number: 13


Workshop: IAPR International Workshop on Document Analysis System

Fields of Science

  • 518 Media and communications
