Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers

Kimmo Tapio Kettunen, Tuula Anneli Pääkkönen, Jani Mika Olavi Koistinen

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Scientific; peer-review

    Abstract

    The National Library of Finland has digitized the historical newspapers and journals published in Finland between 1771 and 1910 [1,2]. The whole collection up to 1910 comprises about 3.1 million pages. The newspaper collection contains approximately 1.961 million pages, mostly in Finnish and Swedish. The Finnish part of the collection consists of about 1 063 648 pages, and the Swedish part of 892 101 pages. Additionally, there are 11 548 pages in German and Russian. The Finnish part of the collection has about 2.407 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. An open data delivery package of the whole text material has been produced recently, and it will be made publicly available later this year [3]. The quality of OCRed collections is an important topic in digital humanities, as it affects the general usability and searchability of collections as well as advanced processing such as content mining [4, 5]. No single method is available for assessing the quality of large collections, but different methods can be used to approximate it. This paper uses corpus-analysis-style methods to approximate the overall lexical quality of the Finnish part of the Digi collection. The methods include the use of parallel samples and word error rates, the use of morphological analyzers, frequency analysis of words, and comparisons to comparable edited lexical data of the same era. Our aim in the quality analysis is twofold: first, to analyze the present state of the lexical data, and second, to establish a set of methods that together form a compact procedure for quality assessment after, e.g., re-OCRing or post-correction of the material.
    Original language: English
    Title of host publication: Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016)
    Number of pages: 8
    Place of Publication: Amsterdam
    Publisher: IOS Press
    Publication date: 2016
    Pages: 122-129
    ISBN (Print): 978-1-61499-700-9
    ISBN (Electronic): 978-1-61499-701-6
    Publication status: Published - 2016
    MoE publication type: A4 Article in conference proceedings
    Event: International Conference Human Language Technologies: The Baltic Perspective - Riga, Latvia
    Duration: 6 Oct 2016 → 7 Oct 2016
    Conference number: 7

    Publication series

    Name: Frontiers in Artificial Intelligence and Applications
    Publisher: IOS Press
    Number: 289
    ISSN (Print): 0922-6389
    ISSN (Electronic): 1879-8314

    Fields of Science

    • 113 Computer and information sciences
    • 6121 Languages
