Between Diachrony and Synchrony

Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Scientific; peer-review

Abstract

The National Library of Finland has digitized the historical newspapers and journals published in Finland between 1771 and 1910 [1, 2]. The whole collection up to 1910 comprises about 3.1 million pages. The newspaper collection contains approximately 1.961 million pages, mostly in Finnish and Swedish. The Finnish part of the collection consists of about 1 063 648 pages and the Swedish part of 892 101 pages. Additionally, there are 11 548 pages in German and Russian. The Finnish part of the collection has about 2.407 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. An open data delivery package of the whole text material has been produced recently and will be made publicly available later this year [3]. The quality of OCRed collections is an important topic in digital humanities, as it affects the general usability, searchability and advanced processing, such as content mining, of collections [4, 5]. No single method is available for assessing the quality of large collections, but different methods can be used to approximate it. This paper uses corpus-analysis-style methods to approximate the overall lexical quality of the Finnish part of the Digi collection. The methods include the use of parallel samples and word error rates, the use of morphological analyzers, frequency analysis of words, and comparisons with comparable edited lexical data of the same era. Our aim in the quality analysis is twofold: first, to analyze the present state of the lexical data, and second, to establish a set of methods that together form a compact procedure for quality assessment after, e.g., re-OCRing or post-correction of the material.
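As a rough illustration of the word error rate named among the methods above, the following sketch computes it as the word-level edit distance between a clean parallel sample and its OCR output, divided by the number of words in the clean sample. The function name `word_error_rate` and the sample strings are hypothetical and not taken from the paper; the authors' own tooling and data may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Invented example: a clean sample and a hypothetical noisy OCR rendering.
clean = "sanomalehti ilmestyi Helsingissä joka viikko"
ocr = "sanomalehti ilmcstyi Helsingissä joka wiikko"
print(word_error_rate(clean, ocr))
```

Whitespace tokenization is a simplifying assumption here; a real evaluation would also need to decide how to treat punctuation, hyphenation at line breaks, and case.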
Original language: English
Title of host publication: Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016)
Number of pages: 8
Place of Publication: Amsterdam
Publisher: IOS Press
Publication date: 2016
Pages: 122-129
ISBN (Print): 978-1-61499-700-9
ISBN (Electronic): 978-1-61499-701-6
DOIs: https://doi.org/10.3233/978-1-61499-701-6-122
Publication status: Published - 2016
MoE publication type: A4 Article in conference proceedings
Event: International Conference Human Language Technologies: The Baltic Perspective - Riga, Latvia
Duration: 6 Oct 2016 to 7 Oct 2016
Conference number: 7

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Publisher: IOS Press
Number: 289
ISSN (Print): 0922-6389
ISSN (Electronic): 1879-8314

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Kettunen, K. T., Pääkkönen, T. A., & Koistinen, J. M. O. (2016). Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. In Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016) (pp. 122-129). (Frontiers in Artificial Intelligence and Applications; No. 289). Amsterdam: IOS PRESS. https://doi.org/10.3233/978-1-61499-701-6-122
Kettunen, Kimmo Tapio ; Pääkkönen, Tuula Anneli ; Koistinen, Jani Mika Olavi. / Between Diachrony and Synchrony : Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016). Amsterdam : IOS PRESS, 2016. pp. 122-129 (Frontiers in Artificial Intelligence and Applications; 289).
@inproceedings{8827f3dcefcd4c618d1933b47d6dfc59,
title = "Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Kettunen, {Kimmo Tapio} and P{\"a}{\"a}kk{\"o}nen, {Tuula Anneli} and Koistinen, {Jani Mika Olavi}",
year = "2016",
doi = "10.3233/978-1-61499-701-6-122",
language = "English",
isbn = "978-1-61499-700-9",
series = "Frontiers in Artificial Intelligence and Applications",
publisher = "IOS PRESS",
number = "289",
pages = "122--129",
booktitle = "Human Language Technologies – The Baltic Perspective",
address = "Amsterdam",

}

Kettunen, KT, Pääkkönen, TA & Koistinen, JMO 2016, Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. in Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016). Frontiers in Artificial Intelligence and Applications, no. 289, IOS PRESS, Amsterdam, pp. 122-129, International Conference Human Language Technologies: The Baltic Perspective, Riga, Latvia, 06/10/2016. https://doi.org/10.3233/978-1-61499-701-6-122

Between Diachrony and Synchrony : Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. / Kettunen, Kimmo Tapio; Pääkkönen, Tuula Anneli; Koistinen, Jani Mika Olavi.

Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016). Amsterdam : IOS PRESS, 2016. p. 122-129 (Frontiers in Artificial Intelligence and Applications; No. 289).

TY - GEN

T1 - Between Diachrony and Synchrony

T2 - Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers

AU - Kettunen, Kimmo Tapio

AU - Pääkkönen, Tuula Anneli

AU - Koistinen, Jani Mika Olavi

PY - 2016

Y1 - 2016

KW - 113 Computer and information sciences

KW - 6121 Languages

U2 - 10.3233/978-1-61499-701-6-122

DO - 10.3233/978-1-61499-701-6-122

M3 - Conference contribution

SN - 978-1-61499-700-9

T3 - Frontiers in Artificial Intelligence and Applications

SP - 122

EP - 129

BT - Human Language Technologies – The Baltic Perspective

PB - IOS PRESS

CY - Amsterdam

ER -

Kettunen KT, Pääkkönen TA, Koistinen JMO. Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. In Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016). Amsterdam: IOS PRESS. 2016. p. 122-129. (Frontiers in Artificial Intelligence and Applications; 289). https://doi.org/10.3233/978-1-61499-701-6-122