Between Diachrony and Synchrony

Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › Peer-reviewed

Abstract

The National Library of Finland has digitized the historical newspapers and journals published in Finland between 1771 and 1910 [1,2]. The whole collection up to 1910 comprises about 3.1 million pages. The newspaper collection contains approximately 1.961 million pages, mostly in Finnish and Swedish. The Finnish part of the collection consists of about 1 063 648 pages, and the Swedish part of 892 101 pages. Additionally, there are 11 548 pages in German and Russian. The Finnish part of the collection has about 2.407 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. An open data delivery package of the whole text material has been produced recently, and it will be made publicly available later this year [3]. The quality of OCRed collections is an important topic in digital humanities, as it affects the general usability and searchability of collections as well as advanced processing such as content mining [4, 5]. No single method is available for assessing the quality of large collections, but different methods can be used to approximate it. This paper uses corpus-analysis-style methods to approximate the overall lexical quality of the Finnish part of the Digi collection. The methods include the use of parallel samples and word error rates, morphological analyzers, frequency analysis of words, and comparisons with comparable edited lexical data of the same era. Our aim in the quality analysis is twofold: firstly, to analyze the present state of the lexical data, and secondly, to establish a set of methods that together form a compact procedure for quality assessment after, e.g., re-OCRing or post-correction of the material.
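Two of the measures named in the abstract, word error rate on parallel samples and the share of words a morphological analyzer recognizes, can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the authors' implementation: the function names are hypothetical, the word error rate is the common word-level Levenshtein formulation, and a plain set of word forms stands in for a real morphological analyzer.

```python
from typing import Callable, Iterable, Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def recognition_rate(tokens: Iterable[str],
                     is_recognized: Callable[[str], bool]) -> float:
    """Share of tokens accepted by an analyzer-like predicate."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    return sum(1 for t in token_list if is_recognized(t)) / len(token_list)


# Hypothetical parallel sample: a hand-corrected line and its OCR output.
ground_truth = "suomen kansa on ahkera".split()
ocr_output = "suomen kanfa on ahkera".split()

# Hypothetical stand-in for a morphological analyzer: a set of known forms.
known_forms = {"suomen", "kansa", "on", "ahkera"}

wer = word_error_rate(ground_truth, ocr_output)
rate = recognition_rate(ocr_output, lambda w: w in known_forms)
```

In this toy sample one of four words is misrecognized, so the word error rate is 0.25 and the recognition rate is 0.75. On real data the two figures need not mirror each other, since an analyzer can reject correct but archaic word forms and accept OCR errors that happen to be valid words.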
Original language: English
Title of host publication: Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016)
Number of pages: 8
Place of publication: Amsterdam
Publisher: IOS Press
Publication date: 2016
Pages: 122-129
ISBN (print): 978-1-61499-700-9
ISBN (electronic): 978-1-61499-701-6
DOI: https://doi.org/10.3233/978-1-61499-701-6-122
Status: Published - 2016
MoE publication type: A4 Article in conference proceedings
Event: International Conference Human Language Technologies: The Baltic Perspective - Riga, Latvia
Duration: 6 Oct 2016 to 7 Oct 2016
Conference number: 7

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Publisher: IOS Press
Number: 289
ISSN (print): 0922-6389
ISSN (electronic): 1879-8314

Fields of science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

Kettunen, K. T., Pääkkönen, T. A., & Koistinen, J. M. O. (2016). Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. In Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016) (pp. 122-129). (Frontiers in Artificial Intelligence and Applications; No. 289). Amsterdam: IOS Press. https://doi.org/10.3233/978-1-61499-701-6-122
Kettunen, Kimmo Tapio; Pääkkönen, Tuula Anneli; Koistinen, Jani Mika Olavi. / Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016). Amsterdam: IOS Press, 2016. pp. 122-129 (Frontiers in Artificial Intelligence and Applications; 289).
@inproceedings{8827f3dcefcd4c618d1933b47d6dfc59,
title = "Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers",
keywords = "113 Computer and information sciences, 6121 Languages",
author = "Kettunen, {Kimmo Tapio} and P{\"a}{\"a}kk{\"o}nen, {Tuula Anneli} and Koistinen, {Jani Mika Olavi}",
year = "2016",
doi = "10.3233/978-1-61499-701-6-122",
language = "English",
isbn = "978-1-61499-700-9",
series = "Frontiers in Artificial Intelligence and Applications",
publisher = "IOS PRESS",
number = "289",
pages = "122--129",
booktitle = "Human Language Technologies – The Baltic Perspective",
address = "Netherlands",

}

Kettunen, KT, Pääkkönen, TA & Koistinen, JMO 2016, Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. in Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016). Frontiers in Artificial Intelligence and Applications, no. 289, IOS Press, Amsterdam, pp. 122-129, International Conference Human Language Technologies: The Baltic Perspective, Riga, Latvia, 6 Oct 2016. https://doi.org/10.3233/978-1-61499-701-6-122


TY - GEN

T1 - Between Diachrony and Synchrony

T2 - Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers

AU - Kettunen, Kimmo Tapio

AU - Pääkkönen, Tuula Anneli

AU - Koistinen, Jani Mika Olavi

PY - 2016

Y1 - 2016

KW - 113 Computer and information sciences

KW - 6121 Languages

U2 - 10.3233/978-1-61499-701-6-122

DO - 10.3233/978-1-61499-701-6-122

M3 - Conference contribution

SN - 978-1-61499-700-9

T3 - Frontiers in Artificial Intelligence and Applications

SP - 122

EP - 129

BT - Human Language Technologies – The Baltic Perspective

PB - IOS PRESS

CY - Amsterdam

ER -

Kettunen KT, Pääkkönen TA, Koistinen JMO. Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. In Human Language Technologies – The Baltic Perspective: Proceedings of the 7th International Conference: Human Language Technologies – The Baltic Perspective (Baltic HLT 2016). Amsterdam: IOS Press. 2016. pp. 122-129. (Frontiers in Artificial Intelligence and Applications; 289). https://doi.org/10.3233/978-1-61499-701-6-122