Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods

Kimmo Kettunen, Timo Honkela, Krister Linden, Pekka Kauppinen, Tuula Pääkkönen, Jukka Kervinen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the OCR output is limited, especially in the oldest parts of the collection. Approaches such as crowd-sourcing have been used in this field to improve the quality of the texts, but in this case the volume of the materials makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objective is a clear reduction in the human effort needed in the post-processing of the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieving quality improvements.
Original language: English
Title of host publication: IFLA World Library and Information Congress Proceedings: 80th IFLA General Conference and Assembly
Number of pages: 23
Place of Publication: Lyon, France
Publisher: IFLA
Publication date: 16 Aug 2014
Publication status: Published - 16 Aug 2014
MoE publication type: A4 Article in conference proceedings
Event: IFLA World Library and Information Congress - Lyon, France
Duration: 16 Aug 2014 to 22 Aug 2014
Conference number: 80

Fields of Science

  • 6121 Languages

Cite this

Kettunen, K., Honkela, T., Linden, K., Kauppinen, P., Pääkkönen, T., & Kervinen, J. (2014). Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods. In IFLA World Library and Information Congress Proceedings: 80th IFLA General Conference and Assembly Lyon, France: IFLA.

