Exporting Finnish Digitized Historical Newspaper Contents for Offline Use

Tuula Anneli Pääkkönen, Jukka Kervinen, Asko Nivala, Kimmo Tapio Kettunen, Eetu Mäkelä

Research output: Contribution to journal, Article, Scientific, peer-review

Abstract

The digital collections of the National Library of Finland (NLF) contain over 10 million pages of historical newspapers, journals and some technical ephemera. The material ranges from the earliest Finnish newspapers of 1771 to the present day. The material up to 1910 can be viewed in the public web service, whereas anything later is available at the six legal deposit libraries in Finland. A recent user study found that various kinds of research use are among the key uses of the collection. The National Library of Finland has received several requests to provide the content of the digital collections as a single offline bundle that includes all the needed content. For this purpose we introduced a new format, which contains three different information sets: the full metadata of a publication page, the actual page content as ALTO XML, and the raw text content. We consider these formats the most useful to provide as raw data for researchers. In this paper we describe how the export format was created, how other parties have packaged the same data, and what the benefits of the current approach are. We also briefly discuss the word-level quality of the content and show a real research scenario for the data.
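
As a rough illustration of how two of the three information sets named in the abstract relate to each other, the sketch below derives raw text from ALTO XML. It is not the authors' export code. The sample fragment, the ALTO v2 namespace, and the names `SAMPLE_ALTO`, `local_name` and `alto_to_text` are assumptions made for this example; only the ALTO element and attribute names (`TextLine`, `String`, `CONTENT`) come from the ALTO schema itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment; not taken from the NLF export.
# The namespace version used by the real export is an assumption here.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Suomen"/>
            <SP/>
            <String CONTENT="Sanomat"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Helsinki"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Join String/@CONTENT values, one output line per TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only String children carry recognized words; SP and HYP are skipped.
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    print(alto_to_text(SAMPLE_ALTO))
```

Matching on local element names rather than a fixed namespace is a deliberate simplification, so the same function can be tried against files that declare a different ALTO schema version.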
Original language: English
Journal: D-Lib Magazine
Volume: 22
Issue number: 7/8 2016
ISSN: 1082-9873
DOI: 10.1045/july2016-paakkonen
Publication status: Published - Jul 2016
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 518 Media and communications

Cite this

@article{dd93a8667afb449195740d4b27f6489f,
title = "Exporting Finnish Digitized Historical Newspaper Contents for Offline Use",
abstract = "Digital collections of the National Library of Finland (NLF) contain over 10 million pages of historical newspapers, journals and some technical ephemera. The material ranges from the early Finnish newspapers from 1771 until the present day. The material up to 1910 can be viewed in the public web service, where as anything later is available at the six legal deposit libraries in Finland. A recent user study noticed that a different type of researcher use is one of the key uses of the collection. National Library of Finland has gotten several requests to provide the content of the digital collections as one offline bundle, where all the needed content is included. For this purpose we introduced a new format, which contains three different information sets: the full metadata of a publication page, the actual page content as ALTO XML, and the raw text content. We consider these formats most useful to be provided as raw data for the researchers. In this paper we will describe how the export format was created, how other parties have packaged the same data and what the benefits are of the current approach. We shall also briefly discuss word level quality of the content and show a real research scenario for the data.",
keywords = "518 Media and communications",
author = "P{\"a}{\"a}kk{\"o}nen, {Tuula Anneli} and Jukka Kervinen and Asko Nivala and Kettunen, {Kimmo Tapio} and Eetu M{\"a}kel{\"a}",
year = "2016",
month = "7",
doi = "10.1045/july2016-paakkonen",
language = "English",
volume = "22",
journal = "D-Lib Magazine",
issn = "1082-9873",
publisher = "Corp. for National Research Initiatives",
number = "7/8",
}

Exporting Finnish Digitized Historical Newspaper Contents for Offline Use. / Pääkkönen, Tuula Anneli; Kervinen, Jukka; Nivala, Asko; Kettunen, Kimmo Tapio; Mäkelä, Eetu.

In: D-Lib Magazine, Vol. 22, No. 7/8 2016, 07.2016.

TY - JOUR

T1 - Exporting Finnish Digitized Historical Newspaper Contents for Offline Use

AU - Pääkkönen, Tuula Anneli

AU - Kervinen, Jukka

AU - Nivala, Asko

AU - Kettunen, Kimmo Tapio

AU - Mäkelä, Eetu

PY - 2016/7

Y1 - 2016/7

N2 - Digital collections of the National Library of Finland (NLF) contain over 10 million pages of historical newspapers, journals and some technical ephemera. The material ranges from the early Finnish newspapers from 1771 until the present day. The material up to 1910 can be viewed in the public web service, where as anything later is available at the six legal deposit libraries in Finland. A recent user study noticed that a different type of researcher use is one of the key uses of the collection. National Library of Finland has gotten several requests to provide the content of the digital collections as one offline bundle, where all the needed content is included. For this purpose we introduced a new format, which contains three different information sets: the full metadata of a publication page, the actual page content as ALTO XML, and the raw text content. We consider these formats most useful to be provided as raw data for the researchers. In this paper we will describe how the export format was created, how other parties have packaged the same data and what the benefits are of the current approach. We shall also briefly discuss word level quality of the content and show a real research scenario for the data.

AB - Digital collections of the National Library of Finland (NLF) contain over 10 million pages of historical newspapers, journals and some technical ephemera. The material ranges from the early Finnish newspapers from 1771 until the present day. The material up to 1910 can be viewed in the public web service, where as anything later is available at the six legal deposit libraries in Finland. A recent user study noticed that a different type of researcher use is one of the key uses of the collection. National Library of Finland has gotten several requests to provide the content of the digital collections as one offline bundle, where all the needed content is included. For this purpose we introduced a new format, which contains three different information sets: the full metadata of a publication page, the actual page content as ALTO XML, and the raw text content. We consider these formats most useful to be provided as raw data for the researchers. In this paper we will describe how the export format was created, how other parties have packaged the same data and what the benefits are of the current approach. We shall also briefly discuss word level quality of the content and show a real research scenario for the data.

KW - 518 Media and communications

UR - http://www.dlib.org/dlib/july16/paakkonen/07paakkonen.html

U2 - 10.1045/july2016-paakkonen

DO - 10.1045/july2016-paakkonen

M3 - Article

VL - 22

JO - D-Lib Magazine

JF - D-Lib Magazine

SN - 1082-9873

IS - 7/8 2016

ER -