Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means

Forskningsoutput: Kapitel i bok/rapport/konferenshandlingKonferensbidragVetenskapligPeer review

Sammanfattning

The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001). This collection contains approximately 1.95 million pages in Finnish and Swedish. Finnish part of the collection consists of about 2.39 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of this material is also available freely downloadable in The Language Bank of Finland provided by the Fin-CLARIN consortium. The collection can also be accessed through the Korp environment that has been developed by Språkbanken at the University of Gothenburg and extended by FIN-CLARIN team at the University of Helsinki to provide concordances of text resources. A Cranfield-style information retrieval test collection has been produced out of a small part of the Digi newspaper material at the University of Tampere (Järvelin et al., 2015). The quality of the OCRed collections is an important topic in digital humanities, as it affects general usability and searchability of collections. There is no single available method to assess the quality of large collections, but different methods can be used to approximate the quality. This paper discusses different corpus analysis style ways to approximate the overall lexical quality of the Finnish part of the Digi collection.
Originalspråkengelska
Titel på gästpublikationProceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Antal sidor6
FörlagEuropean Language Resources Association (ELRA)
Utgivningsdatummaj 2016
Sidor956-961
ISBN (tryckt)978-2-9517408-9-1
StatusPublicerad - maj 2016
MoE-publikationstypA4 Artikel i en konferenspublikation
EvenemangWorkshop on language engineering for online reputation management - Portoroz, Slovenien
Varaktighet: 25 maj 201627 maj 2016
Konferensnummer: 10 (LREC)

Vetenskapsgrenar

  • 6121 Språkvetenskaper
  • 113 Data- och informationsvetenskap

Citera det här

Kettunen, K. T., & Pääkkönen, T. A. (2016). Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. I Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (s. 956-961). European Language Resources Association (ELRA).
Kettunen, Kimmo Tapio ; Pääkkönen, Tuula Anneli. / Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), 2016. s. 956-961
@inproceedings{a40e3112127f48bd9043b6da8529f741,
title = "Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means",
abstract = "The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001). This collection contains approximately 1.95 million pages in Finnish and Swedish. Finnish part of the collection consists of about 2.39 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of this material is also available freely downloadable in The Language Bank of Finland provided by the Fin-CLARIN consortium. The collection can also be accessed through the Korp environment that has been developed by Spr{\aa}kbanken at the University of Gothenburg and extended by FIN-CLARIN team at the University of Helsinki to provide concordances of text resources. A Cranfield-style information retrieval test collection has been produced out of a small part of the Digi newspaper material at the University of Tampere (J{\"a}rvelin et al., 2015). The quality of the OCRed collections is an important topic in digital humanities, as it affects general usability and searchability of collections. There is no single available method to assess the quality of large collections, but different methods can be used to approximate the quality. This paper discusses different corpus analysis style ways to approximate the overall lexical quality of the Finnish part of the Digi collection.",
keywords = "6121 Languages, 113 Computer and information sciences",
author = "Kettunen, {Kimmo Tapio} and P{\"a}{\"a}kk{\"o}nen, {Tuula Anneli}",
note = "Volume: Proceeding volume:",
year = "2016",
month = "5",
language = "English",
isbn = "978-2-9517408-9-1",
pages = "956--961",
booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)",
publisher = "European Language Resources Association (ELRA)",
address = "International",

}

Kettunen, KT & Pääkkönen, TA 2016, Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. i Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), s. 956-961, Workshop on language engineering for online reputation management, Portoroz, Slovenien, 25/05/2016.

Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. / Kettunen, Kimmo Tapio; Pääkkönen, Tuula Anneli.

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), 2016. s. 956-961.

Forskningsoutput: Kapitel i bok/rapport/konferenshandlingKonferensbidragVetenskapligPeer review

TY - GEN

T1 - Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means

AU - Kettunen, Kimmo Tapio

AU - Pääkkönen, Tuula Anneli

N1 - Volume: Proceeding volume:

PY - 2016/5

Y1 - 2016/5

N2 - The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001). This collection contains approximately 1.95 million pages in Finnish and Swedish. Finnish part of the collection consists of about 2.39 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of this material is also available freely downloadable in The Language Bank of Finland provided by the Fin-CLARIN consortium. The collection can also be accessed through the Korp environment that has been developed by Språkbanken at the University of Gothenburg and extended by FIN-CLARIN team at the University of Helsinki to provide concordances of text resources. A Cranfield-style information retrieval test collection has been produced out of a small part of the Digi newspaper material at the University of Tampere (Järvelin et al., 2015). The quality of the OCRed collections is an important topic in digital humanities, as it affects general usability and searchability of collections. There is no single available method to assess the quality of large collections, but different methods can be used to approximate the quality. This paper discusses different corpus analysis style ways to approximate the overall lexical quality of the Finnish part of the Digi collection.

AB - The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001). This collection contains approximately 1.95 million pages in Finnish and Swedish. Finnish part of the collection consists of about 2.39 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of this material is also available freely downloadable in The Language Bank of Finland provided by the Fin-CLARIN consortium. The collection can also be accessed through the Korp environment that has been developed by Språkbanken at the University of Gothenburg and extended by FIN-CLARIN team at the University of Helsinki to provide concordances of text resources. A Cranfield-style information retrieval test collection has been produced out of a small part of the Digi newspaper material at the University of Tampere (Järvelin et al., 2015). The quality of the OCRed collections is an important topic in digital humanities, as it affects general usability and searchability of collections. There is no single available method to assess the quality of large collections, but different methods can be used to approximate the quality. This paper discusses different corpus analysis style ways to approximate the overall lexical quality of the Finnish part of the Digi collection.

KW - 6121 Languages

KW - 113 Computer and information sciences

UR - http://www.lrec-conf.org/proceedings/lrec2016/index.html

M3 - Conference contribution

SN - 978-2-9517408-9-1

SP - 956

EP - 961

BT - Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

PB - European Language Resources Association (ELRA)

ER -

Kettunen KT, Pääkkönen TA. Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. I Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA). 2016. s. 956-961