Information Retrieval From Historical Newspaper Collections in Highly Inflectional Languages: A Query Expansion Approach

Anni Järvelin, Heikki Keskustalo, Eero Sormunen, Miamaria Saastamoinen, Kimmo Tapio Kettunen

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

The aim of the study was to test whether query expansion
by approximate string matching methods is beneficial
in retrieval from historical newspaper collections in
a language rich with compounds and inflectional forms
(Finnish). First, approximate string matching methods
were used to generate lists of index words most similar
to contemporary query terms in a digitized newspaper
collection from the 1800s. Top index word variants were
categorized to estimate the appropriate query expansion
ranges in the retrieval test. Second, the effectiveness of
approximate string matching methods, automatically
generated inflectional forms, and their combinations
were measured in a Cranfield-style test. Finally, a
detailed topic-level analysis of test results was conducted.
In the index of historical newspaper collection
the occurrences of a word typically spread to many
linguistic and historical variants along with optical character
recognition (OCR) errors. All query expansion
methods improved the baseline results. Extensive
expansion of around 30 variants for each query word
was required to achieve the highest performance
improvement. Query expansion based on approximate
string matching was superior to using the inflectional
forms of the query words, showing that coverage of the
different types of variation is more important than precision
in handling one type of variation.
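
As an illustration only (not the authors' implementation), query expansion by approximate string matching can be sketched as follows. The sketch assumes character-bigram Dice similarity as the matching method and a tiny invented index; the function names, the sample words, and the choice of similarity measure are all assumptions. Only the idea of keeping roughly 30 top-ranked index variants per query word comes from the abstract.

```python
from collections import Counter


def bigrams(word: str) -> Counter:
    """Multiset of character bigrams, padded so that word edges count."""
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram multisets (0.0 to 1.0)."""
    ga, gb = bigrams(a), bigrams(b)
    overlap = sum((ga & gb).values())  # multiset intersection size
    return 2 * overlap / (sum(ga.values()) + sum(gb.values()))


def expand_query_term(term: str, index_words: list[str], k: int = 30) -> list[str]:
    """Return the k index words most similar to the query term.

    Ties are broken alphabetically so the ranking is deterministic.
    k=30 mirrors the expansion range reported in the abstract.
    """
    ranked = sorted(index_words, key=lambda w: (-dice(term, w), w))
    return ranked[:k]


# Hypothetical index: inflected forms and OCR-like misspellings of
# "kaupunki" mixed with unrelated words (all invented for this sketch).
sample_index = [
    "kaupungin", "kaupunkl", "kanpunki", "talossa", "kaupungissa", "wirka",
]

if __name__ == "__main__":
    for variant in expand_query_term("kaupunki", sample_index, k=3):
        print(variant)
```

A single similarity ranking of this kind picks up inflected forms, spelling variants, and OCR errors alike, which is the breadth of coverage the abstract credits for the method's advantage over generated inflectional forms.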
Original language: English
Journal: Journal of the Association for Information Science and Technology
Volume: 67
Issue number: 12
Pages (from-to): 2928–2946
Number of pages: 18
ISSN: 2330-1635
DOIs: 10.1002/asi.23379
Publication status: Published - Nov 2016
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 113 Computer and information sciences

Cite this

@article{60d0069d2bec4af3a3623eef1e76cf4f,
title = "Information Retrieval From Historical Newspaper Collections in Highly Inflectional Languages: A Query Expansion Approach",
abstract = "The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich with compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. Top index word variants were categorized to estimate the appropriate query expansion ranges in the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations were measured in a Cranfield-style test. Finally, a detailed topic-level analysis of test results was conducted. In the index of historical newspaper collection the occurrences of a word typically spread to many linguistic and historical variants along with optical character recognition (OCR) errors. All query expansion methods improved the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.",
keywords = "113 Computer and information sciences",
author = "Anni J{\"a}rvelin and Heikki Keskustalo and Eero Sormunen and Miamaria Saastamoinen and Kettunen, {Kimmo Tapio}",
year = "2016",
month = "11",
doi = "10.1002/asi.23379",
language = "English",
volume = "67",
pages = "2928--2946",
journal = "Journal of the Association for Information Science and Technology",
issn = "2330-1635",
publisher = "John Wiley \& Sons, Ltd",
number = "12",

}

Information Retrieval From Historical Newspaper Collections in Highly Inflectional Languages: A Query Expansion Approach. / Järvelin, Anni; Keskustalo, Heikki; Sormunen, Eero; Saastamoinen, Miamaria; Kettunen, Kimmo Tapio.

In: Journal of the Association for Information Science and Technology, Vol. 67, No. 12, 11.2016, p. 2928–2946.

TY - JOUR

T1 - Information Retrieval From Historical Newspaper Collections in Highly Inflectional Languages: A Query Expansion Approach

AU - Järvelin, Anni

AU - Keskustalo, Heikki

AU - Sormunen, Eero

AU - Saastamoinen, Miamaria

AU - Kettunen, Kimmo Tapio

PY - 2016/11

Y1 - 2016/11

AB - The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich with compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. Top index word variants were categorized to estimate the appropriate query expansion ranges in the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations were measured in a Cranfield-style test. Finally, a detailed topic-level analysis of test results was conducted. In the index of historical newspaper collection the occurrences of a word typically spread to many linguistic and historical variants along with optical character recognition (OCR) errors. All query expansion methods improved the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.

KW - 113 Computer and information sciences

U2 - 10.1002/asi.23379

DO - 10.1002/asi.23379

M3 - Article

VL - 67

SP - 2928

EP - 2946

JO - Journal of the Association for Information Science and Technology

JF - Journal of the Association for Information Science and Technology

SN - 2330-1635

IS - 12

ER -