Information Retrieval From Historical Newspaper Collections in Highly Inflectional Languages: A Query Expansion Approach

Anni Järvelin, Heikki Keskustalo, Eero Sormunen, Miamaria Saastamoinen, Kimmo Tapio Kettunen

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

The aim of the study was to test whether query expansion
by approximate string matching methods is beneficial
in retrieval from historical newspaper collections in
a language rich with compounds and inflectional forms
(Finnish). First, approximate string matching methods
were used to generate lists of index words most similar
to contemporary query terms in a digitized newspaper
collection from the 1800s. Top index word variants were
categorized to estimate the appropriate query expansion
ranges in the retrieval test. Second, the effectiveness of
approximate string matching methods, automatically
generated inflectional forms, and their combinations
was measured in a Cranfield-style test. Finally, a
detailed topic-level analysis of test results was conducted.
In the index of the historical newspaper collection,
the occurrences of a word were typically spread across many
linguistic and historical variants, along with optical character
recognition (OCR) errors. All query expansion
methods improved the baseline results. Extensive
expansion of around 30 variants for each query word
was required to achieve the highest performance
improvement. Query expansion based on approximate
string matching was superior to using the inflectional
forms of the query words, showing that coverage of the
different types of variation is more important than precision
in handling one type of variation.
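
The expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the specific approximate string matching methods used, so character-bigram Dice similarity is assumed here as a stand-in. The function names, the sample index words, and the default of k=30 (chosen to echo the "around 30 variants" figure) are illustrative assumptions.

```python
from collections import Counter


def bigrams(word):
    """Return a multiset of character bigrams, with '#' marking word boundaries."""
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def dice(a, b):
    """Dice coefficient over character bigrams (assumed similarity measure)."""
    ga, gb = bigrams(a), bigrams(b)
    overlap = sum((ga & gb).values())  # size of the multiset intersection
    total = sum(ga.values()) + sum(gb.values())
    return 2 * overlap / total if total else 0.0


def expand_query(term, index_words, k=30):
    """Return the k index words most similar to a contemporary query term."""
    ranked = sorted(index_words, key=lambda w: dice(term, w), reverse=True)
    return ranked[:k]


# Hypothetical index entries: an inflected form, an OCR-style misspelling,
# and unrelated words.
index_words = ["sanomalehti", "sanomalehden", "sanomalehtl", "koulu", "wapaus"]
variants = expand_query("sanomalehti", index_words, k=3)
```

In a retrieval run, the returned variants would be added to the query alongside the original term, so that inflected, historical, and OCR-damaged spellings in the index can all match.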
Original language: English
Journal: Journal of the Association for Information Science and Technology
Volume: 67
Issue number: 12
Pages (from-to): 2928–2946
Number of pages: 18
ISSN: 2330-1635
Publication status: Published - Nov 2016
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 113 Computer and information sciences
