Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission

Simon Hengchen, Mathias Coeckelbergs, Seth van Hooland, Ruben Verborgh, Thomas Steiner

Research output: Article in book/report/conference proceedings, conference article, scientific, peer-reviewed

Description

Topic Modelling (TM) has gained momentum over the last few years within the humanities to analyze topics represented in large volumes of full text. This paper proposes an experiment with the usage of TM based on a large subset of digitized archival holdings of the European Commission (EC). Currently, millions of scanned and OCRed files are available and hold the potential to significantly change the way historians of the construction and evolution of the European Union can perform their research. However, due to a lack of resources, only minimal metadata are available on a file and document level, seriously undermining the accessibility of this archival collection. The article explores in an empirical manner the possibilities and limits of TM to automatically extract key concepts from a large body of documents spanning multiple decades. By mapping the topics to headings of the EUROVOC thesaurus, the proof of concept described in this paper offers the future possibility to represent the identified topics with the help of a hierarchical search interface for end-users.
Original language: English
Title: 2016 IEEE International Conference on Big Data
Place of publication: Washington, DC, USA
Publisher: IEEE
Publication date: 5 December 2016
ISBN (print): 978-1-4673-9006-4
ISBN (electronic): 978-1-4673-9005-7
Status: Published - 5 December 2016
OKM publication type: A4 Article in conference proceedings

Fields of science

  • 6121 Languages
  • 113 Computer and information sciences

Cite this

Hengchen, S., Coeckelbergs, M., van Hooland, S., Verborgh, R., & Steiner, T. (2016). Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission. In 2016 IEEE International Conference on Big Data. Washington, DC, USA: IEEE.
Hengchen, Simon ; Coeckelbergs, Mathias ; van Hooland, Seth ; Verborgh, Ruben ; Steiner, Thomas. / Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission. 2016 IEEE International Conference on Big Data. Washington, DC, USA : IEEE, 2016.
@inproceedings{5170113f09024c67aae3bdfc4f4c15b5,
title = "Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission",
abstract = "Topic Modelling (TM) has gained momentum over the last few years within the humanities to analyze topics represented in large volumes of full text. This paper proposes an experiment with the usage of TM based on a large subset of digitized archival holdings of the European Commission (EC). Currently, millions of scanned and OCRed files are available and hold the potential to significantly change the way historians of the construction and evolution of the European Union can perform their research. However, due to a lack of resources, only minimal metadata are available on a file and document level, seriously undermining the accessibility of this archival collection. The article explores in an empirical manner the possibilities and limits of TM to automatically extract key concepts from a large body of documents spanning multiple decades. By mapping the topics to headings of the EUROVOC thesaurus, the proof of concept described in this paper offers the future possibility to represent the identified topics with the help of a hierarchical search interface for end-users.",
keywords = "6121 Languages, 113 Computer and information sciences, library science, archival science",
author = "Simon Hengchen and Mathias Coeckelbergs and {van Hooland}, Seth and Ruben Verborgh and Thomas Steiner",
year = "2016",
month = "12",
day = "5",
language = "English",
isbn = "978-1-4673-9006-4",
booktitle = "2016 IEEE International Conference on Big Data",
publisher = "IEEE",
address = "International",

}

Hengchen, S, Coeckelbergs, M, van Hooland, S, Verborgh, R & Steiner, T 2016, Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission. in 2016 IEEE International Conference on Big Data. IEEE, Washington, DC, USA.

Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission. / Hengchen, Simon; Coeckelbergs, Mathias; van Hooland, Seth; Verborgh, Ruben; Steiner, Thomas.

2016 IEEE International Conference on Big Data. Washington, DC, USA : IEEE, 2016.

TY - GEN

T1 - Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission

AU - Hengchen, Simon

AU - Coeckelbergs, Mathias

AU - van Hooland, Seth

AU - Verborgh, Ruben

AU - Steiner, Thomas

PY - 2016/12/5

Y1 - 2016/12/5

N2 - Topic Modelling (TM) has gained momentum over the last few years within the humanities to analyze topics represented in large volumes of full text. This paper proposes an experiment with the usage of TM based on a large subset of digitized archival holdings of the European Commission (EC). Currently, millions of scanned and OCRed files are available and hold the potential to significantly change the way historians of the construction and evolution of the European Union can perform their research. However, due to a lack of resources, only minimal metadata are available on a file and document level, seriously undermining the accessibility of this archival collection. The article explores in an empirical manner the possibilities and limits of TM to automatically extract key concepts from a large body of documents spanning multiple decades. By mapping the topics to headings of the EUROVOC thesaurus, the proof of concept described in this paper offers the future possibility to represent the identified topics with the help of a hierarchical search interface for end-users.

KW - 6121 Languages

KW - 113 Computer and information sciences

KW - library science

KW - archival science

M3 - Conference contribution

SN - 978-1-4673-9006-4

BT - 2016 IEEE International Conference on Big Data

PB - IEEE

CY - Washington, DC, USA

ER -

Hengchen S, Coeckelbergs M, van Hooland S, Verborgh R, Steiner T. Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission. In 2016 IEEE International Conference on Big Data. Washington, DC, USA: IEEE. 2016