Aalto University publication series

Language- and domain-independent text mining
Mari-Sanna Paukkeri
Doctoral dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Aalto University School of Science, for public examination and debate in Auditorium AS1 of the school on 9th November 2012 at 12 noon.

Aalto University
School of Science
Department of Information and Computer Science

Supervising professor
Prof. Erkki Oja

Thesis advisors
Doc. Timo Honkela
Dr. Mathias Creutz

Preliminary examiners
Dr. Reinhard Rapp, Johannes Gutenberg University Mainz, Germany
Dr. Roman Yangarber, University of Helsinki, Finland

Doc. Jussi Karlgren, Gavagai AB, Sweden

The field of natural language processing (NLP) has developed enormously in recent decades. The availability of a constantly increasing amount of textual data in electronic form has also accelerated the development of statistical methods for NLP, in which characteristics of natural languages are learned from large corpora. Statistical methods have shown their applicability in information retrieval, in which documents of various languages and domains are returned according to user queries; in statistical machine translation, which is easily applicable to new languages; in document clustering, which groups semantically similar documents; and in many information extraction tasks, including keyphrase extraction, document summarization, and the discovery of linguistic features. However, the majority of NLP research, including many statistical methods, concentrates on the English language and uses various language-specific tools and resources, such as part-of-speech taggers and ontologies, which are not directly applicable to other languages. ...

