Departing from what has been done to date in the hybrid field of natural language processing (NLP), this doctoral thesis argues that the approach developed below makes it possible to detect semantic changes semi-automatically in digitised, OCRed, historical corpora. We define the term semi-automatic as “making use of an advanced tool whilst remaining in control of key decisions regarding the processing of the corpus”. The tool used, “topic modelling” and more precisely “Latent Dirichlet Allocation” (LDA), is not unknown in NLP or computational historical semantics, where it is already used to follow words selected a priori and to try to detect when these words change meaning. To the best of our knowledge, however, it has never been used to detect which words change in a humanistically relevant way. In other words, our method does not study a word in context to gather information on that specific word; it studies the whole context, which we consider a witness to a potential evolution of reality, to gather more contextual information on one or several particular semantic shift candidates. To detect these semantic changes, we use the algorithm to create lexical fields: groups of words that together define a subject to which they all relate. By comparing lexical fields over different time periods of the same corpus (that is, by taking a diachronic approach), we try to determine whether new words appear in a field over time. We argue that if a word starts to be used in a certain context at a certain time, it is a likely candidate for semantic change. Of course, the method developed here and illustrated by a case study applies to a specific context: that of digitised, OCRed, historical archives in Dutch. Nevertheless, this doctoral work also describes the advantages and disadvantages of the algorithm and postulates, on the basis of this evaluation, that the method is applicable to other fields, under other conditions.
By critically evaluating the tools available and used, this doctoral thesis invites the community to reproduce the method, whilst pointing out obvious limitations of the approach and proposing ways to address them.
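The diachronic comparison of lexical fields described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes the LDA step has already been run separately for each time period and has produced, for each topic, a set of top words (a "lexical field"). The function names, the Jaccard-overlap matching of fields across periods, the `min_overlap` threshold, and the Dutch toy words are all assumptions made for illustration only.

```python
from typing import List, Set, Tuple

# A lexical field is modelled here as the set of top words of one topic.
LexicalField = Set[str]


def jaccard(a: LexicalField, b: LexicalField) -> float:
    """Overlap between two lexical fields (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def new_words_per_field(
    earlier: List[LexicalField],
    later: List[LexicalField],
    min_overlap: float = 0.3,  # assumed threshold, not taken from the thesis
) -> List[Tuple[int, int, Set[str]]]:
    """Pair each later field with its most similar earlier field and
    return (earlier_index, later_index, words only in the later field)."""
    results = []
    if not earlier:
        return results
    for j, late in enumerate(later):
        # Pick the earlier field with the highest overlap with this later field.
        i, best = max(enumerate(earlier), key=lambda pair: jaccard(pair[1], late))
        # Only compare fields that plausibly describe the same subject.
        if jaccard(best, late) >= min_overlap:
            results.append((i, j, late - best))
    return results


# Hypothetical toy fields standing in for LDA output on two periods.
period_1 = [
    {"trein", "spoor", "station", "reiziger"},
    {"oorlog", "leger", "soldaat", "front"},
]
period_2 = [
    {"trein", "spoor", "station", "net"},
    {"markt", "prijs", "handel", "koopman"},
]

candidates = new_words_per_field(period_1, period_2)
for earlier_idx, later_idx, words in candidates:
    print(earlier_idx, later_idx, sorted(words))
```

In keeping with the semi-automatic definition given above, the words returned by such a comparison are only candidates: deciding whether a newly appearing word reflects a genuine semantic change remains a human decision.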
|Date of award||6 Dec 2017|
|Status||Published - 6 Dec 2017|
|MoE publication type||G4 Doctoral dissertation (monograph)|
- 6160 Other humanities
- 6121 Languages
- 113 Computer and information sciences