Msimulizi as corpus for accurate search

Arvi Hurskainen

Research output: Working paper (Scientific)


In Technical Report 602, I described the process of converting printed text into machine-readable form. This report extends that work and describes and demonstrates in more detail the capabilities of a search system based on analysed text. All the Msimulizi material (years 1888-1896) available on the SOAS web page was processed into machine-readable form, including manual editing of the whole text. A second round of editing was based on computational analysis, which points out the remaining scanning mistakes. The clean text was then converted into an analysed format, which is optimal for information retrieval. The report focuses on search tasks that are hardly possible with conventional string search because of the complex word structure of Swahili.
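To illustrate why analysed text helps, here is a minimal sketch contrasting plain string search with search over lemmas. The token layout (surface form paired with lemma) and the function names are assumptions made for illustration only; they are not the analysed format or the tools used in the report.

```python
# Hypothetical analysed tokens: (surface form, lemma).
# The real analysed format of the report is not shown here; this layout is assumed.
ANALYSED = [
    ("watu", "mtu"),        # plural of "mtu" (person)
    ("walisoma", "soma"),   # "they read" (past)
    ("vitabu", "kitabu"),   # plural of "kitabu" (book)
    ("mtu", "mtu"),
    ("anasoma", "soma"),    # "he/she is reading"
    ("kitabu", "kitabu"),
]


def string_search(tokens, query):
    """Conventional search: keep surface forms that contain the query string."""
    return [surface for surface, _ in tokens if query in surface]


def lemma_search(tokens, lemma):
    """Search on analysed text: keep surface forms whose lemma matches."""
    return [surface for surface, tok_lemma in tokens if tok_lemma == lemma]


if __name__ == "__main__":
    # String search misses the plural "watu" because the class prefix changes.
    print(string_search(ANALYSED, "mtu"))
    # Lemma search finds both singular and plural forms.
    print(lemma_search(ANALYSED, "mtu"))
```

Because Swahili marks noun class and number with prefixes, the singular and plural of a noun share no common leading string, so only the lemma-based lookup retrieves both.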
Original language: English
Place of Publication: Helsinki
Publisher: University of Helsinki, Institute for Asian and African Studies
Number of pages: 20
Publication status: Published - Oct 2020
MoE publication type: D4 Published development or research report or study

Fields of Science

  • 6121 Languages
  • 113 Computer and information sciences
