Recognising Intertextuality in the Digital Corpus of Finnic Oral Poetry: Experiment with the Sampo Cycle

Research output: Article in book/report/conference proceedings, Conference article, Scientific, Peer-reviewed


While digital corpora have enabled new perspectives on the variation and continuums of human communication, they often pose problems related to implicit biases in the data and the limited reach of current methods in recognising similarity in linguistically complex data, especially in small languages. The digital corpus of historical Finnic oral poetry in alliterative tetrameter is characterised by significant poetic, linguistic and orthographic variation. At the extreme, a word may be written in hundreds of different ways. The current corpus comprises 189,189 poetic texts in six Finnic languages (Karelian, Ingrian, Votic, Estonian, Seto and Finnish) recorded in 1564–1957 by 5,287 recorders. It has a long curation history and a significant bias towards some genres, poetic forms and regions that collectors have preferred. In this poetic tradition, an idea is typically expressed with several parallel, partly alternative poetic lines or motifs, and similar verse types may be used in different contexts. Finding all the occurrences of widely used expressions or motifs in the corpus manually is an unattainable task. While digital tools, from simple queries to more advanced methods, make wider intertextual analyses possible, some of the relevant material typically remains out of reach. Thus, it becomes central to estimate the amount and quality of the relevant data that different methods fail to recognise. Here, we discuss two strategies for mapping intertextuality in the corpus: 1) proceeding with text queries and 2) recognising similar poetic lines computationally, based on string similarity. We compare these approaches with one another and then compare the results they yield with the existing type index and the results of manual early 20th-century research.
While the methodological and theoretical foundations of this type of research no longer hold, and while our further interest lies in intertextuality and variation rather than in the problematic concept of poem types, parts of the earlier analyses may be used in evaluating the performance of digital approaches.
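
The second strategy mentioned in the abstract, recognising similar poetic lines by string similarity, can be illustrated with a minimal sketch. It assumes cosine similarity over character bigram counts, which is one common way to tolerate orthographic variation and is not necessarily the measure used in the paper. The function names, the threshold value and the example spellings are hypothetical and chosen for illustration only; they are not drawn from the corpus.

```python
from collections import Counter
from math import sqrt


def char_bigrams(line: str) -> Counter:
    """Count character bigrams of a lower-cased line padded with spaces."""
    # Collapse runs of whitespace so spacing differences do not matter.
    text = f" {' '.join(line.lower().split())} "
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bigram count vectors of two lines."""
    va, vb = char_bigrams(a), char_bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_lines(query: str, lines: list[str], threshold: float = 0.7):
    """Return (score, line) pairs at or above the threshold, best first.

    The threshold of 0.7 is an arbitrary illustrative value.
    """
    scored = [(cosine_similarity(query, line), line) for line in lines]
    return sorted([pair for pair in scored if pair[0] >= threshold], reverse=True)


if __name__ == "__main__":
    # Hypothetical spelling variants of one verse plus an unrelated verse.
    query = "vaka vanha väinämöinen"
    candidates = [
        "vaka vanha väinämöini",
        "Vaka vanha Wäinämöinen",
        "tuo oli seppo ilmarinen",
    ]
    for score, line in similar_lines(query, candidates):
        print(f"{score:.2f}  {line}")
```

A plain text query would match only the exact spelling it was given, whereas a similarity threshold of this kind also retrieves nearby spellings. The price is that the threshold has to be tuned, and some relevant lines will still fall below it.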
Title: Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022)
Editors: Karl Berglund, Matti La Mela, Inge Zwart
Status: Published - 2022
Ministry of Education and Culture (OKM) publication type: A4 Article in conference proceedings
Event: Digital Humanities in the Nordic and Baltic Countries 6th Conference - Uppsala, Sweden
Duration: 15 Mar 2022 to 18 Mar 2022
Conference number: 6


Name: CEUR Workshop Proceedings
ISSN (electronic): 1613-0073


  • 6122 Literature studies
  • 113 Computer and information sciences
