Transliteration Model for Egyptian Words

Activity: Talk or presentation types: Oral presentation

Description

When Egyptologists interpret hieroglyphic texts, they transliterate them with Latin letters and diacritics. A transliteration stored in, for example, a plain text file is machine-readable. However, a transliteration is always an interpretation of the text, and producing one is slow work that requires consulting dictionaries and sign lists. Hence, few hieroglyphic texts are openly available in machine-readable form, whether as transliterations or in any other form. Computer-assisted transliteration of hieroglyphic texts would speed up the production of texts for digital studies, and there have been some attempts to develop such tools [1; 2]. Since no working automatic transliteration system exists yet, building one is our future aim. The transliteration method under development is based on a back-off scheme, which at its core uses a language model of hieroglyphic words and their transliterations together with the observed relative frequencies of the pairs. In this paper, we describe the model and how we created it using an automatic alignment method that we devised on the basis of a widely used sequence alignment algorithm.
To create such a model for transliteration, a corpus of machine-readable Egyptian hieroglyphic texts with their transliterations is needed. Producing hieroglyphic text with computers is not trivial, as a small sign can be placed underneath another or, for example, nested within a bigger one. Since the 1970s, Egyptologists have used special text editors to encode hieroglyphic texts so that the placement of the signs is maintained [3]. The encoding most often used in such editors is the so-called Manuel de Codage (MdC). Encoded hieroglyphic texts are machine-readable, but only the images produced from them are published [4]. We have identified two sources in which encoded Egyptian hieroglyphs are available paired with their transliterations. The Thesaurus Linguae Aegyptiae (TLA) includes a collection of texts in which c. 280,000 Egyptian words encoded in MdC have been aligned with their transliteration counterparts [5; 6]. The second, even more extensive, source is the Ramses Transliteration Corpus (RTC) with almost 500,000 MdC-encoded words [7]. The RTC consists of encoded hieroglyphic sentences, each on its own line, and the respective transliteration lines in another file. However, unlike the TLA, the RTC has no ready word-level alignment between the MdC and its transliteration.
Original hieroglyphic texts do not include word boundaries. However, because the RTC data has been made available for word searches online, it contains not only texts without word boundaries but also separate versions of the files in which the encoded words are separated with underscores. To find word-transliteration pairs, we align the sentences of encoded words with the respective transliterations. The alignment task is made more difficult by the fact that many of the texts contain damaged parts. In many places, a possible transliteration exists for these damaged parts, whether individual signs or longer passages. These guesses have been marked in a variety of ways because the transliterations have been produced by numerous scholars. Mark-Jan Nederhof previously attempted to align hieroglyphic texts using a customized scoring system that assigns penalties to different readings [8]. Our alignment method uses the Needleman-Wunsch sequence alignment algorithm [9] together with a dictionary of MdC-transliteration pairs generated initially from the intact words within the TLA and the completely intact lines in the RTC.
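As a rough illustration of how a word-level Needleman-Wunsch alignment can be guided by a dictionary of known pairs, the following minimal Python sketch aligns the words of one underscore-separated MdC line with the words of its transliteration line. It is not the authors' implementation: the function name, the score values (MATCH, MISMATCH, GAP), the whitespace splitting of the transliteration and the sample strings are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed scores for illustration only; the abstract does not state any values.
MATCH, MISMATCH, GAP = 2, -1, -1

Pair = Tuple[Optional[str], Optional[str]]


def align_words(mdc_words: List[str],
                translit_words: List[str],
                known_pairs: Dict[str, Set[str]]) -> List[Pair]:
    """Globally align two word sequences with Needleman-Wunsch.

    A pair scores MATCH when it is already in the dictionary of known
    MdC-transliteration pairs, otherwise MISMATCH. None marks a gap.
    """
    n, m = len(mdc_words), len(translit_words)

    def score(i: int, j: int) -> int:
        known = known_pairs.get(mdc_words[i], set())
        return MATCH if translit_words[j] in known else MISMATCH

    # Fill the dynamic-programming table.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * GAP
    for j in range(1, m + 1):
        table[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(i - 1, j - 1),  # pair the words
                table[i - 1][j] + GAP,  # MdC word without a counterpart
                table[i][j - 1] + GAP,  # transliteration without a counterpart
            )

    # Trace back from the bottom-right corner to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + score(i - 1, j - 1):
            pairs.append((mdc_words[i - 1], translit_words[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + GAP:
            pairs.append((mdc_words[i - 1], None))
            i -= 1
        else:
            pairs.append((None, translit_words[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Toy strings, not taken from the TLA or the RTC.
known = {"i-w": {"jw"}, "m": {"m"}}
mdc_line = "i-w_s-Dm-m_m"
translit_line = "jw sDm m"
aligned = align_words(mdc_line.split("_"), translit_line.split(), known)
```

In this toy case the two dictionary words anchor the alignment, so the middle word, which is not in the dictionary, ends up paired with the transliteration word in the same position. A pair found this way could then be added to the dictionary before further lines are aligned.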
After aligning the partially intact lines of the RTC, we extract the words from them and generate the MdC-transliteration model with frequency information from all the words in both the TLA and the RTC. The model will be openly available as JSON files on our GitHub page. We intend to publish the scripts used in the alignment method at a later stage.
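To make the idea of a model with frequency information concrete, the sketch below counts how often each MdC word occurs with each transliteration and converts the counts to relative frequencies before serializing them as JSON. The JSON layout, the function name and the sample pairs are assumptions for illustration; the structure of the published files may differ.

```python
import json
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_model(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each MdC word to its transliterations with relative frequencies."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for mdc, translit in pairs:
        counts[mdc][translit] += 1

    model: Dict[str, Dict[str, float]] = {}
    for mdc, counter in counts.items():
        total = sum(counter.values())
        # most_common() lists the most frequent transliteration first.
        model[mdc] = {t: c / total for t, c in counter.most_common()}
    return model


# Toy pairs, not taken from the TLA or the RTC.
sample_pairs = [("i-w", "jw"), ("i-w", "jw"), ("i-w", "jw"), ("i-w", "j"), ("m", "m")]
model = build_model(sample_pairs)

# ensure_ascii=False keeps transliteration diacritics readable in the output.
model_json = json.dumps(model, ensure_ascii=False, indent=2)
```

Storing relative frequencies rather than raw counts lets a lookup pick the most likely transliteration of a word directly, at the cost of losing the information about how much evidence supports each entry.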

Bibliography

[1] Rosmorduc, S. 2008. Automated Transliteration of Egyptian Hieroglyphs. In Strudwick, N. (ed.) Information Technology and Egyptology in 2008, Proceedings of the meeting of the Computer Working Group of the International Association of Egyptologists. Bible in Technology, 2, Gorgias Press, 167–183.

[2] Rosmorduc, S. 2020. Automated Transliteration of Late Egyptian Using Neural Networks: An Experiment in “Deep Learning”. Lingua Aegyptia - Journal of Egyptian Language Studies, 28, 233–257.

[3] Rosmorduc, S. 2021. Digital Writing of Hieroglyphic Texts. In Gracia Zamacona, C. & Ortiz-García, J. (eds.), Handbook of Digital Egyptology: Texts. Monografías de Oriente Antiguo, 1. Editorial de Universidad de Alcalá.

[4] Nederhof, M.-J. 2015. OCR of Handwritten Transcriptions of Ancient Egyptian Hieroglyphic Text. In Proceedings of Altertumswissenschaften in a Digital Age: Egyptology, Papyrology and Beyond.

[5] Teilauszug der Datenbank des Vorhabens "Strukturen und Transformationen des Wortschatzes der ägyptischen Sprache" vom Januar 2018. Akademienvorhaben Strukturen und Transformationen des Wortschatzes der ägyptischen Sprache. Text- und Wissenskultur im alten Ägypten. 2018. urn:nbn:de:kobv:b4-opus4-29190.

[6] Schweitzer, S. 2021. AES - Ancient Egyptian Sentences; Corpus of Ancient Egyptian sentences for corpus-linguistic research. GitHub. https://github.com/simondschweitzer/aes.

[7] Rosmorduc, S. 2021. Ramses automated translitteration software. In Lingua Aegyptia (2021-06-15, Vol. 28, pp. 233–257). Zenodo. https://doi.org/10.5281/zenodo.4954597.

[8] Nederhof, M.-J. 2009. Automatic alignment of hieroglyphs and transliteration. In Strudwick, N. (ed.), Information Technology and Egyptology in 2008: Proceedings of the meeting of the Computer Working Group of the International Association of Egyptologists. Bible in Technology, 2, Gorgias Press, 71–92. http://hdl.handle.net/10023/1667.

[9] Needleman, S. B. and Wunsch, C. D. 1970. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology, 48(3), 443–453.

Period: 8 Mar 2023
Event title: Digital Humanities in the Nordic and Baltic Countries
Event type: Conference
Conference number: 7
Location: Oslo/Stavanger/Bergen, Norway
Degree of recognition: International