Abstract
We present a new method for unsupervised learning of multilingual symbol (e.g. character) embeddings, without any parallel data or prior knowledge about correspondences between languages. It is able to exploit similarities across languages between the distributions over symbols' contexts of use within their language, even in the absence of any symbols in common to the two languages. In experiments with an artificially corrupted text corpus, we show that the method can retrieve character correspondences obscured by noise. We then present encouraging results of applying the method to real linguistic data, including for low-resourced languages. The learned representations open the possibility of fully unsupervised comparative studies of text or speech corpora in low-resourced languages with no prior knowledge regarding their symbol sets.
Original language | English |
---|---|
Title of host publication | Second Annual Meeting of the Society for Computation in Linguistics (SCiL 2019) |
Number of pages | 10 |
Publisher | The Association for Computational Linguistics |
Publication date | 3 Jan 2019 |
Pages | 19-28 |
Article number | 4 |
ISBN (Electronic) | 978-1-5108-7753-5 |
Publication status | Published - 3 Jan 2019 |
MoE publication type | A4 Article in conference proceedings |
Event | Society for Computation in Linguistics, New York, United States. Duration: 3 Jan 2019 → 6 Jan 2019. Conference number: 2 |
Fields of Science
- 113 Computer and information sciences
- 6121 Languages
Projects
- DLT: Digital language typology: mining from the surface to the core
Vainio, M. (Principal Investigator), Toivonen, H. T. (Principal Investigator), Granroth-Wilding, M. (Project manager) & Hinkka, A. E. (Participant)
01/01/2016 → 31/12/2019
Project: Research project