Abstract
We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus, which comprises six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. We also experiment on other datasets but do not reach the same level of performance there, because of domain mismatch between training and test data.
Original language | English |
---|---|
Title of host publication | Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text |
Editors | Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi |
Number of pages | 10 |
Place of Publication | Stroudsburg |
Publisher | The Association for Computational Linguistics |
Publication date | 1 Nov 2018 |
Pages | 64-73 |
ISBN (Electronic) | 978-1-948087-79-7 |
Publication status | Published - 1 Nov 2018 |
MoE publication type | A4 Article in conference proceedings |
Event | Workshop on Noisy User-generated Text, Brussels, Belgium; Duration: 1 Nov 2018 → 1 Nov 2018; Conference number: 4 |
Fields of Science
- 113 Computer and information sciences
- 6121 Languages