Language Set Identification in Noisy Synthetic Multilingual Documents

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters and this gives us enough reason to again evaluate the possibility to use existing language identifiers for monolingual text to detect the language set of a multilingual document. We are using a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages.
Original language: English
Title of host publication: Computational Linguistics and Intelligent Text Processing
Editors: A. Gelbukh
Number of pages: 11
Volume: Part I
Publisher: Springer International Publishing AG
Publication date: 2015
Pages: 633-643
ISBN (Print): 978-3-319-18110-3
ISBN (Electronic): 978-3-319-18111-0
DOIs: https://doi.org/10.1007/978-3-319-18111-0_48
Publication status: Published - 2015
MoE publication type: A4 Article in conference proceedings
Event: International Conference on Intelligent Text Processing and Computational Linguistics - Cairo, Egypt
Duration: 14 Apr 2015 → 20 Apr 2015
Conference number: 16

Publication series

Name: Lecture Notes in Computer Science
Volume: 9041
ISSN (Print): 0302-9743

Fields of Science

  • 6121 Languages
  • 113 Computer and information sciences

Cite this

Jauhiainen, T. S., Linden, K., & Jauhiainen, H. A. (2015). Language Set Identification in Noisy Synthetic Multilingual Documents. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing (Vol. Part I, pp. 633-643). (Lecture Notes in Computer Science; Vol. 9041). Springer International Publishing AG. https://doi.org/10.1007/978-3-319-18111-0_48
Jauhiainen, Tommi Sakari ; Linden, Krister ; Jauhiainen, Heidi Annika. / Language Set Identification in Noisy Synthetic Multilingual Documents. Computational Linguistics and Intelligent Text Processing. editor / A. Gelbukh. Vol. Part I Springer International Publishing AG, 2015. pp. 633-643 (Lecture Notes in Computer Science).
@inproceedings{4ee7998d9fe3445da16ede573a40b702,
title = "Language Set Identification in Noisy Synthetic Multilingual Documents",
abstract = "In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters and this gives us enough reason to again evaluate the possibility to use existing language identifiers for monolingual text to detect the language set of a multilingual document. We are using a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages.",
keywords = "6121 Languages, 113 Computer and information sciences",
author = "Jauhiainen, {Tommi Sakari} and Linden, Krister and Jauhiainen, {Heidi Annika}",
note = "Volume: Proceeding volume: Part I",
year = "2015",
doi = "10.1007/978-3-319-18111-0_48",
language = "English",
isbn = "978-3-319-18110-3",
volume = "Part I",
series = "Lecture Notes in Computer Science",
publisher = "Springer International Publishing AG",
pages = "633--643",
editor = "A. Gelbukh",
booktitle = "Computational Linguistics and Intelligent Text Processing",
address = "Switzerland",

}

Jauhiainen, TS, Linden, K & Jauhiainen, HA 2015, Language Set Identification in Noisy Synthetic Multilingual Documents. in A Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing. vol. Part I, Lecture Notes in Computer Science, vol. 9041, Springer International Publishing AG, pp. 633-643, International Conference on Intelligent Text Processing and Computational Linguistics, Cairo, Egypt, 14/04/2015. https://doi.org/10.1007/978-3-319-18111-0_48

Language Set Identification in Noisy Synthetic Multilingual Documents. / Jauhiainen, Tommi Sakari; Linden, Krister; Jauhiainen, Heidi Annika.

Computational Linguistics and Intelligent Text Processing. ed. / A. Gelbukh. Vol. Part I Springer International Publishing AG, 2015. p. 633-643 (Lecture Notes in Computer Science; Vol. 9041).

TY - GEN

T1 - Language Set Identification in Noisy Synthetic Multilingual Documents

AU - Jauhiainen, Tommi Sakari

AU - Linden, Krister

AU - Jauhiainen, Heidi Annika

N1 - Volume: Proceeding volume: Part I

PY - 2015

Y1 - 2015

N2 - In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters and this gives us enough reason to again evaluate the possibility to use existing language identifiers for monolingual text to detect the language set of a multilingual document. We are using a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages.

AB - In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters and this gives us enough reason to again evaluate the possibility to use existing language identifiers for monolingual text to detect the language set of a multilingual document. We are using a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages.

KW - 6121 Languages

KW - 113 Computer and information sciences

U2 - 10.1007/978-3-319-18111-0_48

DO - 10.1007/978-3-319-18111-0_48

M3 - Conference contribution

SN - 978-3-319-18110-3

VL - Part I

T3 - Lecture Notes in Computer Science

SP - 633

EP - 643

BT - Computational Linguistics and Intelligent Text Processing

A2 - Gelbukh, A.

PB - Springer International Publishing AG

ER -

Jauhiainen TS, Linden K, Jauhiainen HA. Language Set Identification in Noisy Synthetic Multilingual Documents. In Gelbukh A, editor, Computational Linguistics and Intelligent Text Processing. Vol. Part I. Springer International Publishing AG. 2015. p. 633-643. (Lecture Notes in Computer Science). https://doi.org/10.1007/978-3-319-18111-0_48