Efficient Discrimination Between Closely Related Languages

Jörg Tiedemann, Nikola Ljubešić

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

In this paper, we revisit the problem of language identification with the focus on proper discrimination between closely related languages. Strong similarities between certain languages make it very hard to classify them correctly using standard methods that have been proposed in the literature. Dedicated models that focus on specific discrimination tasks help to improve the accuracy of general-purpose language identification tools. We propose and compare methods based on simple document classification techniques trained on parallel corpora of closely related languages and methods that emphasize discriminating features in terms of blacklisted words. Our experiments demonstrate that these techniques are highly accurate for the difficult task of discriminating between Bosnian, Croatian and Serbian. The best setup yields an absolute improvement of over 9% in accuracy over the best performing baseline using a state-of-the-art language identification tool.
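
The idea of emphasizing discriminating features through blacklisted words can be illustrated with a minimal Python sketch. It rests on one possible reading of the abstract: each language has a list of words that are not expected to occur in it, and a document is assigned to the language whose blacklist it violates least. The word lists, the `BLACKLISTS` table, and the `classify` function are illustrative assumptions and are not taken from the paper; Bosnian is left out to keep the toy example small.

```python
import re

# Toy blacklists (illustrative assumptions, not the paper's data):
# each language code maps to words NOT expected in that language.
BLACKLISTS = {
    "hr": {"ko", "avgust", "hiljada"},    # forms more typical of Serbian
    "sr": {"tko", "kolovoz", "tisuća"},   # forms more typical of Croatian
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def classify(text, blacklists=BLACKLISTS):
    """Return the language whose blacklist matches the fewest tokens.

    Ties are broken alphabetically by language code.
    """
    tokens = tokenize(text)
    hits = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in blacklists.items()
    }
    return min(sorted(hits), key=hits.get)
```

A call such as `classify("Tko je došao?")` counts one hit against the `sr` blacklist and none against `hr`, so the sketch assigns the text to `hr`. Exact token matching ignores inflected forms, which a realistic system would need to handle.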
Original language: English
Title of host publication: Unknown host publication
Number of pages: 16
Publication date: 1 Dec 2012
Pages: 2619-2634
Publication status: Published - 1 Dec 2012
MoE publication type: A4 Article in conference proceedings

Fields of Science

  • 6121 Languages
  • language identification
  • language discrimination
