Guessing lexicon entries using finite-state methods

Research output: Chapter in book/report/conference proceedings › Conference contribution › Professional

Abstract

A practical method for interactive guessing of LEXC lexicon entries is presented. The method is based on describing groups of similarly inflected words using regular expressions. The patterns are compiled into a finite-state transducer (FST) which maps any word form into the possible LEXC lexicon entries which could generate it. The same FST can be used (1) for converting conventional headword lists into LEXC entries, (2) for interactive guessing of entries, (3) for corpus-assisted interactive guessing and (4) for guessing entries from corpora. A method of representing affixes as a table is presented, as well as how the tables can be converted into LEXC format for several different purposes, including morphological analysis and entry guessing. The method has been implemented using the HFST finite-state transducer tools and its Python embedding plus a number of small Python scripts for conversions. The method is tested with a nearly complete implementation of Finnish verbs. An experiment of generating Finnish verb entries out of corpus data is also described, as well as the creation of a full-scale analyzer for Finnish verbs using the conversion patterns.
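The guessing idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: ordinary Python regular expressions stand in for the compiled FST, and the two inflection classes, their ending lists and the continuation class names (V_SANOA, V_TULLA) are hypothetical and far smaller than a real description of Finnish verbs. Each word form is mapped to every entry that could generate it, and intersecting the candidates of several forms mimics corpus-assisted guessing.

```python
import re

# Hypothetical toy patterns, not taken from the paper:
# (continuation class, endings the class allows, ending of the headword)
PATTERNS = [
    ("V_SANOA", ["a", "n", "t", "mme", "tte", "vat"], "a"),
    ("V_TULLA", ["la", "en", "et", "emme", "ette", "evat"], "la"),
]


def compile_patterns(patterns):
    """Turn each ending into a regex that captures the stem before it."""
    compiled = []
    for cls, endings, head_ending in patterns:
        regexes = [
            re.compile(rf"^(?P<stem>\w+){re.escape(ending)}$")
            for ending in endings
        ]
        compiled.append((cls, regexes, head_ending))
    return compiled


def guess_entries(word_form, compiled):
    """Return all LEXC-style entries that could generate word_form."""
    entries = set()
    for cls, regexes, head_ending in compiled:
        for regex in regexes:
            match = regex.match(word_form)
            if match:
                # Entry layout: headword, continuation class, semicolon.
                entries.add(f"{match.group('stem')}{head_ending} {cls} ;")
    return sorted(entries)


def guess_from_forms(word_forms, compiled):
    """Keep only the entries that are compatible with every given form."""
    candidate_sets = [set(guess_entries(form, compiled)) for form in word_forms]
    return sorted(set.intersection(*candidate_sets)) if candidate_sets else []


COMPILED = compile_patterns(PATTERNS)

if __name__ == "__main__":
    print(guess_entries("tulen", COMPILED))
    print(guess_from_forms(["tulen", "tulla"], COMPILED))
```

A single form such as "tulen" is ambiguous between the two toy classes, whereas adding a second form of the same verb leaves only one candidate, which is the intuition behind using several forms or a corpus to narrow the guesses.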
Original language: English
Title of host publication: Proceedings of the Fourth International Workshop on Computational Linguistics for Uralic Languages
Editors: Tommi Pirinen, Michael Rießler, Jack Rueter, Trond Trosterud, Francis M. Tyers
Number of pages: 19
Place of publication: Stroudsburg
Publisher: The Association for Computational Linguistics
Publication date: Jan 2018
Pages: 59-77
Status: Published - Jan 2018
MoE publication type: D3 Professional conference proceedings
Event: International Workshop for Computational Linguistics of Uralic Languages - University of Helsinki, Department of Modern Languages, Helsinki, Finland
Duration: 8 Jan 2018 to 9 Jan 2018
Conference number: 4
http://blogs.helsinki.fi/language-technology/iwclul-2018/

Fields of science

  • 6121 Languages
  • 113 Computer and information sciences

Cite this

Koskenniemi, K. M. (2018). Guessing lexicon entries using finite-state methods. In T. Pirinen, M. Rießler, J. Rueter, T. Trosterud, & F. M. Tyers (Eds.), Proceedings of the Fourth International Workshop on Computational Linguistics for Uralic Languages (pp. 59-77). Stroudsburg: The Association for Computational Linguistics.
Koskenniemi, Kimmo Matti. / Guessing lexicon entries using finite-state methods. Proceedings of the Fourth International Workshop on Computational Linguistics for Uralic Languages. editor / Tommi Pirinen ; Michael Rießler ; Jack Rueter ; Trond Trosterud ; Francis M. Tyers. Stroudsburg : The Association for Computational Linguistics, 2018. pp. 59-77
@inproceedings{93c86a9554a84fbcb745e6bf942632e5,
title = "Guessing lexicon entries using finite-state methods",
abstract = "A practical method for interactive guessing of LEXC lexicon entries is presented. The method is based on describing groups of similarly inflected words using regular expressions. The patterns are compiled into a finite-state transducer (FST) which maps any word form into the possible LEXC lexicon entries which could generate it. The same FST can be used (1) for converting conventional headword lists into LEXC entries, (2) for interactive guessing of entries, (3) for corpus-assisted interactive guessing and (4) for guessing entries from corpora. A method of representing affixes as a table is presented, as well as how the tables can be converted into LEXC format for several different purposes, including morphological analysis and entry guessing. The method has been implemented using the HFST finite-state transducer tools and its Python embedding plus a number of small Python scripts for conversions. The method is tested with a nearly complete implementation of Finnish verbs. An experiment of generating Finnish verb entries out of corpus data is also described, as well as the creation of a full-scale analyzer for Finnish verbs using the conversion patterns.",
keywords = "6121 Languages, computational linguistics, language technology, finite-state methods, lexicon, 113 Computer and information sciences, natural language processing, finite-state methods",
author = "Koskenniemi, {Kimmo Matti}",
year = "2018",
month = "1",
language = "English",
pages = "59--77",
editor = "Tommi Pirinen and Michael Rie{\ss}ler and Jack Rueter and Trond Trosterud and Tyers, {Francis M.}",
booktitle = "Proceedings of the Fourth International Workshop on Computational Linguistics for Uralic Languages",
publisher = "The Association for Computational Linguistics",
address = "United States",

}

Koskenniemi, KM 2018, Guessing lexicon entries using finite-state methods. in T Pirinen, M Rießler, J Rueter, T Trosterud & FM Tyers (eds), Proceedings of the Fourth International Workshop on Computational Linguistics for Uralic Languages. The Association for Computational Linguistics, Stroudsburg, pp. 59-77, International Workshop for Computational Linguistics of Uralic Languages, Helsinki, Finland, 08/01/2018.

Guessing lexicon entries using finite-state methods. / Koskenniemi, Kimmo Matti.

Proceedings of the Fourth International Workshop on Computational Linguistics for Uralic Languages. ed. / Tommi Pirinen; Michael Rießler; Jack Rueter; Trond Trosterud; Francis M. Tyers. Stroudsburg : The Association for Computational Linguistics, 2018. pp. 59-77.

TY - GEN
T1 - Guessing lexicon entries using finite-state methods
AU - Koskenniemi, Kimmo Matti
PY - 2018/1
Y1 - 2018/1
AB - A practical method for interactive guessing of LEXC lexicon entries is presented. The method is based on describing groups of similarly inflected words using regular expressions. The patterns are compiled into a finite-state transducer (FST) which maps any word form into the possible LEXC lexicon entries which could generate it. The same FST can be used (1) for converting conventional headword lists into LEXC entries, (2) for interactive guessing of entries, (3) for corpus-assisted interactive guessing and (4) for guessing entries from corpora. A method of representing affixes as a table is presented, as well as how the tables can be converted into LEXC format for several different purposes, including morphological analysis and entry guessing. The method has been implemented using the HFST finite-state transducer tools and its Python embedding plus a number of small Python scripts for conversions. The method is tested with a nearly complete implementation of Finnish verbs. An experiment of generating Finnish verb entries out of corpus data is also described, as well as the creation of a full-scale analyzer for Finnish verbs using the conversion patterns.
KW - 6121 Languages
KW - computational linguistics
KW - language technology
KW - finite-state methods
KW - lexicon
KW - 113 Computer and information sciences
KW - natural language processing
KW - finite-state methods
M3 - Conference contribution
SP - 59
EP - 77
BT - Proceedings of the Fourth International Workshop on Computational Linguistics for Uralic Languages
A2 - Pirinen, Tommi
A2 - Rießler, Michael
A2 - Rueter, Jack
A2 - Trosterud, Trond
A2 - Tyers, Francis M.
PB - The Association for Computational Linguistics
CY - Stroudsburg
ER -

Koskenniemi KM. Guessing lexicon entries using finite-state methods. In Pirinen T, Rießler M, Rueter J, Trosterud T, Tyers FM, editors, Proceedings of the Fourth International Workshop on Computational Linguistics for Uralic Languages. Stroudsburg: The Association for Computational Linguistics. 2018. p. 59-77