Guessing lexicon entries using finite-state methods

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Professional


A practical method for interactive guessing of LEXC lexicon entries is presented. The method is based on describing groups of similarly inflected words using regular expressions. The patterns are compiled into a finite-state transducer (FST) which maps any word form into the possible LEXC lexicon entries which could generate it. The same FST can be used (1) for converting conventional headword lists into LEXC entries, (2) for interactive guessing of entries, (3) for corpus-assisted interactive guessing and (4) for guessing entries from corpora. A method of representing affixes as a table is presented, as well as how the tables can be converted into LEXC format for several different purposes, including morphological analysis and entry guessing. The method has been implemented using the HFST finite-state transducer tools and their Python embedding, plus a number of small Python scripts for conversions. The method is tested with a near-complete implementation of Finnish verbs. An experiment in generating Finnish verb entries from corpus data is also described, as is the creation of a full-scale analyzer for Finnish verbs using the conversion patterns.
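The guessing idea can be illustrated with a minimal sketch. It is not the implementation described in the abstract: it uses Python's standard `re` module as a stand-in for a compiled FST, and the continuation class names (`V_SANOA`, `V_SAADA`), stem patterns and ending lists are hypothetical and much cruder than real inflection patterns would be. It shows how a single word form can map to several candidate LEXC-style entries, and how further attested forms (as in corpus-assisted guessing) can narrow the candidates.

```python
import re

# Hypothetical inflection classes, invented for this illustration:
# (continuation class name, regex for the stem, possible endings).
CLASSES = [
    ("V_SANOA", r"[a-zäö]*[aouyäö]",
     ["a", "ä", "n", "t", "mme", "tte"]),
    ("V_SAADA", r"[a-zäö]*(?:aa|ää|yy|uo|yö|ie)",
     ["da", "dä", "n", "t", "mme", "tte"]),
]


def guess(form):
    """Return the set of LEXC-style entries that could generate `form`."""
    entries = set()
    for cls, stem_pattern, endings in CLASSES:
        for ending in endings:
            # The whole form must be a permitted stem followed by the ending.
            m = re.fullmatch(f"({stem_pattern}){re.escape(ending)}", form)
            if m:
                entries.add(f"{m.group(1)} {cls} ;")
    return entries


def guess_from_forms(forms):
    """Keep only the entries that are compatible with every given form."""
    candidate_sets = [guess(form) for form in forms]
    return set.intersection(*candidate_sets) if candidate_sets else set()


if __name__ == "__main__":
    print(sorted(guess("saan")))                       # ambiguous on its own
    print(sorted(guess_from_forms(["saan", "saada"])))  # narrowed by a second form
```

In this toy setup the form `saan` alone is compatible with both classes, while adding the form `saada` leaves only one candidate. A transducer built from real patterns would do the same kind of mapping in one lookup and over a far richer set of classes.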
Original language: English
Title of host publication: Proceedings of the Fourth International Workshop on Computational Linguistics for Uralic Languages
Editors: Tommi Pirinen, Michael Rießler, Jack Rueter, Trond Trosterud, Francis M. Tyers
Number of pages: 19
Place of Publication: Stroudsburg
Publisher: The Association for Computational Linguistics
Publication date: Jan 2018
Publication status: Published - Jan 2018
MoE publication type: D3 Professional conference proceedings
Event: International Workshop for Computational Linguistics of Uralic Languages - University of Helsinki, Department of Modern Languages, Helsinki, Finland
Duration: 8 Jan 2018 - 9 Jan 2018
Conference number: 4

Fields of Science

  • 6121 Languages
  • computational linguistics
  • language technology
  • finite-state methods
  • lexicon
  • 113 Computer and information sciences
  • natural language processing
  • finite-state methods
