How to embed noncrossing trees in Universal Dependencies treebanks in a low-complexity regular language

Forskningsoutput: TidskriftsbidragArtikelVetenskapligPeer review

Sammanfattning

A recently proposed balanced-bracket encoding (Yli-Jyrä and Gómez-Rodríguez 2017) has given us a way to embed all noncrossing dependency graphs into the string space and to formulate their exact arc-factored inference problem (Kuhlmann and Johnsson 2015) as the best string problem in a dynamically constructed and weighted unambiguous context-free grammar. The current work improves the encoding
and makes it shallower by omitting redundant brackets from it. The streamlined encoding gives rise to a bounded-depth subset approximation that is represented by a small finite-state automaton. When bounded to 7 levels of balanced brackets, the automaton has 762 states and represents a strict superset of more than 99.9999% of the noncrossing
trees available in Universal Dependencies 2.4 (Nivre et al. 2019). In addition, it strictly contains all 15-vertex noncrossing digraphs. When bounded to 4 levels and 90 states, the automaton still captures 99.2% of all noncrossing trees in the reference dataset. The approach is flexible and extensible towards unrestricted graphs, and it suggests tight
finite-state bounds for dependency parsing, and for the main existing
parsing methods.
Bidragets översatta titelKuinka risteämättömät Universal Dependencies puupankkien puut voidaan voidaan upottaa matalakompleksiseen säännölliseen kieleen
Originalspråkengelska
TidskriftJournal of Language Modelling
Volym7
Nummer2
Sidor (från-till)177-232
Antal sidor56
ISSN2299-856X
DOI
StatusPublicerad - 2019
MoE-publikationstypA1 Tidskriftsartikel-refererad

Vetenskapsgrenar

  • 113 Data- och informationsvetenskap
  • 6121 Språkvetenskaper

Citera det här