Bounded-Depth High-Coverage Search Space for Noncrossing Parses

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

Abstract

A recently proposed encoding for non- crossing digraphs can be used to imple- ment generic inference over families of these digraphs and to carry out first-order factored dependency parsing. It is now shown that the recent proposal can be substantially streamlined without information loss. The improved encoding is less dependent on hierarchical processing and it gives rise to a high-coverage bounded-depth approximation of the space of non- crossing digraphs. This subset is presented elegantly by a finite-state machine that recognises an infinite set of encoded graphs. The set includes more than 99.99% of the 0.6 million noncrossing graphs obtained from the UDv2 treebanks through planarisation. Rather than taking the low probability of the residual as a flat rate, it can be modelled with a joint probability distribution that is factorised into two underlying stochastic processes – the sentence length distribution and the related conditional distribution for deep nesting. This model points out that deep nesting in the streamlined code requires extreme sentence lengths. High depth is categorically out in common sentence lengths but emerges slowly at infrequent lengths that prompt further inquiry.
Original languageEnglish
Title of host publicationProceedings of the 13th International Conference on Finite State Methods and Natural Language Processing : FSMNLP 2017
EditorsFrank Drewes
Number of pages11
Place of PublicationStroudsburg
PublisherThe Association for Computational Linguistics
Publication date4 Sep 2017
Pages30-40
ISBN (Print)978-1-5108-4746-0
DOIs
Publication statusPublished - 4 Sep 2017
MoE publication typeA4 Article in conference proceedings
EventInternational Conference on Finite State Methods and Natural Language Processing (FSMNLP) - Umeå, Umeå, Sweden
Duration: 5 Sep 20177 Sep 2017
Conference number: 13

Fields of Science

  • 6121 Languages
  • dependency syntax
  • syntax
  • bracketing
  • finite-state automata
  • self-embedding
  • context-free grammar
  • sentence length
  • regular expressions
  • universal dependencies
  • treebanks
  • corpus linguistcs
  • projectivity
  • syntactic complexity
  • limits on embedding
  • 113 Computer and information sciences
  • state complexity
  • finite automata
  • context-free grammar
  • graphs
  • digraphs
  • superbrackets
  • finite-state approximation
  • truncated stack
  • 111 Mathematics
  • path-width
  • narrowness
  • regularity
  • 112 Statistics and probability
  • sentence length
  • syntactic complexity
  • Finnish TreeBank 1

    Bartis, I. (Other), FIN-CLARIN-konsortio, Nykykielten laitos, Helsingin yliopisto, 2015

    Dataset

Cite this