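Yeni bir sözdizimsel işaretleme yönteminin kullanımıyla Türkçe'nin istatistiksel ayrıştırma başarımının artırılması (Improving the statistical parsing performance of Turkish through the use of a new syntactic annotation method)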

Research output: Thesis, Master's thesis, Theses

Abstract

In this work, we present a critical analysis of the dependency grammar that has come to be the de facto standard for Turkish language processing studies. Although widely recognized and used in several Turkish corpora including the well-known METU-Sabancı Treebank (MST), the only major syntactically annotated Turkish corpus to date, the grammar is partly outdated, improvable and extensible. Moreover, the METU-Sabancı Treebank itself is often criticized for its inconsistent annotation and difficulty of parsing.

Many recent studies on the syntactic parsing of Turkish have focused on fine-tuning specific aspects of their parsing frameworks and have failed to make substantial overall progress in parsing performance. We take a detour from specific case studies that would only yield local performance improvements, and delve into the entire structure of the annotation framework. We investigate the current Turkish annotation conventions in detail, identify flaws and deficiencies with respect to both manual annotation and automatic parsing, and then propose measures that might be taken to alleviate these issues.

Furthermore, as web data become increasingly available for study and the ability to efficiently parse non-canonical sentences gains importance, we place special emphasis on making dependency annotation as lenient on non-canonical texts as possible. The extent to which the colloquial language employed by social web users differs from well-typed formal language is indeed very large, and it is often not enough to orthographically normalize non-canonical sentences in a pre-processing routine to render them as successfully parsable as edited formal texts. As part of this work, we also attempt to parametrize the differences of the language of the web, and likewise suggest what morphosyntactic reforms would likely improve parsing performance.

In accordance with our findings, we also propose a new, improved dependency annotation framework for Turkish. The proposed framework additionally focuses on minimalism and ease of manual annotation, featuring only 16 dependency types that are decidedly more coherent and intuitive compared to the 26 labels of the original framework. We justify every change that the proposed dependency grammar makes to the original version by either showing conformity with the design principles we explain or demonstrating overlap with universally recognized conventions that have long been established.

As the first implementations of the proposed annotation framework, we introduce two new treebanks: 1) A new version of the METU-Sabancı Treebank keeping the same token structure and morphosyntactic features but reannotated with the new dependency types, for which we propose the name ITU-METU-Sabancı Treebank (IMST) in recognition of the considerable previous effort on the original treebank as well as our contribution, and 2) The ITU Web Treebank (IWT), the first Turkish corpus composed of non-canonical user-input sentences extracted from the web, annotated from the ground up in normalization, morphology and syntax layers. Both of our new corpora are marked for deep dependencies in order to support future semantic role labeling studies.

We do not establish any hierarchy between deep and surface dependencies, and rather employ a basic approach that simply supports multiple heads for a single constituent. Although this notation makes our syntactically annotated sentences incompatible with most syntactic parsers in common use, it is straightforward to remap from the multi-headed raw sentences to single-headed projections whenever necessary, and so it boosts the expressiveness of our syntactic annotation without incurring any loss in applicability. For our parsing tests that we discuss in the later sections, we use two elementary single-head choosing methods as a precursor to smarter head choosing routines that may be developed for future work.
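The remapping from multi-headed annotation to a single-headed projection can be illustrated with a minimal sketch. The abstract does not name the two head choosing methods used in the thesis, so the two rules below (`first` and `nearest`), the data layout, and the labels are hypothetical stand-ins chosen only for illustration.

```python
from typing import Dict, List, Tuple

# An arc is (head_index, label); head index 0 stands for the artificial root.
Arc = Tuple[int, str]


def choose_first(dep: int, heads: List[Arc]) -> Arc:
    """Hypothetical rule: keep the first head listed in the annotation."""
    return heads[0]


def choose_nearest(dep: int, heads: List[Arc]) -> Arc:
    """Hypothetical rule: keep the head closest to the dependent.

    Ties go to the head listed first, since min() returns the first minimum.
    """
    return min(heads, key=lambda arc: abs(arc[0] - dep))


RULES = {"first": choose_first, "nearest": choose_nearest}


def project(sentence: Dict[int, List[Arc]], rule: str = "first") -> Dict[int, Arc]:
    """Map every token of a multi-headed sentence to exactly one head."""
    choose = RULES[rule]
    return {dep: choose(dep, heads) for dep, heads in sentence.items()}


# Toy sentence with made-up labels: token 3 has two candidate heads.
sentence = {
    1: [(2, "SUBJ")],
    2: [(0, "PRED")],
    3: [(1, "MOD"), (4, "OBJ")],
    4: [(2, "COORD")],
}

first_projection = project(sentence, "first")
nearest_projection = project(sentence, "nearest")
```

Because the projection only selects among heads that are already annotated, the multi-headed source stays intact and a different rule can be applied later without reannotation.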

Although constituents are conventionally annotated with a single head in dependency parsing, the practice is not always beneficial as there may be more than one head for a dependent that would make sense given clausal structures within the sentence containing the dependent. In such cases, while automatic parsers may predict a meaningful head for a given dependent, the gold-standard validation set may be annotated with another head that is also meaningful, but would still cause the prediction to be determined as incorrect simply because the two heads do not match. We mean for the newly-introduced multi-headed representation to also help in alleviating false negatives caused by such scenarios, by use of a new evaluation metric that we call "relaxed evaluation" (as opposed to the conventional "strict evaluation") able to validate predicted dependencies that match any one of the heads designated in the gold-standard.
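The difference between the two metrics can be sketched as follows. This is an illustrative reading of the description above, not the thesis's evaluation script: the function name, the data layout, and the labels are assumptions, and the strict score is assumed to compare against one designated gold head per token.

```python
from typing import List, Set, Tuple

# An arc is (head_index, label); head index 0 stands for the artificial root.
Arc = Tuple[int, str]


def labeled_attachment_scores(
    gold_single: List[Arc],
    gold_multi: List[Set[Arc]],
    predicted: List[Arc],
) -> Tuple[float, float]:
    """Return (strict, relaxed) labeled attachment scores as percentages.

    strict:  the prediction must equal the single designated gold arc.
    relaxed: the prediction may equal any of the gold arcs of the token.
    """
    if not (len(gold_single) == len(gold_multi) == len(predicted)):
        raise ValueError("all inputs must cover the same tokens")
    total = len(predicted)
    strict_hits = sum(p == g for p, g in zip(predicted, gold_single))
    relaxed_hits = sum(p in arcs for p, arcs in zip(predicted, gold_multi))
    return 100.0 * strict_hits / total, 100.0 * relaxed_hits / total


# Toy example with made-up labels: the first token has two acceptable heads.
gold_multi = [{(2, "SUBJ"), (4, "SUBJ")}, {(0, "PRED")}, {(4, "OBJ")}, {(2, "COORD")}]
gold_single = [(2, "SUBJ"), (0, "PRED"), (4, "OBJ"), (2, "COORD")]
predicted = [(4, "SUBJ"), (0, "PRED"), (4, "OBJ"), (2, "MOD")]

strict, relaxed = labeled_attachment_scores(gold_single, gold_multi, predicted)
print(f"strict LAS = {strict:.1f}%, relaxed LAS = {relaxed:.1f}%")
```

In this toy case the first token is counted as an error under strict evaluation but accepted under relaxed evaluation, while the mislabeled last token is an error under both.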

After our discussions, we present our detailed empirical investigations on the new treebanks in order to demonstrate the impact of our proposed annotation schemes with respect to the original framework. We perform cross-validation on all of our models and cross-check parsing models trained from each combination of training sets and single-head choosing routines with each other where appropriate. We provide the figures resulting from our parsing tests and discuss their significance in detail. Additionally, we conduct a series of targeted remapping tests in order to make sure that certain annotation scheme changes were indeed well-founded and effective. Furthermore, our experiments indicate that the parsing performance increases we attain are not caused by the reduction of the dependency label set, but rather related to our more coherent annotation framework prescribed by the new grammar.

Our final tests show that our best model for the IMST attains labeled attachment scores of 75.1% for strict evaluation and 75.7% for relaxed evaluation, surpassing the state-of-the-art parsing score of 65.9% by a large margin. Cross-validation of the IWT also yields 79.7% for strict evaluation and 80.1% for relaxed evaluation for the best model. Considering these scores, our new resources yield an improvement of up to nearly 12 percentage points in the performance of parsing web data.
Original language: Turkish
Awarding institution
  • Istanbul Technical University
Supervisor
  • Eryiğit, Gülşen, Supervisor, External person
Award date: 26 Jun 2015
Place of publication: Turkey
Publisher
Status: Published - 2015
MoE publication type: G2 Master's thesis, polytechnic Master's thesis

Research datasets

IMST Treebank

Sulubacak, U. (Creator), Pamay, T. (Creator), Istanbul Technical University, 9 Apr 2016

Dataset

ITU Web Treebank

Pamay, T. (Creator), Sulubacak, U. (Creator), Istanbul Technical University, 5 Jun 2015

Dataset
