Implementing universal dependency, morphology, and multiword expression annotation standards for Turkish language processing

Research output: Contribution to journal › Article › Scientific › Peer-reviewed

Abstract

Released only a year ago as the outputs of a research project (“Parsing Web 2.0 Sentences”, supported in part by a TÜBİTAK 1001 grant (No. 112E276) and a part of the ICT COST Action PARSEME (IC1207)), IMST and IWT are currently the most comprehensive Turkish dependency treebanks in the literature. This article introduces the final states of our treebanks, as well as a newly integrated hierarchical categorization of the multiheaded dependencies and their organization in an exclusive deep dependency layer in the treebanks. It also presents the adaptation of recent studies on standardizing multiword expression and named entity annotation schemes for the Turkish language and integration of benchmark annotations into the dependency layers of our treebanks and the mapping of the treebanks to the latest Universal Dependencies (v2.0) standard, ensuring further compliance with rising universal annotation trends. In addition to significantly boosting the universal recognition of Turkish treebanks, our recent efforts have shown an improvement in their syntactic parsing performance (up to 77.8%/82.8% LAS and 84.0%/87.9% UAS for IMST/IWT, respectively). The final states of the treebanks are expected to be more suited to different natural language processing tasks, such as named entity recognition, multiword expression detection, transfer-based machine translation, semantic parsing, and semantic role labeling.
Original language: English
Article number: 43
Journal: Turkish Journal of Electrical Engineering and Computer Sciences
Volume: 26
Issue: 3
Pages (from-to): 1662-1672
Number of pages: 11
ISSN: 1300-0632
DOI: 10.3906/elk-1706-81
Status: Published - 30 May 2018
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 6121 Languages
  • 113 Computer and information sciences

Cite this

@article{ff90ee8ef95e49b7b5493df49eb7d8a4,
title = "Implementing universal dependency, morphology, and multiword expression annotation standards for Turkish language processing",
abstract = "Released only a year ago as the outputs of a research project (“Parsing Web 2.0 Sentences”, supported in part by a TÜBİTAK 1001 grant (No. 112E276) and a part of the ICT COST Action PARSEME (IC1207)), IMST and IWT are currently the most comprehensive Turkish dependency treebanks in the literature. This article introduces the final states of our treebanks, as well as a newly integrated hierarchical categorization of the multiheaded dependencies and their organization in an exclusive deep dependency layer in the treebanks. It also presents the adaptation of recent studies on standardizing multiword expression and named entity annotation schemes for the Turkish language and integration of benchmark annotations into the dependency layers of our treebanks and the mapping of the treebanks to the latest Universal Dependencies (v2.0) standard, ensuring further compliance with rising universal annotation trends. In addition to significantly boosting the universal recognition of Turkish treebanks, our recent efforts have shown an improvement in their syntactic parsing performance (up to 77.8{\%}/82.8{\%} LAS and 84.0{\%}/87.9{\%} UAS for IMST/IWT, respectively). The final states of the treebanks are expected to be more suited to different natural language processing tasks, such as named entity recognition, multiword expression detection, transfer-based machine translation, semantic parsing, and semantic role labeling.",
keywords = "6121 Languages, 113 Computer and information sciences",
author = "Umut Sulubacak",
year = "2018",
month = "5",
day = "30",
doi = "10.3906/elk-1706-81",
language = "English",
volume = "26",
pages = "1662--1672",
journal = "Turkish Journal of Electrical Engineering and Computer Sciences",
issn = "1300-0632",
publisher = "Turkiye Klinikleri Journal of Medical Sciences",
number = "3",

}

TY - JOUR

T1 - Implementing universal dependency, morphology, and multiword expression annotation standards for Turkish language processing

AU - Sulubacak, Umut

PY - 2018/5/30

Y1 - 2018/5/30

N2 - Released only a year ago as the outputs of a research project (“Parsing Web 2.0 Sentences”, supported in part by a TÜBİTAK 1001 grant (No. 112E276) and a part of the ICT COST Action PARSEME (IC1207)), IMST and IWT are currently the most comprehensive Turkish dependency treebanks in the literature. This article introduces the final states of our treebanks, as well as a newly integrated hierarchical categorization of the multiheaded dependencies and their organization in an exclusive deep dependency layer in the treebanks. It also presents the adaptation of recent studies on standardizing multiword expression and named entity annotation schemes for the Turkish language and integration of benchmark annotations into the dependency layers of our treebanks and the mapping of the treebanks to the latest Universal Dependencies (v2.0) standard, ensuring further compliance with rising universal annotation trends. In addition to significantly boosting the universal recognition of Turkish treebanks, our recent efforts have shown an improvement in their syntactic parsing performance (up to 77.8%/82.8% LAS and 84.0%/87.9% UAS for IMST/IWT, respectively). The final states of the treebanks are expected to be more suited to different natural language processing tasks, such as named entity recognition, multiword expression detection, transfer-based machine translation, semantic parsing, and semantic role labeling.

AB - Released only a year ago as the outputs of a research project (“Parsing Web 2.0 Sentences”, supported in part by a TÜBİTAK 1001 grant (No. 112E276) and a part of the ICT COST Action PARSEME (IC1207)), IMST and IWT are currently the most comprehensive Turkish dependency treebanks in the literature. This article introduces the final states of our treebanks, as well as a newly integrated hierarchical categorization of the multiheaded dependencies and their organization in an exclusive deep dependency layer in the treebanks. It also presents the adaptation of recent studies on standardizing multiword expression and named entity annotation schemes for the Turkish language and integration of benchmark annotations into the dependency layers of our treebanks and the mapping of the treebanks to the latest Universal Dependencies (v2.0) standard, ensuring further compliance with rising universal annotation trends. In addition to significantly boosting the universal recognition of Turkish treebanks, our recent efforts have shown an improvement in their syntactic parsing performance (up to 77.8%/82.8% LAS and 84.0%/87.9% UAS for IMST/IWT, respectively). The final states of the treebanks are expected to be more suited to different natural language processing tasks, such as named entity recognition, multiword expression detection, transfer-based machine translation, semantic parsing, and semantic role labeling.

KW - 6121 Languages

KW - 113 Computer and information sciences

U2 - 10.3906/elk-1706-81

DO - 10.3906/elk-1706-81

M3 - Article

VL - 26

SP - 1662

EP - 1672

JO - Turkish Journal of Electrical Engineering and Computer Sciences

JF - Turkish Journal of Electrical Engineering and Computer Sciences

SN - 1300-0632

IS - 3

M1 - 43

ER -