Abstract
This paper presents recent contributions to the creation of NLP tools and resources for Occitan. Several existing resources were modified or adapted, in particular a rule-based tokenizer, a morphosyntactic lexicon and a treebank. These resources were used to train and evaluate neural lemmatization models. As part of these experiments, a large corpus based on Wikipedia (2 million tokens) was POS-tagged and lemmatized. This new resource is shared through Zenodo.
Translated title of the contribution | New resources and lemmatization experiments for Occitan |
---|---|
Original language | French |
Title of host publication | Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) : Volume 1: travaux de recherche originaux - articles longs |
Editors | Christophe Servan, Anne Vilnat |
Number of pages | 15 |
Place of Publication | Paris |
Publisher | Association pour le Traitement Automatique des Langues |
Publication date | Jun 2023 |
Pages | 217-231 |
Publication status | Published - Jun 2023 |
MoE publication type | D3 Professional conference proceedings |
Event | Conférence sur le Traitement Automatique des Langues Naturelles, Paris, France. Duration: 5 Jun 2023 → 9 Jun 2023. Conference number: 30 |
Fields of Science
- 6121 Languages
- Occitan
- lemmatisation
- low-resourced languages
Projects
- CorCoDial: Corpus-based computational dialectology: exploiting machine translation techniques to extract, visualize and interpret dialectal patterns
Scherrer, Y. (Project manager), Tiedemann, J. (Project manager), Mickus, T. (Participant), Miletic Haddad, A. (Participant), Psaltaki, E. (Participant), Roemling, D. (Participant) & Siewert, J. (Participant)
Suomen Akatemia Projektilaskutus
01/09/2021 → 31/08/2025
Project: Research Council of Finland: Academy Project
Datasets
- OcWikiAnnot: Annotated Wikipedia Corpus of Occitan
Miletic Haddad, A. (Creator), Zenodo, 20 Apr 2023
DOI: 10.5281/zenodo.7777340, https://doi.org/10.5281/zenodo.7777340
Dataset