OcWikiAnnot: Annotated Wikipedia Corpus of Occitan

Dataset

Description

OcWikiAnnot is a corpus of Occitan Wikipedia content that has been tokenized, part-of-speech (PoS) tagged, and lemmatized. The corpus contains 100 000 sentences totaling 2 037 723 tokens. It is based on the Occitan Wikipedia corpus that is part of the Leipzig Corpora Collection.
Date made available: 20 Apr 2023
Publisher: Zenodo
Date of data production: Feb 2023 - Apr 2023
