OcWikiAnnot: Annotated Wikipedia Corpus of Occitan

Dataset

Description

OcWikiAnnot is a corpus of Wikipedia content in Occitan that is tokenized, PoS-tagged and lemmatized. The corpus contains 100 000 sentences for a total of 2 037 723 tokens. It is based on the Wikipedia corpus in Occitan that is part of the Leipzig Corpora Collection.
Date made available: 20 Apr. 2023
Publisher: Zenodo
Date of data production: Feb. 2023 - Apr. 2023
