34 Open Source Corpora Software Projects
Free and open source corpora code projects including engines, APIs, generators, and tools.
Entity Recognition Datasets 1076 ⭐
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Self_dialogue_corpus 103 ⭐
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports
Arabic News Article Classification 74 ⭐
Automatic document categorization consists of assigning a category to a text based on the information it contains. We'll follow different supervised machine learning approaches.
Parallel Corpora Tools 32 ⭐
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Evalution 15 ⭐
Dataset containing Semantic Relations and Metadata, for Training and Evaluating Distributional Semantic Models in English and Mandarin Chinese
Lyrics Corpora 16 ⭐
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and Billboard charts
Biomedical_corpora 17 ⭐
Table compiling the list of biomedically-related corpora available for named entity recognition (and some also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152 . If you would like to add other (or your) corpora, please submit a pull request and I'll happily approve it.
Textstelle 14 ⭐
Textstelle is a collection of corpora for the creation of bots and other things that generate text 🤖
Lm Spanish 170 ⭐
Official source for Spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).
Awesome Cantonese Nlp 31 ⭐
A curated list of resources dedicated to Natural Language Processing (NLP) of Cantonese | 粵語 NLP