26 Open Source Corpora Software Projects
Free and open source corpora code projects including engines, APIs, generators, and tools.
Entity Recognition Datasets 766 ⭐
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Self_dialogue_corpus 95 ⭐
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports
Arabic News Article Classification 58 ⭐
Automatic document categorization consists of assigning a category to a text based on the information it contains. This project applies several supervised machine learning approaches.
Evalution 14 ⭐
A dataset of semantic relations and metadata for training and evaluating distributional semantic models in English and Mandarin Chinese
Lyrics Corpora 15 ⭐
An unofficial Python API that lets users build a corpus of lyrics from their favorite artists and the Billboard charts
Biomedical_corpora 13 ⭐
A table listing biomedical corpora available for named entity recognition (some are also suitable for association detection). It was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152. If you would like to add other corpora (or your own), please submit a pull request and I'll happily approve it.
Textstelle 12 ⭐
Textstelle is a collection of corpora for the creation of bots and other things that generate text 🤖