187 Open Source Corpus Software Projects
Free and open source corpus code projects including engines, APIs, generators, and tools.
Dariusk Corpora 4311 ⭐
A collection of small corpora of interesting data for the creation of bots and similar stuff.
Awesome Deeplearning Resources 2503 ⭐
Deep learning and deep reinforcement learning research papers and some code
Weibo_terminater 2291 ⭐
The final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator
Cluebenchmark Clue 2495 ⭐
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Awesome Persian Nlp Ir 524 ⭐
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Wordless 455 ⭐
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Fakenewscorpus 305 ⭐
A dataset of millions of news articles scraped from a curated list of data sources.
Nlvr 204 ⭐
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Wp2txt 146 ⭐
WP2TXT extracts plain text data from Wikipedia dump files (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata.
Code Docstring Corpus 151 ⭐
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Colibri Core 115 ⭐
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Malay Dataset 156 ⭐
Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
Nlp_bahasa_resources 231 ⭐
A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia
Russian_news_corpus 80 ⭐
Corpus of stemmed Russian mass media texts (lemmatized, i.e. morphologically normalized)
Japanese Words To Vectors 72 ⭐
Word2vec (word to vectors) approach for the Japanese language using Gensim and MeCab.
Folia 54 ⭐
FoLiA: Format for Linguistic Annotation. FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this project contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
Pre Modern_chinese_corpus_dataset 91 ⭐
This is a pre-modern Chinese language corpus (from the Song dynasty in the 10th century AD to the Republic of China in the early 20th century). The language resources are all in txt format, arranged by dynasty (Song, Yuan, Ming, Early Qing, Late Qing, and Republic of China). The relevant authors' information and types of literature have also been labelled.
Named Entity Recognition Template 51 ⭐
Build a deep learning model for predicting the named entities from text.
Streusle 50 ⭐
STREUSLE: a corpus with comprehensive lexical semantic annotation (multiword expressions, supersenses)
German_nouns 71 ⭐
A list of ~98,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.
Typing Assistant 37 ⭐
Typing Assistant autocompletes words and suggests predictions for the next word. This makes typing faster and more intelligent, and reduces effort.
Eventstoryline 51 ⭐
Event StoryLine Corpus - annotated data, baselines and evaluation scripts, evaluation data.
Probabilistic Rnn Da Classifier 22 ⭐
Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model
Gmftbygmftby Opendialog 89 ⭐
An Open-Source Package for Chinese Open-domain Conversational Chatbot (Chinese chit-chat dialogue system; one-click deployment of a WeChat chit-chat bot)
Wordfish Python 19 ⭐
Extract relationships from standardized terms in a corpus of interest with deep learning