139 Open Source Corpus Software Projects
Free and open source corpus code projects including engines, APIs, generators, and tools.
Dariusk Corpora 3868 ⭐
A collection of small corpora of interesting data for the creation of bots and similar stuff.
Awesome Deeplearning Resources 2308 ⭐
Deep learning and deep reinforcement learning research papers and some code
Weibo_terminater 2275 ⭐
The final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator.
Cluebenchmark Clue 1312 ⭐
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Awesome Persian Nlp Ir 415 ⭐
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Wordless 313 ⭐
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Fakenewscorpus 228 ⭐
A dataset of millions of news articles scraped from a curated list of data sources.
Nlvr 186 ⭐
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Wp2txt 144 ⭐
WP2TXT extracts plain text data from a Wikipedia dump file (XML, compressed with Bzip2), stripping all MediaWiki markup and other metadata.
Code Docstring Corpus 125 ⭐
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Colibri Core 109 ⭐
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Malay Dataset 97 ⭐
Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
Nlp_bahasa_resources 102 ⭐
A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia
Russian_news_corpus 76 ⭐
Russian mass media stemmed texts corpus: a corpus of lemmatized (morphologically normalized) texts from Russian mass media
Japanese Words To Vectors 58 ⭐
Word2vec (word to vectors) approach for the Japanese language using Gensim and MeCab.
Folia 48 ⭐
FoLiA: Format for Linguistic Annotation. FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
Pre Modern_chinese_corpus_dataset 49 ⭐
A pre-modern Chinese language corpus dataset, covering the period from the Song dynasty (10th century AD) to the Republic of China (early 20th century). All resources are in txt format, arranged by period (Song, Yuan, Ming, Early Qing, Late Qing, and Republic of China). Author information and types of literature have also been labelled.
Named Entity Recognition Template 45 ⭐
Build a deep learning model for predicting the named entities from text.
Streusle 36 ⭐
STREUSLE: a corpus with comprehensive lexical semantic annotation (multiword expressions, supersenses)
German_nouns 34 ⭐
A list of ~90,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus some methods to query the data.
Typing Assistant 29 ⭐
Typing Assistant autocompletes words and suggests predictions for the next word. This makes typing faster and more intelligent, and reduces effort.
Eventstoryline 25 ⭐
Event StoryLine Corpus - annotated data, baselines and evaluation scripts, evaluation data.
Probabilistic Rnn Da Classifier 18 ⭐
Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model
Sogou News Text Classification 16 ⭐
Text classification with Machine Learning methods and Pre-Trained Embedding model on Sogou News Corpus
Gmftbygmftby Opendialog 30 ⭐
An Open-Source Package for Chinese Open-domain Conversational Chatbot (a Chinese chit-chat dialogue system with one-click deployment of a WeChat chit-chat bot)
Wordfish Python 15 ⭐
Extract relationships from standardized terms from a corpus of interest with deep learning