559 Open Source Big Data Software Projects
Free and open source big data code projects including engines, APIs, generators, and tools.
Data Science Ipython Notebooks 22374 ⭐
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Catboost 6315 ⭐
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
H2o 3 5701 ⭐
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Zeppelin 5561 ⭐
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Apache Couchdb 5200 ⭐
Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability
Stream Framework 4584 ⭐
Stream Framework is a Python library that lets you build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.
Moloch 5031 ⭐
Arkime (formerly Moloch) is an open source, large scale, full packet capturing, indexing, and database system.
Delta Io Delta 3983 ⭐
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.
Vue Virtual Scroll List 3262 ⭐
⚡️ A Vue component that supports very large data lists with efficient, high-performance rendering.
Presto 4797 ⭐
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Spark Py Notebooks 1424 ⭐
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Datumbox Framework 1076 ⭐
Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
Moosefs 1211 ⭐
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Dataflowjavasdk 859 ⭐
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
Rakam API 788 ⭐
📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)
Autodl 935 ⭐
Automated Deep Learning without ANY human intervention. 1st solution for AutoDL.
Spark Movie Lens 769 ⭐
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Data Science Career 692 ⭐
Career resources for data science, machine learning, big data, and business analytics.
Nodefluent Kafka Streams 715 ⭐
Equivalent to kafka-streams 🐙 for Node.js ✨🐢🚀✨
Thrill 543 ⭐
Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++
Athenz 674 ⭐
Open source platform for X.509 certificate based service authentication and fine grained access control in dynamic infrastructures. Athenz supports provisioning and configuration (centralized authorization) use cases as well as serving/runtime (decentralized authorization) use cases.
Cogcomp Nlp 432 ⭐
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
Listenbrainz Server 466 ⭐
Decentralized Internet 449 ⭐
An SDK/library for decentralized web and distributed computing projects