427 Open Source Big Data Software Projects
Free and open source big data code projects including engines, APIs, generators, and tools.
Data Science Ipython Notebooks 19637 ⭐
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Catboost 5446 ⭐
A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression, and other machine learning tasks in Python, R, Java, and C++. Supports computation on CPU and GPU.
H2o 3 5016 ⭐
Open Source Fast Scalable Machine Learning Platform For Smarter Applications: Deep Learning, Gradient Boosting & XGBoost, Random Forest, Generalized Linear Modeling (Logistic Regression, Elastic Net), K-Means, PCA, Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Zeppelin 4952 ⭐
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Apache Couchdb 4699 ⭐
Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability.
Stream Framework 4397 ⭐
Stream Framework is a Python library for building news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.
Moloch 4414 ⭐
Moloch is an open source, large scale, full packet capturing, indexing, and database system.
Delta Io Delta 2815 ⭐
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.
Vue Virtual Scroll List 2264 ⭐
⚡️ A Vue component that renders very large data lists efficiently, with high rendering performance.
Presto 1426 ⭐
Home of the community managed version of Presto, the distributed SQL query engine for big data, under the auspices of the Presto Software Foundation.
Spark Py Notebooks 1275 ⭐
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks.
Datumbox Framework 1057 ⭐
Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
Moosefs 935 ⭐
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System
Dataflowjavasdk 852 ⭐
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
Rakam API 766 ⭐
📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)
Autodl 772 ⭐
Automated Deep Learning without ANY human intervention. 1st-place solution for the AutoDL challenge at NeurIPS.
Spark Movie Lens 732 ⭐
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Data Science Career 591 ⭐
Repository of career resources for Data Science, Machine Learning, Big Data, and Business Analytics.
Nodefluent Kafka Streams 564 ⭐
Equivalent to kafka-streams 🐙 for Node.js ✨🐢🚀✨
Thrill 510 ⭐
Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++
Athenz 469 ⭐
Open source platform for X.509 certificate based service authentication and fine grained access control in dynamic infrastructures. Athenz supports provisioning and configuration (centralized authorization) use cases as well as serving/runtime (decentralized authorization) use cases.
Decentralized Internet 380 ⭐
An SDK/library for decentralized web and distributed computing projects.
Mockneat 367 ⭐
MockNeat is a Java 8+ library that facilitates the generation of arbitrary data for your applications.
Cloudbreak 299 ⭐
A tool for provisioning and managing Apache Hadoop clusters in the cloud. Cloudbreak, as part of the Hortonworks Data Platform, makes it easy to provision, configure and elastically grow HDP clusters on cloud infrastructure. Cloudbreak can be used to provision Hadoop across cloud infrastructure providers including AWS, Azure, GCP and OpenStack.