406 Open Source Hadoop Software Projects
Free and open source Hadoop code projects, including engines, APIs, generators, and tools.
Data Science Ipython Notebooks 22374 ⭐
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Spotify Luigi 15353 ⭐
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Eclipse Deeplearning4j 12338 ⭐
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.
H2o 3 5701 ⭐
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Hadoop Book 3321 ⭐
Example source code accompanying O'Reilly's "Hadoop: The Definitive Guide" by Tom White
Presto 4797 ⭐
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Interestinglab Waterdrop 2966 ⭐
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
Nagios Plugins 1039 ⭐
450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
Moosefs 1211 ⭐
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Kylo 983 ⭐
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Dataspherestudio 1856 ⭐
DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
Harisekhon Dockerfiles 967 ⭐
50+ DockerHub public images for Docker & Kubernetes - DevOps, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak
Bigdata Interview 1143 ⭐
Dist Keras 620 ⭐
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Hadoop_study 783 ⭐
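Regularly updated documentation for commonly used big data components in the Hadoop ecosystem. Focus areas, in order: Flink, Solr, Sparksql, ES, Scala, Kafka, Hbase/phoenix, Redis, Kerberos. (The project includes Hadoop mind maps, Evernote notes, simple Scala demos, common utility classes, and desensitized train code. Continuously updated!)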
Gis Tools For Hadoop 492 ⭐
The GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.
Devops Python Tools 521 ⭐
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Ytk Learn 347 ⭐
Ytk-learn is a distributed machine learning library which implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
Gather Deployment 341 ⭐
Gathers scalable Tensorflow and Python infrastructure deployment, Husein Go-To for development, 100% Docker.
Cascading 327 ⭐
Cascading is a feature rich API for defining and executing complex and fault tolerant data processing flows locally or on a cluster.
Cloudbreak 316 ⭐
CDP Public Cloud is an integrated analytics and data management platform deployed on cloud services. It offers broad data analytics and artificial intelligence functionality along with secure user access and data governance features.
Behemoth 286 ⭐
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
Hadoop Mini Clusters 273 ⭐
hadoop-mini-clusters provides an easy way to test Hadoop projects directly in your IDE
Hadoop Attack Library 241 ⭐
A collection of pentest tools and resources targeting Hadoop environments
Sparkrdma 230 ⭐
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Hadoop Connectors 242 ⭐
Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.
Hive Jdbc Uber Jar 223 ⭐
Hive JDBC "uber" or "standalone" jar based on the latest Apache Hive version
Bigdata Playground 192 ⭐
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Hadoopcryptoledger 134 ⭐
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Dynamometer 125 ⭐
A tool for scale and performance testing of HDFS with a specific focus on the NameNode.
Introtohadoopandmr__udacity_course 116 ⭐
🐘 Source code for assignments of Udacity course "Introduction to Hadoop and MapReduce"
Avro Hadoop Starter 112 ⭐
Example MapReduce jobs in Java, Hive, Pig, and Hadoop Streaming that work on Avro data.
Hdfs Shell 128 ⭐
HDFS Shell is an HDFS manipulation tool to work with functions integrated in Hadoop DFS
Big Data Mapreduce Course 107 ⭐
Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University
Devops Bash Tools 433 ⭐
700+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Kafka, Docker, APIs, Hadoop, SQL, PostgreSQL, MySQL, Hive, Impala, Travis CI, Jenkins, Concourse, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LDAP, Code/Build Linting, pkg mgmt for Linux, Mac, Python, Perl, Ruby, NodeJS, Golang, Advanced dotfiles: .bashrc, .vimrc, .gitconfig, .screenrc, .tmux.conf, .psqlrc ...
Parquet4s 178 ⭐
Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
Nnanalytics 99 ⭐
NameNodeAnalytics is a self-help utility for scouting and maintaining the namespace of an HDFS instance.
Haproxy Configs 144 ⭐
80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Kubernetes, Elasticsearch, SolrCloud, HBase, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, Hue, ZooKeeper, SSH, RabbitMQ, Redis, Riak, Cloudera, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, Rancher etc.