160 Open Source Apache Spark Software Projects
Free and open source Apache Spark code projects including engines, APIs, generators, and tools.
Oryxproject Oryx 1777 ⭐
Oryx 2: Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning
Dotnet Spark 1705 ⭐
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Spark On K8s Operator 1698 ⭐
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Ironmussa Optimus 1127 ⭐
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Kafka Storm Starter 725 ⭐
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Goodreads_etl_pipeline 873 ⭐
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Dist Keras 618 ⭐
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Openscoring 559 ⭐
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
Wirbelsturm 334 ⭐
Wirbelsturm is a Vagrant- and Puppet-based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
Sparkmeasure 405 ⭐
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.
Opencypher Morpheus 310 ⭐
Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.
Spark Jupyter Aws 263 ⭐
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
Data Accelerator 257 ⭐
Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience to help with the creation, editing and management of Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.
Sparkrdma 229 ⭐
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Awesome Ai Infrastructures 254 ⭐
Infrastructures™ for Machine Learning Training/Inference in Production.
Bigdata Playground 189 ⭐
A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, Node.js, Angular, GraphQL
Azure Event Hubs Spark 169 ⭐
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Memverge Splash 110 ⭐
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Learningsparkv2 524 ⭐
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Scalable Data Science 146 ⭐
Scalable Data Science: course sets in big data using Apache Spark on Databricks, and their mathematical, statistical and computational foundations using SageMath.
Seznam Euphoria 78 ⭐
Euphoria is an open source Java API for creating unified big-data processing flows. It provides an engine independent programming model which can express both batch and stream transformations.
Sparksql For Hbase 69 ⭐
Learn how to use Spark SQL and the HSpark connector package to create and query data tables that reside in HBase region servers
Kafka Streaming Click Analysis 67 ⭐
Use Kafka and Apache Spark streaming to perform click stream analytics
Mmtf Pyspark 58 ⭐
Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
Spark Tda 47 ⭐
SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.
Liquidsvm 54 ⭐
Support vector machines (SVMs) and related kernel-based learning algorithms are a well-known class of machine learning algorithms, for non-parametric classification and regression. liquidSVM is an implementation of SVMs whose key features are: fully integrated hyper-parameter selection, extreme speed on both small and large data sets, full flexibility for experts, and inclusion of a variety of different learning scenarios: multi-class classification, ROC, and Neyman-Pearson learning, and least-squares, quantile, and expectile regression.
Ansible Spark Cluster 48 ⭐
Ansible roles to install a Spark Standalone cluster (HDFS/Spark/Jupyter Notebook) or an Ambari-based Spark cluster
Spark Twitter Sentiment Analysis 44 ⭐
Sentiment Analysis of a Twitter Topic with Spark Structured Streaming
Spark As Service Using Embedded Server 49 ⭐
This application comes as Spark2.1-as-Service-Provider using an embedded, Reactive-Streams-based, fully asynchronous HTTP server
Real Time Stream Processing Engine 39 ⭐
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Spark Transformers 38 ⭐
Spark-Transformers: Library for exporting Apache Spark MLlib models to use them in any Java application with no other dependencies.
Bigclam Apachespark 38 ⭐
Overlapping community detection in large-scale networks using the BigCLAM model built on Apache Spark
Aws Kinesis Scala 36 ⭐
Scala client for Amazon Kinesis. Also provides write-to-Kinesis capability for Apache Spark and Spark Streaming.