202 Open Source Bigdata Software Projects
Free and open source bigdata code projects including engines, APIs, generators, and tools.
Tdengine 13921 ⭐
An open-source big data platform designed and optimized for the Internet of Things (IoT).
Awesome Bigdata 9344 ⭐
A curated list of awesome big data frameworks, resources and other awesomeness.
Vaex 5163 ⭐
Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀
Poli 1714 ⭐
An easy-to-use BI server built for SQL lovers. Power data analysis in SQL and gain faster business insights.
Dotnet Spark 1508 ⭐
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Griddb 1308 ⭐
GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.
Spark Py Notebooks 1275 ⭐
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Ironmussa Optimus 939 ⭐
Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Spark Movie Lens 732 ⭐
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Kube Batch 724 ⭐
A batch scheduler for Kubernetes for high-performance workloads, e.g. AI/ML, BigData, HPC
Bigdata Interview 700 ⭐
Coding Now 655 ⭐
Rdkmaster Jigsaw 342 ⭐
Jigsaw (七巧板, "tangram") provides a set of web components based on Angular5/8/9+. Its main purpose is to help application developers build complex, highly interactive, user-friendly web pages. Jigsaw supports the development of all applications in ZTE's Big Data Product.
Feedirss API 333 ⭐
RSS as RESTful. This service allows you to transform an RSS feed into an awesome API.
Datawave 326 ⭐
DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.
Mvillarrealb Docker Spark Cluster 246 ⭐
A simple Spark standalone cluster for your testing environment purposes
Datafaker 249 ⭐
Datafaker is a large-scale test data and flow test data generation tool. Datafaker fakes data and inserts it into varied data sources.
Every Single Day I Tldr 234 ⭐
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Big Data Rosetta Code 236 ⭐
Code snippets for solving common big data problems in various platforms. Inspired by Rosetta Code
Aws Etl Orchestrator 221 ⭐
A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda.
Hadoop Attack Library 214 ⭐
A collection of pentest tools and resources targeting Hadoop environments
Sparkrdma 204 ⭐
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Athenacli 132 ⭐
AthenaCLI is a CLI tool for AWS Athena service that can do auto-completion and syntax highlighting.
Azure Event Hubs Spark 130 ⭐
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Hadoopcryptoledger 123 ⭐
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Spark R Notebooks 110 ⭐
R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks
Kotlin Spark API 130 ⭐
This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
Tennis Crystal Ball 97 ⭐
Ultimate Tennis Statistics and Tennis Crystal Ball - Tennis Big Data Analysis and Prediction
Memverge Splash 94 ⭐
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Tensorbase 111 ⭐
TensorBase is building a modern big data warehouse with performance at its core.
Covid19 Market Waiting Times 94 ⭐
A project to help people stand in line at the market as little as possible
Clustering4ever 89 ⭐
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
Ignite Book Code Samples 86 ⭐
All code samples, scripts and more in-depth examples for the book "High Performance in-memory computing with Apache Ignite". Please use the repository "the-apache-ignite-book" for Ignite version 2.6 or above.
Big Data Engineering Coursera Yandex 69 ⭐
Big Data for Data Engineers Coursera Specialization from Yandex
Meetups Archivos 61 ⭐
Slides, code, and videos from meetups, data science days, video calls, and workshops. Data Science Research is a non-profit organization that seeks to spread and decentralize knowledge of Data Science and Artificial Intelligence in Peru, giving opportunities to new talent through MeetUps, Workshops, and Knowledge and Research Incubators.
The Apache Ignite Book 45 ⭐
All code samples, scripts and more in-depth examples for The Apache Ignite Book. Covers Apache Ignite 2.6 or above
Awesome Coder Resources 44 ⭐
Streambench 43 ⭐
Measuring the performance of popular streaming engines with Yahoo's Streaming Benchmark
Hadoopoffice 41 ⭐
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Grapetree 34 ⭐
GrapeTree is a fully interactive, tree visualization program, which supports facile manipulations of both tree layout and metadata. Click the first link to launch: https://achtman-lab.github.io/GrapeTree/MSTree_holder.html
Telemetry Batch View 32 ⭐
A Scala framework to build derived datasets, aka batch views, of Telemetry data.
Reddit_sse_stream 33 ⭐
A Server Side Event stream to deliver Reddit comments and submissions in near real-time to a client.
Learn Hadoop And Spark 30 ⭐
This repository gathers a curated list of resources to learn Hadoop for free.
Sparktwitteranalysis 26 ⭐
An Apache Spark standalone application using the Spark API in Scala. The application uses Simple Build Tool (SBT) to build the project.
Spark And Kafka_iot Data Processing And Analytics 25 ⭐
Final Project for IoT: Big Data Processing and Analytics class in UCSC Extension. Analyzing U.S. nationwide temperature from IoT sensors in real time
2019_egu_workshop_jupyter_notebooks 24 ⭐
Short course on interactive analysis of Big Earth Data with Jupyter Notebooks
Thepersonalmsds 22 ⭐
The Personal MS(DS) is an initiative to customize the Data Science Masters roadmap according to one's interests, giving the learner complete autonomy. The intuition behind #thepersonalmsds is to upgrade skills without formally enrolling in a Master's program at a University
Clickhouse Replication Example 21 ⭐
Create an example replicated + distributed dataset using Clickhouse
Firstyear 21 ⭐
This repository contains the work I've done in my first year along with some study materials which I had collected.
Bboxdb 21 ⭐
BBoxDB is a scalable, highly available and distributed data store for multi-dimensional big data. The software supports operations like hyperrectangle queries or spatial joins.
Jorgeacf Dockerfiles 22 ⭐
Multiple Docker container images for the main Big Data tools (Hadoop, Spark, Kafka, HBase, Cassandra, Zookeeper, Zeppelin, Drill, Flink, Hive, Hue, Mesos, ...)
Etl Starter Kit 18 ⭐
Extract, Transform, Load (ETL) refers to a process in database usage and especially in data warehousing. This repository contains a starter kit featuring ETL-related work.
Bigquery Data Lineage 27 ⭐
Reference implementation for real-time Data Lineage tracking for BigQuery using Audit Logs, ZetaSQL and Dataflow.
Cwlab 17 ⭐
An open-source framework for simplified deployment of the Common Workflow Language using a graphical web interface
Jigsaw Seed 16 ⭐
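This is the seed project for the component library Jigsaw-七巧板 (https://github.com/rdkmaster/jigsaw). It is recommended that all new apps be built starting from this project as their seed.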
Detedit 16 ⭐
A graphical user interface for annotating and editing events detected in long-term acoustic monitoring data
Gan_deeplearning4j 15 ⭐
Automatic feature engineering with Generative Adversarial Networks, using Deeplearning4j and Apache Spark.
Spark Streaming Monitoring With Lightning 15 ⭐
Plot live stats as graphs from an Apache Spark application using Lightning-viz
Aws Auto Terminate Idle Emr 21 ⭐
AWS Auto Terminate Idle AWS EMR Clusters Framework is an AWS-based solution that uses AWS CloudWatch and AWS Lambda, with a Python script built on Boto3, to terminate AWS EMR clusters that have been idle for a specified period of time.
Gomap 15 ⭐
Run your MapReduce workloads as a single binary on a single machine with multiple CPUs and high memory. On most cloud providers, many small machines are priced the same as a few heavy machines.
Twitter Sentiment Analysis Using Hadoop 12 ⭐
A project that fetches and reads tweets and shows analyses such as who is most influential
Alinous Elastic Db 11 ⭐
Alinous Elastic DB is a database for big data. It can scale both the SQL engine and the storage engine. This database engine is built for scaling and sharding.
Leaflet_heatmap 12 ⭐
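A simple visualization of Huzhou phone call data. Assuming the data volume is too large to draw a heatmap directly in the browser, the heatmap rendering step is moved to offline computation and analysis. After the data is processed in parallel with Apache Spark, the heatmap is also rendered with Apache Spark, and leafletjs then loads the OpenStreetMap layer and the heatmap layer to give good interactivity. Rendering is currently implemented with Apache Spark; either because Apache Spark is not well suited to this kind of computation or because I did not design the algorithm well, the parallel computation is slower than single-machine computation. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .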
Computing With Data 11 ⭐
Code samples for my book "Computing with Data: An Introduction to the Data Industry"
Masterdatcom_bdcc_practice 10 ⭐
Practice and Workshop on BigData and Cloud Computing using Docker Containers and OpenNebula. HDFS, Hadoop and Spark+R
Techher 10 ⭐
Repo containing files for the TechHer event and the 'Let your Data tell you the Real Story: Advanced Analytics on Azure' hands-on lab
Aggregation Viewer Client Feature Layer 10 ⭐