72 Open Source Data Pipeline Software Projects
Free and open source data pipeline code projects including engines, APIs, generators, and tools.
Kedro 4815 ⭐
A Python framework for creating reproducible, maintainable and modular data science code.
Data Science On Gcp 985 ⭐
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Nonechucks 340 ⭐
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Scalable Data Science Platform 162 ⭐
Content for architecting a data science platform for products using Luigi, Spark & Flask.
Aws Pdf Textract Pipeline 112 ⭐
Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript
Ob_bulkstash 106 ⭐
Bulk Stash is a Docker rclone service to sync or copy files between different storage services. For example, you can copy files from one remote storage service to another, such as Amazon S3 to Google Cloud Storage, or from your laptop to remote storage.
Serverless Data Pipeline Sam 77 ⭐
Serverless Data Pipeline powered by Kinesis Firehose, API Gateway, Lambda, S3, and Athena
Feagen 33 ⭐
(deprecated) A fast and memory-efficient Python data engineering framework for machine learning.
Stairs 44 ⭐
A framework that helps you run parallel or distributed calculations using data pipelines
Mldotnet Real Time Data Streaming Workshop 37 ⭐
A Machine Learning and Real-Time Data Analytics Workshop
Network Pipeline 35 ⭐
Network traffic data pipeline for real-time predictions and building datasets for deep neural networks
Saisoku 36 ⭐
Saisoku is a Python module that helps you build complex pipelines of batch file/directory transfer/sync jobs.
Machine Learning Data Pipeline 22 ⭐
Pipeline module for parallel real-time data processing, for machine learning model development and production use.
Jobanalytics_and_search 22 ⭐
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Richflow 15 ⭐
Dataquest_eng 16 ⭐
Here's how to get DataQuest's Data Engineering Track missions' content to work on your localhost. Using data from my Valenbisi ARIMA modeling project, I document my steps using PostgreSQL, Postico, and the Command Line to get our DataQuest exercises running out of a Jupyter Notebook.
Automating Your Data Pipeline With Apache Airflow 28 ⭐
Automating Your Data Pipeline with Apache Airflow
Aws Data Pipeline Developer Guide 13 ⭐
The open source version of the AWS Data Pipeline documentation. To provide feedback & requests for changes, submit issues in this repository, or make proposed changes & submit a pull request.
Rpi 11 ⭐
RPJiOS: RPJ's RPi OS, a sensor data platform for the Raspberry Pi built with Python 2.7 and Redis.
Snowplow 5956 ⭐
The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP
Whylogs Python 723 ⭐
Profile and monitor your ML data pipeline end-to-end. Join us on Slack at http://join.slack.whylabs.ai/
Elementary Lineage 181 ⭐
Elementary is an open-source data observability framework for modern data teams, starting with data lineage.
Watchmen Matryoshka Doll 123 ⭐
Watchmen Platform is a low-code data platform for data pipelines, metadata management, analysis, and quality management
Datalake Etl Pipeline 28 ⭐
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Data Streaming Using Kafka 21 ⭐
Built a stream processing data pipeline to get data from disparate systems into a dashboard using Kafka as an intermediary.
Augraphy 27 ⭐
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
Pandemic Knowledge 17 ⭐
A fully-featured multi-source data pipeline for continuously extracting knowledge from COVID-19 data.
Vineeths96 Data Engineering Nanodegree 24 ⭐
This repository holds the python files and notebooks associated with the Udacity Data Engineering Nanodegree.
Dataengineercafe Data Engineering Book 33 ⭐
The Data Engineering Book: a data engineering book by Thais, for Thais
Ordered Concurrently 12 ⭐
Ordered-concurrently is a library for parallel processing with ordered output in Go. It processes work concurrently and returns the output on a channel in the order of the input. It is useful for processing items from a queue in parallel while receiving output in the order the queue provided them.
Dbt Snowplow Web 12 ⭐
Interestinglab Waterdrop 2789 ⭐
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
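The ordered-output idea described in the Ordered Concurrently entry above can be sketched in a few lines. This is a minimal Python illustration of the general technique, not that Go library's API: the names `process_in_order` and `slow_square` are hypothetical, and a thread pool stands in for Go's goroutines and channels.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def process_in_order(items, work, workers=4):
    """Run work(item) concurrently and yield results in input order.

    Hypothetical helper for illustration only.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything up front so the tasks run in parallel.
        futures = [pool.submit(work, item) for item in items]
        # Wait on the futures in submission order, not completion order,
        # so the output order matches the input order.
        for future in futures:
            yield future.result()


def slow_square(n):
    # Larger inputs sleep less, so completion order differs from input order.
    time.sleep(0.01 * (5 - n))
    return n * n


results = list(process_in_order(range(5), slow_square))
print(results)
```

Waiting on each future in submission order is what keeps the output aligned with the input; a result that finishes early simply sits in its future until its turn comes.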