Open Source Libs
73 Open Source Data Pipeline Software Projects
Free and open source data pipeline code projects including engines, APIs, generators, and tools.
Kedro
6212 ⭐
A Python framework for creating reproducible, maintainable and modular data science code.
Data Engineering Howto
2105 ⭐
A list of useful resources to learn Data Engineering from scratch
Data Science On Gcp
994 ⭐
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Go Streams
830 ⭐
A lightweight stream processing library for Go
Infoslack Awesome Kafka
466 ⭐
A list about Apache Kafka
Nonechucks
340 ⭐
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Tributary
292 ⭐
Streaming reactive and dataflow graphs in Python
Scalable Data Science Platform
162 ⭐
Content for architecting a data science platform for products using Luigi, Spark & Flask.
Mobydq
160 ⭐
🐳 Tool to automate data quality checks on data pipelines
Blurr
96 ⭐
Data transformations for the ML era
Aws Pdf Textract Pipeline
113 ⭐
🔍 Data pipeline for crawling PDFs from the web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.
Ob_bulkstash
106 ⭐
Bulk Stash is a Docker rclone service to sync or copy files between different storage services. For example, you can copy files to or from a remote storage service, such as from Amazon S3 to Google Cloud Storage, or from your laptop to remote storage.
Pansori
120 ⭐
Tools for ASR Corpus Generation from Online Video
Hookah
85 ⭐
A cross-platform tool for data pipelines.
Serverless Data Pipeline Sam
77 ⭐
Serverless Data Pipeline powered by Kinesis Firehose, API Gateway, Lambda, S3, and Athena
Outbrain Aletheia
48 ⭐
Outbrain's data pipeline framework
Dc Sdk JS
50 ⭐
A data collection SDK for the browser environment
Trembita
44 ⭐
Model complex data transformation pipelines easily
Delta Architecture
53 ⭐
Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline
Feagen
33 ⭐
(deprecated) A fast and memory-efficient Python data engineering framework for machine learning.
Pandas To Postgres
43 ⭐
Copy Pandas DataFrames and HDF5 files to PostgreSQL database
Stairs
44 ⭐
Framework which helps you to make parallel/distributed calculations using data pipelines
Ooni Pipeline
36 ⭐
OONI data processing pipeline
Mldotnet Real Time Data Streaming Workshop
37 ⭐
A Machine Learning and Real-Time Data Analytics Workshop
Network Pipeline
36 ⭐
Network traffic data pipeline for real-time predictions and building datasets for deep neural networks
Nyt Entity Service
23 ⭐
A web service for disambiguating and canonically storing entities.
Saisoku
36 ⭐
Saisoku is a Python module that helps you build complex pipelines of batch file/directory transfer/sync jobs.
Machine Learning Data Pipeline
22 ⭐
Pipeline module for parallel real-time data processing, for machine learning model development and production use.
Jobanalytics_and_search
22 ⭐
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Airflowetl
18 ⭐
Blog post on ETL pipelines with Airflow
Opentrials Airflow
16 ⭐
Configuration and definitions of Airflow for OpenTrials
Richflow
16 ⭐
A synchronous data pipeline processing, data sharing, and stream processing library for Node.js and JavaScript. Actionable and transformable pipeline data processing.
Dataquest_eng
16 ⭐
Here's how to get DataQuest's Data Engineering Track missions' content to work on your localhost. Using data from my Valenbisi ARIMA modeling project, I document my steps using PostgreSQL, Postico, and the Command Line to get our DataQuest exercises running out of a Jupyter Notebook.
Automating Your Data Pipeline With Apache Airflow
28 ⭐
Automating Your Data Pipeline with Apache Airflow
Data Toolkit
28 ⭐
Data Pipeline Toolkit for Early-Stage Startups
Aws Data Pipeline Developer Guide
14 ⭐
The open source version of the AWS Data Pipeline documentation. To provide feedback & requests for changes, submit issues in this repository, or make proposed changes & submit a pull request.
Rpi
11 ⭐
RPJiOS: RPJ's RPi OS, a sensor data platform for the Raspberry Pi built with Python 2.7 and Redis.
Data Pipeline Project
15 ⭐
Data pipeline project
Kestra
45 ⭐
Kestra, the modern, scalable orchestrator & scheduler open source platform.
Klio
680 ⭐
Smarter data pipelines for audio.
Kedro Viz
321 ⭐
Visualise your Kedro data pipelines.
Dataengineeringproject
361 ⭐
Example end to end data engineering project.
Snowplow
5966 ⭐
The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP
Whylogs Python
735 ⭐
Profile and monitor your ML data pipeline end-to-end. Join us on Slack at http://join.slack.whylabs.ai/
Cuelake
257 ⭐
Use SQL to build ELT pipelines on a data lakehouse.
Flupy
169 ⭐
Fluent data pipelines for python and your shell
Elementary Lineage
193 ⭐
Elementary is an open-source data observability framework for modern data teams, starting with data lineage.
Watchmen Matryoshka Doll
123 ⭐
Watchmen Platform is a low-code data platform for data pipelines, metadata management, analysis, and quality management
Glue Public
110 ⭐
🔥 Data pipeline and automation tool.
Gusty
91 ⭐
Making DAG construction easier
Datajob
83 ⭐
Build and deploy a serverless data pipeline on AWS with no effort.
Basis Devkit
68 ⭐
Data pipelines from re-usable components
Scicloj.ml
88 ⭐
A Clojure machine learning library
Airflow Testing Ci Workflow
61 ⭐
(project & tutorial) dag pipeline tests + ci/cd setup
Tvdboom Atom
53 ⭐
Automated Tool for Optimized Modelling
Datatap Python
37 ⭐
Focus on Algorithm Design, Not on Data Wrangling
Practical Data Engineering
64 ⭐
Real estate dagster pipeline
Datalake Etl Pipeline
29 ⭐
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Data Streaming Using Kafka
21 ⭐
Built a stream processing data pipeline to get data from disparate systems into a dashboard using Kafka as an intermediary.
Augraphy
30 ⭐
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
Dpex
20 ⭐
Distributed DataLoader For Pytorch Based On Ray
Pandemic Knowledge
17 ⭐
A fully-featured multi-source data pipeline for continuously extracting knowledge from COVID-19 data.
Spark Data Pipeline
14 ⭐
This project describes how to write a full ETL data pipeline using Spark.
Rivery_cli
16 ⭐
Rivery CLI
Slurpee
14 ⭐
A GUI frontend to manage blockchain ingestion with slurp
Vineeths96 Data Engineering Nanodegree
28 ⭐
This repository holds the python files and notebooks associated with the Udacity Data Engineering Nanodegree.
Ztqsteve Tap News
11 ⭐
A real-time news scraping and recommendation system
Dataengineercafe Data Engineering Book
33 ⭐
The Data Engineering Book - a data engineering book by Thai people, for Thai people
Ordered Concurrently
12 ⭐
Ordered-concurrently is a Go library for parallel processing with ordered output. It processes work concurrently and returns the output on a channel in the order of the input. It is useful for processing items in a queue concurrently while getting the output in the order provided by the queue.
Dbt Snowplow Web
12 ⭐
A fully incremental model that transforms raw web event data generated by the Snowplow JavaScript tracker into a series of derived tables of varying levels of aggregation.
Interestinglab Waterdrop
2966 ⭐
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
Pydoit Doit
1291 ⭐
task management & automation tool
Conduitio Conduit
92 ⭐
Data Integration for Production Data Stores.