Open Source Libs
Find Open Source Packages
74 Open Source Deduplication Software Projects
Free and open source deduplication code projects including engines, APIs, generators, and tools.
Restic
15513 ⭐
Fast, secure, efficient backup program
Borg
7912 ⭐
Deduplicating archiver with compression and authenticated encryption.
Alertmanager
4655 ⭐
Prometheus Alertmanager
Libpostal
3351 ⭐
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Dupeguru
2464 ⭐
Find duplicate files
Rmlint
1177 ⭐
Extremely fast tool to remove duplicates and other lint from your filesystem
Borgmatic
1062 ⭐
Simple, configuration-driven backup software for servers and workstations
Rdedup
735 ⭐
Data deduplication engine, supporting optional compression and public key encryption.
Jdupes
1011 ⭐
A powerful duplicate file finder and an enhanced fork of 'fdupes'.
Yomguithereal Talisman
626 ⭐
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Recordlinkage
625 ⭐
A toolkit for record linkage and duplicate detection in Python
Kopia
1399 ⭐
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Data Matching Software
247 ⭐
A list of free data matching and record linkage software.
Kvdo
180 ⭐
A pair of kernel modules which provide pools of deduplicated and/or compressed block storage.
Mattilyra Lsh
212 ⭐
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
F483 Dejavu
154 ⭐
Quickly detect already witnessed data.
Vdo
148 ⭐
Userspace tools for managing VDO volumes.
Spark Lucenerdd
125 ⭐
Spark RDD with Lucene's query and entity linkage capabilities
Fingerprints
98 ⭐
Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation.
Blobstash
99 ⭐
Your personal database. Mirror of https://git.sr.ht/~tsileo/blobstash
Intraarchivededuplicator
85 ⭐
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Rltk
88 ⭐
Record Linkage ToolKit (Find and link entities)
Dupd
94 ⭐
CLI utility to find duplicate files
Lieu
73 ⭐
Dedupe/batch geocode addresses and venues around the world with libpostal
Horcrux
69 ⭐
The Dropbox for IPFS (without the icky stuff)
Elemental Lf Benji
112 ⭐
Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
Record Linkage Resources
66 ⭐
Resources for tackling record linkage / deduplication / data matching problems
Deduplication
51 ⭐
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test and a benchmark.
Splink
147 ⭐
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Pgdedupe
39 ⭐
A simple command line interface to the datamade/dedupe library.
Fastcdc Rs
51 ⭐
FastCDC implementation in Rust
Watsondedupe
31 ⭐
Self-contained C# library for data deduplication using Sqlite
Sauvegarde
28 ⭐
Continuous data protection for GNU/Linux (cdpfgl).
X Ryl669 Frost
22 ⭐
A backup program that does deduplication, compression, and encryption
Sparkclean
23 ⭐
A Scalable Data Cleaning Library for PySpark.
Imgdedup
30 ⭐
CLI tool for image duplicate detection
Umicollapse
31 ⭐
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
Dupandas
22 ⭐
Python package for performing deduplication using flexible text matching and cleaning in pandas DataFrames
Rocketmqdeduplistener
129 ⭐
Idempotent, deduplicating message consumer for RocketMQ; supports MySQL or Redis as the idempotency table and works out of the box
Py Image Dedup
68 ⭐
CLI utility to find near duplicate images and remove all but the best copy.
Spark Search
20 ⭐
Spark Search - high performance advanced search features based on Apache Lucene
Dedupsqlfs
19 ⭐
Deduplicating filesystem via Python3, FUSE and SQLite
Recordlinkage Annotator
27 ⭐
A browser user interface for manual labeling of record pairs.
Dduper
100 ⭐
Fast block-level out-of-band BTRFS deduplication tool.
Blobfs
12 ⭐
New project: https://git.sr.ht/~tsileo/blobfs
Marty
12 ⭐
An efficient backup tool inspired by Git, saving your bandwidth and providing global deduplication at file level.
Febrl Fork V0.4.2
21 ⭐
Fork of the Freely Extensible Biomedical Record Linkage program
Dedup
12 ⭐
A command-line tool for deduplicating entries in a file or stream.
Acid Store
43 ⭐
A library for secure, deduplicated, transactional, and verifiable data storage
Borg Qt
17 ⭐
A Qt frontend for the command line software BorgBackup.
Maildir Deduplicate
121 ⭐
📧 CLI to deduplicate emails from mailboxes.
Cargo Limit
82 ⭐
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Autorestic
370 ⭐
Config driven, easy backup cli for restic.
Zingg
371 ⭐
Scalable data mastering, deduplication and entity resolution.
Entity Embed
92 ⭐
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Gencore
90 ⭐
Generate duplex/single consensus reads to reduce sequencing noise and remove duplicates
Fs Curator
69 ⭐
Automation for the serious data hoarder that wants to have their data and use it
Yadf
26 ⭐
Yet Another Dupes Finder
Zpaqfranz
30 ⭐
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Fastcdc Py
16 ⭐
FastCDC implementation in Python https://pypi.org/project/fastcdc/
Bakdata Dedupe
16 ⭐
Java DSL for (online) deduplication
Nominally
16 ⭐
A maximum-strength name parser for record linkage.
Cafs
13 ⭐
Content-Addressable File System (used by BitWrk)
Imagedups
15 ⭐
Image duplicate detection and image deduplication: find/delete duplicated images
Matchid Project Backend
10 ⭐
Backend (Docker & API) for matchID project
Neural Scam Artist
15 ⭐
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Vmchale Phash
10 ⭐
Perceptual hashing command-line tool
Dedup Resolve Webpack Plugin
10 ⭐
Webpack plugin that resolves copies of the same module in different locations to a single path.
Dwarfs
550 ⭐
A fast high compression read-only file system
Infinisil Soph
11 ⭐
Efficiently import pictures while handling duplicates gracefully
Mfdedup
10 ⭐
A Management Friendly Deduplication Prototype System for Backup
Postgresql Patterns Library
143 ⭐
A collection of ready-made SQL queries for PostgreSQL covering common tasks (retrieving and modifying data, database maintenance)
Bucketsync
10 ⭐
S3 backed FUSE Filesystem written in Go with dedup and encryption.
Marcnuth Deduplication
10 ⭐
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.