52 Open Source Deduplication Software Projects
Free and open source deduplication code projects including engines, APIs, generators, and tools.
Libpostal 2912 ⭐
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Yomguithereal Talisman 571 ⭐
Kopia 341 ⭐
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Kvdo 162 ⭐
A pair of kernel modules that provide pools of deduplicated and/or compressed block storage.
Mattilyra Lsh 162 ⭐
Locality-sensitive hashing using MinHash in Python/Cython to detect near-duplicate text documents.
Fingerprints 85 ⭐
Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation.
Intraarchivededuplicator 74 ⭐
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Elemental Lf Benji 65 ⭐
Benji Backup: a block-based deduplicating backup tool for Ceph RBD images, iSCSI targets, image files, and block devices.
Record Linkage Resources 47 ⭐
Resources for tackling record linkage, deduplication, and data matching problems.
Deduplication 43 ⭐
Splink 46 ⭐
Implementation in Apache Spark of the EM algorithm to estimate parameters of Fellegi-Sunter's canonical model of record linkage.
Umicollapse 21 ⭐
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI).
Dupandas 18 ⭐
Python package for performing deduplication using flexible text matching and cleaning in pandas DataFrames.
Marty 11 ⭐
An efficient backup tool inspired by Git, saving your bandwidth and providing global deduplication at file level.
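Several of the backup tools listed above describe deduplication at the file or block level. As a rough illustration of the general idea (not the implementation of Marty or any other project listed here), the sketch below stores each distinct file content once, keyed by its SHA-256 digest, and keeps a separate path-to-digest index. The function names `content_digest` and `deduplicate` and the in-memory `files` mapping are hypothetical, chosen only for this example.

```python
import hashlib
from typing import Dict, Tuple


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the content address."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(files: Dict[str, bytes]) -> Tuple[Dict[str, bytes], Dict[str, str]]:
    """Store each distinct content once and map every path to its digest.

    `files` is a hypothetical in-memory stand-in for a directory tree;
    a real tool would stream data from disk instead.
    """
    store: Dict[str, bytes] = {}   # digest -> content, one copy per distinct content
    index: Dict[str, str] = {}     # path -> digest
    for path, data in files.items():
        digest = content_digest(data)
        store.setdefault(digest, data)  # keep the first copy, skip repeats
        index[path] = digest
    return store, index


files = {
    "a.txt": b"hello",
    "copy_of_a.txt": b"hello",  # identical content, different path
    "b.txt": b"world",
}
store, index = deduplicate(files)
print(len(files), "files stored as", len(store), "unique blobs")
```

Restoring a file under this scheme means looking up its digest in `index` and reading the matching entry from `store`. Block-level tools apply the same idea to fixed-size or content-defined chunks rather than whole files.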