74 Open Source Deduplication Software Projects
Free and open source deduplication code projects including engines, APIs, generators, and tools.
Libpostal 3351 ⭐
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Yomguithereal Talisman 626 ⭐
Kopia 1399 ⭐
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Kvdo 180 ⭐
A pair of kernel modules which provide pools of deduplicated and/or compressed block storage.
Mattilyra Lsh 212 ⭐
Locality-sensitive hashing using MinHash in Python/Cython to detect near-duplicate text documents.
Fingerprints 98 ⭐
Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation.
Intraarchivededuplicator 85 ⭐
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Elemental Lf Benji 112 ⭐
Benji Backup: block-based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices.
Record Linkage Resources 66 ⭐
Resources for tackling record linkage / deduplication / data matching problems
Deduplication 51 ⭐
Splink 147 ⭐
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Umicollapse 31 ⭐
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
Dupandas 22 ⭐
Python package for performing deduplication using flexible text matching and cleaning in pandas DataFrames.
Marty 12 ⭐
An efficient backup tool inspired by Git, saving your bandwidth and providing global deduplication at file level.
Cargo Limit 82 ⭐
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Entity Embed 92 ⭐
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Gencore 90 ⭐
Generate duplex/single consensus reads to reduce sequencing noise and remove duplicates.
Zpaqfranz 30 ⭐
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Neural Scam Artist 15 ⭐
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Dedup Resolve Webpack Plugin 10 ⭐
Webpack plugin that resolves copies of the same module in different locations to a single path.
Postgresql Patterns Library 143 ⭐
A collection of ready-made SQL queries for PostgreSQL covering frequently encountered tasks (retrieving and modifying data, database maintenance).
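Several of the projects listed above (Mattilyra Lsh, for example) detect near-duplicate text with MinHash. The sketch below is only a rough illustration of that general idea using the Python standard library; it is not code from any listed project. The function names (`shingles`, `minhash_signature`, `estimated_jaccard`), the 3-word shingle size and the 64 hash functions are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """Build a MinHash signature: for each seeded hash function,
    keep the smallest hash value seen over all items in the set."""
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed gives a different hash function (illustrative choice).
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "completely unrelated sentence about database backup schedules and tape drives"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
sig_c = minhash_signature(shingles(doc_c))

print("a vs b:", estimated_jaccard(sig_a, sig_b))  # near-duplicates score high
print("a vs c:", estimated_jaccard(sig_a, sig_c))  # unrelated texts score low
```

A full locality-sensitive hashing setup would additionally split each signature into bands and bucket documents by band, so that only documents sharing a bucket are compared; that step is omitted here for brevity.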