74 Open Source Deduplication Software Projects
Free and open source deduplication code projects including engines, APIs, generators, and tools.
Libpostal 3326 ⭐
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Yomguithereal Talisman 625 ⭐
Kopia 1337 ⭐
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Kvdo 179 ⭐
A pair of kernel modules which provide pools of deduplicated and/or compressed block storage.
Mattilyra Lsh 211 ⭐
Locality Sensitive Hashing using MinHash in Python/Cython to detect near-duplicate text documents
Fingerprints 96 ⭐
Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation.
Intraarchivededuplicator 85 ⭐
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Elemental Lf Benji 109 ⭐
Benji Backup: block-based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
Record Linkage Resources 65 ⭐
Resources for tackling record linkage / deduplication / data matching problems
Deduplication 51 ⭐
Splink 141 ⭐
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Umicollapse 31 ⭐
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
Dupandas 22 ⭐
Python package for performing deduplication using flexible text matching and cleaning in pandas DataFrames
Marty 12 ⭐
An efficient backup tool inspired by Git, saving your bandwidth and providing global deduplication at file level.
Cargo Limit 81 ⭐
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Entity Embed 90 ⭐
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Gencore 88 ⭐
Generate duplex/single consensus reads to reduce sequencing noise and remove duplicates
Zpaqfranz 25 ⭐
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Neural Scam Artist 15 ⭐
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Dedup Resolve Webpack Plugin 10 ⭐
Webpack plugin that resolves copies of the same module in different locations to a single path.
Postgresql Patterns Library 142 ⭐
A collection of ready-made SQL queries for PostgreSQL covering frequently encountered tasks (retrieving and modifying data, database maintenance)