100 Open Source Crawling Software Projects
Free and open source crawling code projects including engines, APIs, generators, and tools.
Apify JS 2555 ⭐
Gopa 274 ⭐
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Squidwarc 113 ⭐
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, with or without a head.
Linkedin Profile Scraper 107 ⭐
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Dotnetcrawler 87 ⭐
DotnetCrawler is a straightforward, lightweight web crawling/scraping library based on .NET Core, with Entity Framework Core output. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable to fit your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Proxifier 58 ⭐
A fast, modern and intelligent proxy rotator perfect for crawling and scraping public data.
Jonasjacek Robots.txt 56 ⭐
Simple robots.txt template. Keeps unwanted robots out (disallow) and whitelists (allow) legitimate user-agents. Useful for all websites.
Learn.scrapinghub.com 50 ⭐
Scrapinghub Learning Center. Report issues in Jira: https://scrapinghub.atlassian.net/projects/WEB
Diffbot Php Client 50 ⭐
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Tech Seo Crawler 36 ⭐
Build a small, three-domain internet using GitHub Pages and Wikipedia, and construct a crawler to crawl, render, and index it.
Argus 42 ⭐
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: http://ftp.zew.de/pub/zew-docs/dp/dp18033.pdf
Crawlkit 23 ⭐
A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.
Xxx___dead___xxx 22 ⭐
b̶̡̪̬͒l̸̰̗̝̀ỏ̷̡̩g̴͇̑g̶̲̱̽͐i̵̹͗n̶̤̥͂̅̆g̴̮̾̅͜ ̷̧͎͆i̷̛͒͜͠n̸̥̺͒ ̶͚͚͊̿͜t̸̺͙̭̆̊̈́ḧ̶̟́̐e̸̱͔̟̓̓͝ ̶̨͔̾͛̑d̵̥̣̏ȧ̷̼̊r̷̰̝̥̅̌͝k̵̟̥̞̉̍͛
Baiduspider 26 ⭐
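The project has moved to https://github.com/BaiduSpider/BaiduSpider! A crawler that scrapes Baidu search results. It currently supports Baidu web search, Baidu image search, Baidu Zhidao search, Baidu video search, Baidu news search, Baidu Wenku search, Baidu Jingyan search, and Baidu Baike search.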
Serritor 14 ⭐
Img Cli 12 ⭐
An interactive command-line interface built in Node.js for downloading one or more images to disk from a URL.
Datamart 11 ⭐
Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index
Wget Lua 14 ⭐
Modern wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Web Search Engine UIC 10 ⭐
Project for CS 582 Information Retrieval at the University of Illinois at Chicago. Multithreaded crawling of the UIC domain, inverted index, PageRank, SEO with Context Pseudo-Relevance Feedback.
Double Agent 10 ⭐
A test suite of common scraper detection techniques. See how detectable your scraper stack is.