130 Open Source Crawling Software Projects
Free and open source crawling code projects including engines, APIs, generators, and tools.
Apify JS 3181 ⭐
Gopa 288 ⭐
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Squidwarc 140 ⭐
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, either headful or headless.
Linkedin Profile Scraper 237 ⭐
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Dotnetcrawler 116 ⭐
DotnetCrawler is a straightforward, lightweight web crawling/scraping library built on .NET Core, with Entity Framework Core output. It is designed like other strong crawler libraries such as WebMagic and Scrapy, but can be extended to fit your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Proxifier 76 ⭐
A fast, modern and intelligent proxy rotator perfect for crawling and scraping public data.
Jonasjacek Robots.txt 69 ⭐
Simple robots.txt template. Keeps unwanted robots out (disallow) and whitelists (allow) legitimate user-agents. Useful for all websites.
Learn.scrapinghub.com 52 ⭐
Scrapinghub Learning Center. Report issues in Jira: https://scrapinghub.atlassian.net/projects/WEB
Diffbot Php Client 53 ⭐
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Tech Seo Crawler 49 ⭐
Build a small, three-domain internet using GitHub Pages and Wikipedia, and construct a crawler to crawl, render, and index it.
Argus 67 ⭐
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Crawlkit 23 ⭐
A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.
Xxx___dead___xxx 21 ⭐
blogging in the dark
Baiduspider 28 ⭐
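The project has moved to https://github.com/BaiduSpider/BaiduSpider. A crawler that scrapes Baidu search results. It currently supports Baidu web search, Baidu image search, Baidu Zhidao (Q&A) search, Baidu video search, Baidu news search, Baidu Wenku (document library) search, Baidu Jingyan (how-to) search, and Baidu Baike (encyclopedia) search.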
Serritor 23 ⭐
Img Cli 14 ⭐
An interactive command-line interface built in Node.js for downloading one or more images to disk from a URL.
Datamart 26 ⭐
Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index
Wget Lua 46 ⭐
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Web Search Engine UIC 14 ⭐
CS 582 Information Retrieval at the University of Illinois at Chicago. Multithreaded crawling of the UIC domain, inverted index, PageRank, and SEO with context pseudo-relevance feedback.
Double Agent 66 ⭐
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
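Several entries above deal with robots.txt rules (disallowing unwanted robots, allowing legitimate user-agents). A minimal Python sketch of how a crawler might check such rules before fetching a page, using the standard library's `urllib.robotparser`. The robots.txt content, the bot names `BadBot` and `GoodBot`, and the `example.com` URLs are hypothetical and are not taken from any project listed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut one unwanted bot out entirely,
# and keep every other user-agent out of /private/ only.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
blocked_bot = is_allowed(parser, "BadBot", "https://example.com/index.html")
good_public = is_allowed(parser, "GoodBot", "https://example.com/index.html")
good_private = is_allowed(parser, "GoodBot", "https://example.com/private/data")
```

A polite crawler would make this check once per URL before each request; a real deployment would more likely fetch the live file with `set_url()` and `read()` rather than parse an inline string.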