BruceDone · rumca-js · Jun 17, 2025
diff --git a/README.md b/README.md
@@ -1,56 +1,56 @@
 # Awesome-crawler ![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)
-A collection of awesome web crawler,spider and resources in different languages.
+Resources on web crawling and search engines, including how to access crawled pages.
 
-## Contents
+Awesome lists for web scraping already exist. Web crawling is slightly different and is often the more ethical practice.
+This list does not duplicate that web scraping material; it aims to add value beyond it.
 
+## Contents
+- [Resources](#resources)
 - [Python](#python)
 - [Java](#java)
 - [C#](#c)
 - [JavaScript](#javascript)
 - [PHP](#php)
 - [C++](#c-1)
-- [C](#c-2)
 - [Ruby](#ruby)
 - [Rust](#rust)
-- [R](#r)
 - [Erlang](#erlang)
-- [Perl](#perl)
 - [Go](#go)
 - [Scala](#scala)
+- [Other](#other)
+- [Libraries](#libraries)
+
+## Resources
+* [Awesome web scraping](https://github.com/lorien/awesome-web-scraping) - Resources, tools, and frameworks for web scraping.
+* [Common Crawl](https://commoncrawl.org/)
+* [Internet Archive](https://www.archive.org/)
+* [Anna's Archive](https://annas-archive.org/)
+* [Sci Hub](https://sci-hub.se/)
+* [Internet in a box](https://internet-in-a-box.org/)
+* [Internet Places Database](https://github.com/rumca-js/Internet-Places-Database) - crawled domain metadata archive
+
+Documents, papers
+* [Open Page Rank API](https://publicapi.dev/open-page-rank-api)
+* [Page Rank](https://en.wikipedia.org/wiki/PageRank)
 
 ## Python 
-* [Scrapy](https://github.com/scrapy/scrapy) - A fast high-level screen scraping and web crawling framework.
-    * [django-dynamic-scraper](https://github.com/holgerd77/django-dynamic-scraper) - Creating Scrapy scrapers via the Django admin interface.
-    * [Scrapy-Redis](https://github.com/rolando/scrapy-redis) - Redis-based components for Scrapy.
-    * [scrapy-cluster](https://github.com/istresearch/scrapy-cluster) - Uses Redis and Kafka to create a distributed on demand scraping cluster.
-    * [distribute_crawler](https://github.com/gnemoug/distribute_crawler) - Uses scrapy,redis, mongodb,graphite to create a distributed spider.
-* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
-* [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.
+* [distribute_crawler](https://github.com/gnemoug/distribute_crawler) - Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
+* [crawl4ai](https://github.com/unclecode/crawl4ai)
 * [cola](https://github.com/chineking/cola) - A distributed crawling framework.
-* [Demiurge](https://github.com/matiasb/demiurge) - PyQuery-based scraping micro-framework.
-* [Scrapely](https://github.com/scrapy/scrapely) - A pure-python HTML screen-scraping library.
-* [feedparser](http://pythonhosted.org/feedparser/) - Universal feed parser.
-* [you-get](https://github.com/soimort/you-get) -  Dumb downloader that scrapes the web.
-* [MechanicalSoup](https://github.com/hickford/MechanicalSoup) - A Python library for automating interaction with websites.
-* [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
 * [crawley](https://github.com/jmg/crawley) - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
-* [RoboBrowser](https://github.com/jmcarp/robobrowser) - A simple, Pythonic library for browsing the web without a standalone web browser.
-* [MSpider](https://github.com/manning23/MSpider) - A simple ,easy spider using gevent and js render. 
-* [brownant](https://github.com/douban/brownant) - A lightweight web data extracting framework.
-* [PSpider](https://github.com/xianhu/PSpider) - A simple spider frame in Python3.
 * [Gain](https://github.com/gaojiuli/gain) - Web crawling framework based on asyncio for everyone.
 * [sukhoi](https://github.com/iogf/sukhoi) - Minimalist and powerful Web Crawler.
 * [spidy](https://github.com/rivermont/spidy) - The simple, easy to use command line web crawler. 
-* [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3
-* [aspider](https://github.com/howie6879/aspider) - An async web scraping micro-framework based on asyncio. 
+* [tiny-web-crawler](https://github.com/DataCrawl-AI/datacrawl) - A simple and efficient web crawler for Python.
+* [crawler-buddy](https://github.com/rumca-js/crawler-buddy) - A crawling server with a JSON interface.
+* [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.
 
 ## Java
+* [Marginalia search](https://github.com/MarginaliaSearch/MarginaliaSearch) - Marginalia search crawler
 * [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
 * [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environment.
-    * [anthelion](https://github.com/yahoo/anthelion) - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
+* [Apache StormCrawler](https://stormcrawler.apache.org/)
 * [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
-* [JSoup](http://jsoup.org/) - Scrapes, parses, manipulates and cleans HTML.
-* [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML information extraction.
 * [Open Search Server](http://www.opensearchserver.com/) - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
 * [Gecco](https://github.com/xtuhcy/gecco) - A easy to use lightweight web crawler
 * [WebCollector](https://github.com/CrawlScript/WebCollector) - Simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.
@@ -62,34 +62,27 @@ A collection of awesome web crawler,spider and resources in different languages.
 * [StormCrawler](http://github.com/DigitalPebble/storm-crawler/) - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
 * [Spark-Crawler](https://github.com/USCDataScience/sparkler) - Evolving Apache Nutch to run on Spark.
 * [webBee](https://github.com/pkwenda/webBee) - A DFS web spider.
-* [spider-flow](https://github.com/ssssssss-team/spider-flow) - A visual spider framework, it's so good that you don't need to write any code to crawl the website.
 * [Norconex Web Crawler](https://github.com/Norconex/collector-http) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.
-
+* [Scrapegraph-ai](https://github.com/VinciGit00/Scrapegraph-ai) - A Python library that uses LLMs and graph logic to build web scraping pipelines.
 
 ## C# 
 * [ccrawler](http://www.findbestopensource.com/product/ccrawler) - Built in C# 3.5 version. it contains a simple extension of web content categorizer, which can separate between the web page depending on their content.
 * [SimpleCrawler](https://github.com/lei-zhu/SimpleCrawler) - Simple spider base on mutithreading, regluar expression.
-* [DotnetSpider](https://github.com/zlzforever/DotnetSpider) - This is a cross platfrom, ligth spider develop by C#.
 * [Abot](https://github.com/sjdirect/abot) - C# web crawler built for speed and flexibility.
 * [Hawk](https://github.com/ferventdesert/Hawk) - Advanced Crawler and ETL tool written in C#/WPF.
-* [SkyScraper](https://github.com/JonCanning/SkyScraper) - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
 * [Infinity Crawler](https://github.com/TurnerSoftware/InfinityCrawler) - A simple but powerful web crawler library in C#.
 
 ## JavaScript
-* [scraperjs](https://github.com/ruipgil/scraperjs) - A complete and versatile web scraper.
-* [scrape-it](https://github.com/IonicaBizau/scrape-it) - A Node.js scraper for humans.
+* [browsertrix crawler](https://github.com/webrecorder/browsertrix-crawler)
 * [simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Event driven web crawler.
 * [node-crawler](https://github.com/bda-research/node-crawler) - Node-crawler has clean,simple api.
 * [js-crawler](https://github.com/antivanov/js-crawler) - Web crawler for Node.JS, both HTTP and HTTPS are supported.
 * [webster](https://github.com/zhuyingda/webster) - A reliable web crawling framework which can scrape ajax and js rendered content in a web page.
 * [x-ray](https://github.com/lapwinglabs/x-ray) - Web scraper with pagination and crawler support.
-* [node-osmosis](https://github.com/rchipka/node-osmosis) - HTML/XML parser and web scraper for Node.js.
-* [web-scraper-chrome-extension](https://github.com/martinsbalodis/web-scraper-chrome-extension) - Web data extraction tool implemented as chrome extension.
 * [supercrawler](https://github.com/brendonboshell/supercrawler) - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits. 
 * [headless-chrome-crawler](https://github.com/yujiosaka/headless-chrome-crawler) - Headless Chrome crawls with jQuery support
 * [Squidwarc](https://github.com/n0tan3rd/squidwarc) - High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
-* [crawlee](https://github.com/apify/crawlee) - A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. 
-
+* [Sasori](https://github.com/karthikuj/sasori) - A dynamic web crawler built on Puppeteer with support for authentication.
 
 ## PHP
 * [Goutte](https://github.com/FriendsOfPHP/Goutte) - A screen scraping and web crawling library for PHP.
@@ -104,48 +97,40 @@ A collection of awesome web crawler,spider and resources in different languages.
 
 ## C++
 * [open-source-search-engine](https://github.com/gigablast/open-source-search-engine) - A distributed open source search engine and spider/crawler written in C/C++.
-
-## C
-* [httrack](https://github.com/xroche/httrack) - Copy websites to your computer.
+* [SpiderSuite](https://github.com/3nock/SpiderSuite) - An advanced, cross-platform web security crawler built with the C++ Qt framework.
 
 ## Ruby
 * [Nokogiri](https://github.com/sparklemotion/nokogiri) - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
 * [upton](https://github.com/propublica/upton) - A batteries-included framework for easy web-scraping. Just add CSS(Or do more).
 * [wombat](https://github.com/felipecsl/wombat) - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
 * [RubyRetriever](https://github.com/joenorton/rubyretriever) - RubyRetriever is a Web Crawler, Scraper & File Harvester.
-* [Spidr](https://github.com/postmodern/spidr) - Spider a site, multiple domains, certain links or infinitely.
 * [Cobweb](https://github.com/stewartmckee/cobweb) - Web crawler with very flexible crawling options, standalone or using sidekiq.
 * [mechanize](https://github.com/sparklemotion/mechanize) - Automated web interaction & crawling.
 
 ## Rust
 * [spider](https://github.com/spider-rs/spider) - The fastest web crawler and indexer.
 * [crawler](https://github.com/a11ywatch/crawler) - A gRPC web indexer turbo charged for performance.
 
-## R
-* [rvest](https://github.com/hadley/rvest) - Simple web scraping for R.
-
 ## Erlang 
 * [ebot](https://github.com/matteoredaelli/ebot) - A scalable, distribuited and highly configurable web cawler.
 
-## Perl
-* [web-scraper](https://github.com/miyagawa/web-scraper) - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.
-
-## Go
+## Go 
 * [pholcus](https://github.com/henrylee2cn/pholcus) -  A distributed, high concurrency and powerful web crawler.
 * [gocrawl](https://github.com/PuerkitoBio/gocrawl) - Polite, slim and concurrent web crawler.
 * [fetchbot](https://github.com/PuerkitoBio/fetchbot) - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
 * [go_spider](https://github.com/hu17889/go_spider) - An awesome Go concurrent Crawler(spider) framework. 
 * [dht](https://github.com/shiyanhui/dht) - BitTorrent DHT Protocol && DHT Spider.
 * [ants-go](https://github.com/wcong/ants-go) - A open source, distributed, restful crawler engine in golang.
-* [scrape](https://github.com/yhat/scrape) - A simple, higher level interface for Go web scraping.
 * [creeper](https://github.com/wspl/creeper) - The Next Generation Crawler Framework (Go).
-* [colly](https://github.com/asciimoo/colly) - Fast and Elegant Scraping Framework for Gophers.
-* [ferret](https://github.com/MontFerret/ferret) - Declarative web scraping.
-* [Dataflow kit](https://github.com/slotix/dataflowkit) - Extract structured data from web pages. Web sites scraping.
 * [Hakrawler](https://github.com/hakluke/hakrawler) - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
 
-
 ## Scala
 * [crawler](https://github.com/bplawler/crawler) - Scala DSL for web crawling.
 * [scrala](https://github.com/gaocegege/scrala) - Scala crawler(spider) framework, inspired by scrapy.
 * [ferrit](https://github.com/reggoodwin/ferrit) - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
+
+## Other
+* [Easy Spider](https://github.com/NaiboWang/EasySpider)
+
+## Libraries
+* [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3