From 2d307afc94e8b2a81531852f5f7cba345bfb91ee Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 09:21:33 +0200
Subject: [PATCH 01/26] Added crawler-buddy

---
 README.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 2fae14f..7bd0340 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,8 @@ A collection of awesome web crawler,spider and resources in different languages.
 * [sukhoi](https://github.com/iogf/sukhoi) - Minimalist and powerful Web Crawler.
 * [spidy](https://github.com/rivermont/spidy) - The simple, easy to use command line web crawler.
 * [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3
-* [aspider](https://github.com/howie6879/aspider) - An async web scraping micro-framework based on asyncio.
+* [aspider](https://github.com/howie6879/aspider) - An async web scraping micro-framework based on asyncio.
+* [crawler-buddy](https://github.com/rumca-js/crawler-buddy) - A crawling server with JSON interface

 ## Java
 * [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
@@ -130,7 +131,7 @@ A collection of awesome web crawler,spider and resources in different languages.
 ## Perl
 * [web-scraper](https://github.com/miyagawa/web-scraper) - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

-## Go
+## Go
 * [pholcus](https://github.com/henrylee2cn/pholcus) - A distributed, high concurrency and powerful web crawler.
 * [gocrawl](https://github.com/PuerkitoBio/gocrawl) - Polite, slim and concurrent web crawler.
 * [fetchbot](https://github.com/PuerkitoBio/fetchbot) - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
@@ -149,3 +150,6 @@ A collection of awesome web crawler,spider and resources in different languages.
 * [crawler](https://github.com/bplawler/crawler) - Scala DSL for web crawling.
 * [scrala](https://github.com/gaocegege/scrala) - Scala crawler(spider) framework, inspired by scrapy.
 * [ferrit](https://github.com/reggoodwin/ferrit) - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
+
+## Crawl Archives
+* [Internet Places Database](https://github.com/rumca-js/Internet-Places-Database) - crawled domain metadata archive

From 0bafcfdae36d82d83d29d448d7e2051f85ad953a Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 09:27:05 +0200
Subject: [PATCH 02/26] Added awesome web scraping

---
 README.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 7bd0340..120cc44 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 A collection of awesome web crawler,spider and resources in different languages.

 ## Contents
-
+- [Resources](#resources)
 - [Python](#python)
 - [Java](#java)
 - [C#](#c)
@@ -18,6 +18,9 @@ A collection of awesome web crawler,spider and resources in different languages.
 - [Go](#go)
 - [Scala](#scala)

+## Resources
+* [Awesome web scraping](https://github.com/lorien/awesome-web-scraping)
+
 ## Python
 * [Scrapy](https://github.com/scrapy/scrapy) - A fast high-level screen scraping and web crawling framework.
     * [django-dynamic-scraper](https://github.com/holgerd77/django-dynamic-scraper) - Creating Scrapy scrapers via the Django admin interface.

From 26887a5df62774751b4fb1316517dbf19f606bc1 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 09:29:13 +0200
Subject: [PATCH 03/26] Added marginalia search

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 120cc44..1131ef5 100644
--- a/README.md
+++ b/README.md
@@ -68,6 +68,7 @@ A collection of awesome web crawler,spider and resources in different languages.
 * [webBee](https://github.com/pkwenda/webBee) - A DFS web spider.
 * [spider-flow](https://github.com/ssssssss-team/spider-flow) - A visual spider framework, it's so good that you don't need to write any code to crawl the website.
 * [Norconex Web Crawler](https://github.com/Norconex/collector-http) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.
+* [Marginalia search](https://github.com/MarginaliaSearch/MarginaliaSearch) - Marginalia search crawler


 ## C#

From b09e7f5f38c52a3239bc03ca366cb6f2a5e5e1c3 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 09:46:41 +0200
Subject: [PATCH 04/26] Added tiny web crawler

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1131ef5..f294178 100644
--- a/README.md
+++ b/README.md
@@ -45,7 +45,8 @@ A collection of awesome web crawler,spider and resources in different languages.
 * [sukhoi](https://github.com/iogf/sukhoi) - Minimalist and powerful Web Crawler.
 * [spidy](https://github.com/rivermont/spidy) - The simple, easy to use command line web crawler.
 * [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3
-* [aspider](https://github.com/howie6879/aspider) - An async web scraping micro-framework based on asyncio.
+* [aspider](https://github.com/howie6879/aspider) - An async web scraping micro-framework based on asyncio
+* [tiny-web-crawler](https://github.com/indrajithi/tiny-web-crawler) - A simple and efficient web crawler for Python.
 * [crawler-buddy](https://github.com/rumca-js/crawler-buddy) - A crawling server with JSON interface

 ## Java

From 970bdc5b60a59ca27e7d658ad8fae9f681dbf990 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 09:47:46 +0200
Subject: [PATCH 05/26] Added sasori

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f294178..2aeb4ae 100644
--- a/README.md
+++ b/README.md
@@ -95,7 +95,7 @@ A collection of awesome web crawler,spider and resources in different languages.
 * [headless-chrome-crawler](https://github.com/yujiosaka/headless-chrome-crawler) - Headless Chrome crawls with jQuery support
 * [Squidwarc](https://github.com/n0tan3rd/squidwarc) - High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
 * [crawlee](https://github.com/apify/crawlee) - A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
-
+* [Sasori](https://github.com/karthikuj/sasori) - A dynamic web crawler built on Puppeteer with support for authentication.

 ## PHP
 * [Goutte](https://github.com/FriendsOfPHP/Goutte) - A screen scraping and web crawling library for PHP.

From b9010762ab48ce170709a6b428e462f037705ac6 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 09:48:28 +0200
Subject: [PATCH 06/26] Added Scrapegraph-ai

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 2aeb4ae..3ade863 100644
--- a/README.md
+++ b/README.md
@@ -70,7 +70,7 @@ A collection of awesome web crawler,spider and resources in different languages.
 * [spider-flow](https://github.com/ssssssss-team/spider-flow) - A visual spider framework, it's so good that you don't need to write any code to crawl the website.
 * [Norconex Web Crawler](https://github.com/Norconex/collector-http) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.
 * [Marginalia search](https://github.com/MarginaliaSearch/MarginaliaSearch) - Marginalia search crawler
-
+* [Scrapegraph-ai](https://github.com/VinciGit00/Scrapegraph-ai) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.

 ## C#
 * [ccrawler](http://www.findbestopensource.com/product/ccrawler) - Built in C# 3.5 version. it contains a simple extension of web content categorizer, which can separate between the web page depending on their content.

From 5e216a517e11b3d929abd7a75bcd7b017b19043a Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 09:49:27 +0200
Subject: [PATCH 07/26] Added botasaurus

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 3ade863..a8b22df 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,7 @@ A collection of awesome web crawler,spider and resources in different languages.
 * [aspider](https://github.com/howie6879/aspider) - An async web scraping micro-framework based on asyncio
 * [tiny-web-crawler](https://github.com/indrajithi/tiny-web-crawler) - A simple and efficient web crawler for Python.
 * [crawler-buddy](https://github.com/rumca-js/crawler-buddy) - A crawling server with JSON interface
+* [Botasaurus](https://github.com/omkarcloud/botasaurus)

 ## Java
 * [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
From ff6ca6693cd92060820dd0eccc7b96284a634ba1 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 09:50:25 +0200
Subject: [PATCH 08/26] Added spider suite

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index a8b22df..0760398 100644
--- a/README.md
+++ b/README.md
@@ -111,6 +111,7 @@ A collection of awesome web crawler,spider and resources in different languages.

 ## C++
 * [open-source-search-engine](https://github.com/gigablast/open-source-search-engine) - A distributed open source search engine and spider/crawler written in C/C++.
+* [SpiderSuite](https://github.com/3nock/SpiderSuite) - An advance, cross-platform web security crawler. Built using C++ Qt framework.

 ## C
 * [httrack](https://github.com/xroche/httrack) - Copy websites to your computer.

From c512b23e1667fb8e1c86426857e92f9be9ea597f Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 09:52:18 +0200
Subject: [PATCH 09/26] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 0760398..ba51088 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
 # Awesome-crawler ![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)
 A collection of awesome web crawler,spider and resources in different languages.
+There is already awesome web scraping information, but web crawling is something slightly different, often more ethical thing.
+We do not have to duplicate awesome web scraping data. Therefore we provide something that adds value.

 ## Contents
 - [Resources](#resources)

From 1a992de89ad6ff85ec77d18c73e40de32ddf6a45 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 14:02:20 +0200
Subject: [PATCH 10/26] Update README.md

---
 README.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index ba51088..dc727cb 100644
--- a/README.md
+++ b/README.md
@@ -22,6 +22,12 @@ We do not have to duplicate awesome web scraping data. Therefore we provide some

 ## Resources
 * [Awesome web scraping](https://github.com/lorien/awesome-web-scraping)
+* [Common Crawl](https://commoncrawl.org/)
+* [Internet Archive](https://www.archive.org/)
+* [Anna's Archive](https://annas-archive.org/)
+* [Sci Hub](https://sci-hub.se/)
+* [https://internet-in-a-box.org/](Internet in a box)
+* [Internet Places Database](https://github.com/rumca-js/Internet-Places-Database) - crawled domain metadata archive

 ## Python
 * [Scrapy](https://github.com/scrapy/scrapy) - A fast high-level screen scraping and web crawling framework.
@@ -34,7 +40,6 @@ We do not have to duplicate awesome web scraping data. Therefore we provide some
 * [cola](https://github.com/chineking/cola) - A distributed crawling framework.
 * [Demiurge](https://github.com/matiasb/demiurge) - PyQuery-based scraping micro-framework.
 * [Scrapely](https://github.com/scrapy/scrapely) - A pure-python HTML screen-scraping library.
-* [feedparser](http://pythonhosted.org/feedparser/) - Universal feed parser.
 * [you-get](https://github.com/soimort/you-get) - Dumb downloader that scrapes the web.
 * [MechanicalSoup](https://github.com/hickford/MechanicalSoup) - A Python library for automating interaction with websites.
 * [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
@@ -154,11 +159,7 @@ We do not have to duplicate awesome web scraping data. Therefore we provide some
 * [Dataflow kit](https://github.com/slotix/dataflowkit) - Extract structured data from web pages. Web sites scraping.
 * [Hakrawler](https://github.com/hakluke/hakrawler) - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

-
 ## Scala
 * [crawler](https://github.com/bplawler/crawler) - Scala DSL for web crawling.
 * [scrala](https://github.com/gaocegege/scrala) - Scala crawler(spider) framework, inspired by scrapy.
 * [ferrit](https://github.com/reggoodwin/ferrit) - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
-
-## Crawl Archives
-* [Internet Places Database](https://github.com/rumca-js/Internet-Places-Database) - crawled domain metadata archive

From cb0131bc455f06a5c22a76e9b31bcb7efc87bb7e Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 14:03:24 +0200
Subject: [PATCH 11/26] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index dc727cb..40e20ba 100644
--- a/README.md
+++ b/README.md
@@ -28,6 +28,8 @@ We do not have to duplicate awesome web scraping data. Therefore we provide some
 * [Sci Hub](https://sci-hub.se/)
 * [https://internet-in-a-box.org/](Internet in a box)
 * [Internet Places Database](https://github.com/rumca-js/Internet-Places-Database) - crawled domain metadata archive
+* [Open Page Rank API](https://publicapi.dev/open-page-rank-api)
+* [Page Rank](https://en.wikipedia.org/wiki/PageRank)

 ## Python
 * [Scrapy](https://github.com/scrapy/scrapy) - A fast high-level screen scraping and web crawling framework.

From 0fd62e889b0d658bb4a2d7f5a5c295d670aed619 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 14:06:04 +0200
Subject: [PATCH 12/26] Update README.md

---
 README.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 40e20ba..17efbdd 100644
--- a/README.md
+++ b/README.md
@@ -21,13 +21,15 @@ We do not have to duplicate awesome web scraping data. Therefore we provide some
 - [Scala](#scala)

 ## Resources
-* [Awesome web scraping](https://github.com/lorien/awesome-web-scraping)
+* [Awesome web scraping](https://github.com/lorien/awesome-web-scraping) - resources about web scraping
 * [Common Crawl](https://commoncrawl.org/)
 * [Internet Archive](https://www.archive.org/)
 * [Anna's Archive](https://annas-archive.org/)
 * [Sci Hub](https://sci-hub.se/)
-* [https://internet-in-a-box.org/](Internet in a box)
+* [Internet in a box](https://internet-in-a-box.org/)
 * [Internet Places Database](https://github.com/rumca-js/Internet-Places-Database) - crawled domain metadata archive
+
+Documents, papers
 * [Open Page Rank API](https://publicapi.dev/open-page-rank-api)
 * [Page Rank](https://en.wikipedia.org/wiki/PageRank)

From bdfa189b1cf83c70213292069228fa0883ae6463 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 14:10:56 +0200
Subject: [PATCH 13/26] Update README.md

---
 README.md | 29 +++--------------------------
 1 file changed, 3 insertions(+), 26 deletions(-)

diff --git a/README.md b/README.md
index 17efbdd..8de9b32 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,6 @@
 # Awesome-crawler ![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)
-A collection of awesome web crawler,spider and resources in different languages.
+Information about web crawling and search engines (about how to access crawled pages).
+
 There is already awesome web scraping information, but web crawling is something slightly different, often more ethical thing.
 We do not have to duplicate awesome web scraping data. Therefore we provide something that adds value.

@@ -34,39 +35,22 @@ Documents, papers
 * [Page Rank](https://en.wikipedia.org/wiki/PageRank)

 ## Python
-* [Scrapy](https://github.com/scrapy/scrapy) - A fast high-level screen scraping and web crawling framework.
-    * [django-dynamic-scraper](https://github.com/holgerd77/django-dynamic-scraper) - Creating Scrapy scrapers via the Django admin interface.
-    * [Scrapy-Redis](https://github.com/rolando/scrapy-redis) - Redis-based components for Scrapy.
-    * [scrapy-cluster](https://github.com/istresearch/scrapy-cluster) - Uses Redis and Kafka to create a distributed on demand scraping cluster.
-    * [distribute_crawler](https://github.com/gnemoug/distribute_crawler) - Uses scrapy,redis, mongodb,graphite to create a distributed spider.
-* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
+* [distribute_crawler](https://github.com/gnemoug/distribute_crawler) - Uses scrapy,redis, mongodb,graphite to create a distributed spider.
 * [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.
 * [cola](https://github.com/chineking/cola) - A distributed crawling framework.
-* [Demiurge](https://github.com/matiasb/demiurge) - PyQuery-based scraping micro-framework.
-* [Scrapely](https://github.com/scrapy/scrapely) - A pure-python HTML screen-scraping library.
-* [you-get](https://github.com/soimort/you-get) - Dumb downloader that scrapes the web.
-* [MechanicalSoup](https://github.com/hickford/MechanicalSoup) - A Python library for automating interaction with websites.
-* [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
 * [crawley](https://github.com/jmg/crawley) - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
-* [RoboBrowser](https://github.com/jmcarp/robobrowser) - A simple, Pythonic library for browsing the web without a standalone web browser.
-* [MSpider](https://github.com/manning23/MSpider) - A simple ,easy spider using gevent and js render.
-* [brownant](https://github.com/douban/brownant) - A lightweight web data extracting framework.
-* [PSpider](https://github.com/xianhu/PSpider) - A simple spider frame in Python3.
 * [Gain](https://github.com/gaojiuli/gain) - Web crawling framework based on asyncio for everyone.
 * [sukhoi](https://github.com/iogf/sukhoi) - Minimalist and powerful Web Crawler.
 * [spidy](https://github.com/rivermont/spidy) - The simple, easy to use command line web crawler.
 * [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3
-* [aspider](https://github.com/howie6879/aspider) - An async web scraping micro-framework based on asyncio
 * [tiny-web-crawler](https://github.com/indrajithi/tiny-web-crawler) - A simple and efficient web crawler for Python.
 * [crawler-buddy](https://github.com/rumca-js/crawler-buddy) - A crawling server with JSON interface
-* [Botasaurus](https://github.com/omkarcloud/botasaurus)

 ## Java
 * [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
 * [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environment.
     * [anthelion](https://github.com/yahoo/anthelion) - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
 * [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
-* [JSoup](http://jsoup.org/) - Scrapes, parses, manipulates and cleans HTML.
 * [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML information extraction.
 * [Open Search Server](http://www.opensearchserver.com/) - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
 * [Gecco](https://github.com/xtuhcy/gecco) - A easy to use lightweight web crawler
@@ -79,7 +63,6 @@ Documents, papers
 * [StormCrawler](http://github.com/DigitalPebble/storm-crawler/) - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
 * [Spark-Crawler](https://github.com/USCDataScience/sparkler) - Evolving Apache Nutch to run on Spark.
 * [webBee](https://github.com/pkwenda/webBee) - A DFS web spider.
-* [spider-flow](https://github.com/ssssssss-team/spider-flow) - A visual spider framework, it's so good that you don't need to write any code to crawl the website.
 * [Norconex Web Crawler](https://github.com/Norconex/collector-http) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.
 * [Marginalia search](https://github.com/MarginaliaSearch/MarginaliaSearch) - Marginalia search crawler
 * [Scrapegraph-ai](https://github.com/VinciGit00/Scrapegraph-ai) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.
@@ -87,10 +70,8 @@ Documents, papers
 ## C#
 * [ccrawler](http://www.findbestopensource.com/product/ccrawler) - Built in C# 3.5 version. it contains a simple extension of web content categorizer, which can separate between the web page depending on their content.
 * [SimpleCrawler](https://github.com/lei-zhu/SimpleCrawler) - Simple spider base on mutithreading, regluar expression.
-* [DotnetSpider](https://github.com/zlzforever/DotnetSpider) - This is a cross platfrom, ligth spider develop by C#.
 * [Abot](https://github.com/sjdirect/abot) - C# web crawler built for speed and flexibility.
 * [Hawk](https://github.com/ferventdesert/Hawk) - Advanced Crawler and ETL tool written in C#/WPF.
-* [SkyScraper](https://github.com/JonCanning/SkyScraper) - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
 * [Infinity Crawler](https://github.com/TurnerSoftware/InfinityCrawler) - A simple but powerful web crawler library in C#.

 ## JavaScript
@@ -124,15 +105,11 @@ Documents, papers
 * [open-source-search-engine](https://github.com/gigablast/open-source-search-engine) - A distributed open source search engine and spider/crawler written in C/C++.
 * [SpiderSuite](https://github.com/3nock/SpiderSuite) - An advance, cross-platform web security crawler. Built using C++ Qt framework.

-## C
-* [httrack](https://github.com/xroche/httrack) - Copy websites to your computer.
-
 ## Ruby
 * [Nokogiri](https://github.com/sparklemotion/nokogiri) - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
 * [upton](https://github.com/propublica/upton) - A batteries-included framework for easy web-scraping. Just add CSS(Or do more).
 * [wombat](https://github.com/felipecsl/wombat) - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
 * [RubyRetriever](https://github.com/joenorton/rubyretriever) - RubyRetriever is a Web Crawler, Scraper & File Harvester.
-* [Spidr](https://github.com/postmodern/spidr) - Spider a site, multiple domains, certain links or infinitely.
 * [Cobweb](https://github.com/stewartmckee/cobweb) - Web crawler with very flexible crawling options, standalone or using sidekiq.
 * [mechanize](https://github.com/sparklemotion/mechanize) - Automated web interaction & crawling.

From 984c4b129c9ac21138daf178b3f5f5bcf40c7a1f Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:06:34 +0200
Subject: [PATCH 14/26] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 8de9b32..8b12862 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,6 @@ Documents, papers

 ## Python
 * [distribute_crawler](https://github.com/gnemoug/distribute_crawler) - Uses scrapy,redis, mongodb,graphite to create a distributed spider.
-* [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.
 * [cola](https://github.com/chineking/cola) - A distributed crawling framework.
 * [crawley](https://github.com/jmg/crawley) - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
 * [Gain](https://github.com/gaojiuli/gain) - Web crawling framework based on asyncio for everyone.
@@ -45,6 +44,7 @@ Documents, papers
 * [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3
 * [tiny-web-crawler](https://github.com/indrajithi/tiny-web-crawler) - A simple and efficient web crawler for Python.
 * [crawler-buddy](https://github.com/rumca-js/crawler-buddy) - A crawling server with JSON interface
+* [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.

 ## Java
 * [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
From 0913ef873ce58f5e408ca0737d7fd64fdf4f4db5 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:08:10 +0200
Subject: [PATCH 15/26] Update README.md

---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 8b12862..c4ab0e5 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,6 @@ Documents, papers
 * [Gain](https://github.com/gaojiuli/gain) - Web crawling framework based on asyncio for everyone.
 * [sukhoi](https://github.com/iogf/sukhoi) - Minimalist and powerful Web Crawler.
 * [spidy](https://github.com/rivermont/spidy) - The simple, easy to use command line web crawler.
-* [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3
 * [tiny-web-crawler](https://github.com/indrajithi/tiny-web-crawler) - A simple and efficient web crawler for Python.
 * [crawler-buddy](https://github.com/rumca-js/crawler-buddy) - A crawling server with JSON interface
 * [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.
@@ -144,3 +143,6 @@ Documents, papers
 * [crawler](https://github.com/bplawler/crawler) - Scala DSL for web crawling.
 * [scrala](https://github.com/gaocegege/scrala) - Scala crawler(spider) framework, inspired by scrapy.
 * [ferrit](https://github.com/reggoodwin/ferrit) - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
+
+## Libraries
+* [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3

From a98b0116f95181a44ecca13fdc9a400b57bbb166 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:08:42 +0200
Subject: [PATCH 16/26] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index c4ab0e5..ed7d250 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@ Documents, papers
 * [Gain](https://github.com/gaojiuli/gain) - Web crawling framework based on asyncio for everyone.
 * [sukhoi](https://github.com/iogf/sukhoi) - Minimalist and powerful Web Crawler.
 * [spidy](https://github.com/rivermont/spidy) - The simple, easy to use command line web crawler.
-* [tiny-web-crawler](https://github.com/indrajithi/tiny-web-crawler) - A simple and efficient web crawler for Python.
+* [tiny-web-crawler](https://github.com/DataCrawl-AI/datacrawl) - A simple and efficient web crawler for Python.
 * [crawler-buddy](https://github.com/rumca-js/crawler-buddy) - A crawling server with JSON interface
 * [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.

From cb6a68851f2c969c6ab2eed603f2da456fd18edc Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:10:39 +0200
Subject: [PATCH 17/26] Update README.md

---
 README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/README.md b/README.md
index ed7d250..9481829 100644
--- a/README.md
+++ b/README.md
@@ -48,7 +48,6 @@ Documents, papers
 ## Java
 * [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
 * [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environment.
-    * [anthelion](https://github.com/yahoo/anthelion) - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
 * [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
 * [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML information extraction.
 * [Open Search Server](http://www.opensearchserver.com/) - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.

From 391fd5d659efa477ce99257fdea4af18b7197f7d Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:11:14 +0200
Subject: [PATCH 18/26] Update README.md

---
 README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/README.md b/README.md
index 9481829..0b573f8 100644
--- a/README.md
+++ b/README.md
@@ -49,7 +49,6 @@ Documents, papers
 * [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
 * [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environment.
 * [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
-* [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML information extraction.
 * [Open Search Server](http://www.opensearchserver.com/) - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
 * [Gecco](https://github.com/xtuhcy/gecco) - A easy to use lightweight web crawler
 * [WebCollector](https://github.com/CrawlScript/WebCollector) - Simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.
From 332ab4eff995d9d4306e2a358e5aac5aa1a53ae5 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:21:56 +0200
Subject: [PATCH 19/26] Update README.md

---
 README.md | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/README.md b/README.md
index 0b573f8..ae01c60 100644
--- a/README.md
+++ b/README.md
@@ -114,15 +114,9 @@ Documents, papers
 * [spider](https://github.com/spider-rs/spider) - The fastest web crawler and indexer.
 * [crawler](https://github.com/a11ywatch/crawler) - A gRPC web indexer turbo charged for performance.

-## R
-* [rvest](https://github.com/hadley/rvest) - Simple web scraping for R.
-
 ## Erlang
 * [ebot](https://github.com/matteoredaelli/ebot) - A scalable, distribuited and highly configurable web cawler.

-## Perl
-* [web-scraper](https://github.com/miyagawa/web-scraper) - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.
-
 ## Go
 * [pholcus](https://github.com/henrylee2cn/pholcus) - A distributed, high concurrency and powerful web crawler.
 * [gocrawl](https://github.com/PuerkitoBio/gocrawl) - Polite, slim and concurrent web crawler.
@@ -130,11 +124,7 @@ Documents, papers
 * [go_spider](https://github.com/hu17889/go_spider) - An awesome Go concurrent Crawler(spider) framework.
 * [dht](https://github.com/shiyanhui/dht) - BitTorrent DHT Protocol && DHT Spider.
 * [ants-go](https://github.com/wcong/ants-go) - A open source, distributed, restful crawler engine in golang.
-* [scrape](https://github.com/yhat/scrape) - A simple, higher level interface for Go web scraping.
 * [creeper](https://github.com/wspl/creeper) - The Next Generation Crawler Framework (Go).
-* [colly](https://github.com/asciimoo/colly) - Fast and Elegant Scraping Framework for Gophers.
-* [ferret](https://github.com/MontFerret/ferret) - Declarative web scraping.
-* [Dataflow kit](https://github.com/slotix/dataflowkit) - Extract structured data from web pages. Web sites scraping.
 * [Hakrawler](https://github.com/hakluke/hakrawler) - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
 
 ## Scala

From 1a3a4b73d58578f1e8028ad5f38812fef440e33f Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:23:21 +0200
Subject: [PATCH 20/26] Update README.md

---
 README.md | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/README.md b/README.md
index ae01c60..9855525 100644
--- a/README.md
+++ b/README.md
@@ -72,19 +72,14 @@ Documents, papers
 * [Infinity Crawler](https://github.com/TurnerSoftware/InfinityCrawler) - A simple but powerful web crawler library in C#.
 
 ## JavaScript
-* [scraperjs](https://github.com/ruipgil/scraperjs) - A complete and versatile web scraper.
-* [scrape-it](https://github.com/IonicaBizau/scrape-it) - A Node.js scraper for humans.
 * [simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Event driven web crawler.
 * [node-crawler](https://github.com/bda-research/node-crawler) - Node-crawler has clean,simple api.
 * [js-crawler](https://github.com/antivanov/js-crawler) - Web crawler for Node.JS, both HTTP and HTTPS are supported.
 * [webster](https://github.com/zhuyingda/webster) - A reliable web crawling framework which can scrape ajax and js rendered content in a web page.
 * [x-ray](https://github.com/lapwinglabs/x-ray) - Web scraper with pagination and crawler support.
-* [node-osmosis](https://github.com/rchipka/node-osmosis) - HTML/XML parser and web scraper for Node.js.
-* [web-scraper-chrome-extension](https://github.com/martinsbalodis/web-scraper-chrome-extension) - Web data extraction tool implemented as chrome extension.
 * [supercrawler](https://github.com/brendonboshell/supercrawler) - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
 * [headless-chrome-crawler](https://github.com/yujiosaka/headless-chrome-crawler) - Headless Chrome crawls with jQuery support
 * [Squidwarc](https://github.com/n0tan3rd/squidwarc) - High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
-* [crawlee](https://github.com/apify/crawlee) - A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
 * [Sasori](https://github.com/karthikuj/sasori) - A dynamic web crawler built on Puppeteer with support for authentication.
 
 ## PHP

From 592f75244dba2c283d274d708cbb2175c54ae0b0 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:24:18 +0200
Subject: [PATCH 21/26] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9855525..e60d2e0 100644
--- a/README.md
+++ b/README.md
@@ -46,6 +46,7 @@ Documents, papers
 * [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.
 
 ## Java
+* [Marginalia search](https://github.com/MarginaliaSearch/MarginaliaSearch) - Marginalia search crawler
 * [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
 * [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environment.
 * [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
@@ -61,7 +62,6 @@ Documents, papers
 * [Spark-Crawler](https://github.com/USCDataScience/sparkler) - Evolving Apache Nutch to run on Spark.
 * [webBee](https://github.com/pkwenda/webBee) - A DFS web spider.
 * [Norconex Web Crawler](https://github.com/Norconex/collector-http) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.
-* [Marginalia search](https://github.com/MarginaliaSearch/MarginaliaSearch) - Marginalia search crawler
 * [Scrapegraph-ai](https://github.com/VinciGit00/Scrapegraph-ai) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.
 
 ## C#

From 861d893986362f5f053bfd62a8b01566c707a5bf Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:25:24 +0200
Subject: [PATCH 22/26] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index e60d2e0..b1a787f 100644
--- a/README.md
+++ b/README.md
@@ -36,6 +36,7 @@ Documents, papers
 
 ## Python
 * [distribute_crawler](https://github.com/gnemoug/distribute_crawler) - Uses scrapy,redis, mongodb,graphite to create a distributed spider.
+* [crawl4ai](https://github.com/unclecode/crawl4ai)
 * [cola](https://github.com/chineking/cola) - A distributed crawling framework.
 * [crawley](https://github.com/jmg/crawley) - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
 * [Gain](https://github.com/gaojiuli/gain) - Web crawling framework based on asyncio for everyone.

From dfa4e1c7624e7a51e2c4b41f3839ddd74bcd71c9 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:26:53 +0200
Subject: [PATCH 23/26] Update README.md

---
 README.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index b1a787f..db92aae 100644
--- a/README.md
+++ b/README.md
@@ -12,14 +12,13 @@ We do not have to duplicate awesome web scraping data. Therefore we provide some
 - [JavaScript](#javascript)
 - [PHP](#php)
 - [C++](#c-1)
-- [C](#c-2)
 - [Ruby](#ruby)
 - [Rust](#rust)
-- [R](#r)
 - [Erlang](#erlang)
-- [Perl](#perl)
 - [Go](#go)
 - [Scala](#scala)
+- [Other](#other)
+- [Libraries](#libraries)
 
 ## Resources
 * [Awesome web scraping](https://github.com/lorien/awesome-web-scraping) - resources about web scraping
@@ -128,5 +127,8 @@ Documents, papers
 * [scrala](https://github.com/gaocegege/scrala) - Scala crawler(spider) framework, inspired by scrapy.
 * [ferrit](https://github.com/reggoodwin/ferrit) - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
 
+## Other
+* [Easy Spider](https://github.com/NaiboWang/EasySpider)
+
 ## Libraries
 * [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3

From 5ad462128d053755fc8e6b4a7833af9ff9272d63 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:27:29 +0200
Subject: [PATCH 24/26] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index db92aae..989fb92 100644
--- a/README.md
+++ b/README.md
@@ -49,6 +49,7 @@ Documents, papers
 * [Marginalia search](https://github.com/MarginaliaSearch/MarginaliaSearch) - Marginalia search crawler
 * [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
 * [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environment.
+* [Apache Storm](https://stormcrawler.apache.org/)
 * [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
 * [Open Search Server](http://www.opensearchserver.com/) - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
 * [Gecco](https://github.com/xtuhcy/gecco) - A easy to use lightweight web crawler

From 0bdad428d63cd4e8cbf4245b2c30efc576a2d1c4 Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:29:02 +0200
Subject: [PATCH 25/26] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 989fb92..55460ae 100644
--- a/README.md
+++ b/README.md
@@ -73,6 +73,7 @@ Documents, papers
 * [Infinity Crawler](https://github.com/TurnerSoftware/InfinityCrawler) - A simple but powerful web crawler library in C#.
 
 ## JavaScript
+* [browsertrix crawler](https://github.com/webrecorder/browsertrix-crawler)
 * [simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Event driven web crawler.
 * [node-crawler](https://github.com/bda-research/node-crawler) - Node-crawler has clean,simple api.
 * [js-crawler](https://github.com/antivanov/js-crawler) - Web crawler for Node.JS, both HTTP and HTTPS are supported.

From ad26f1968af47344cc7e33bb1c6c9d2f152e59cd Mon Sep 17 00:00:00 2001
From: X Y Z
Date: Tue, 17 Jun 2025 15:29:50 +0200
Subject: [PATCH 26/26] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 55460ae..f4f0eca 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ We do not have to duplicate awesome web scraping data. Therefore we provide some
 - [Libraries](#libraries)
 
 ## Resources
-* [Awesome web scraping](https://github.com/lorien/awesome-web-scraping) - resources about web scraping
+* [Awesome web scraping](https://github.com/lorien/awesome-web-scraping) - resources about web scraping, tools, frameworks
 * [Common Crawl](https://commoncrawl.org/)
 * [Internet Archive](https://www.archive.org/)
 * [Anna's Archive](https://annas-archive.org/)