Robots.txt support implemented

Musab Gültekin
2019-07-06 16:18:03 +03:00
parent 2cab68d2ce
commit 42faa92ece
9 changed files with 154 additions and 64 deletions


@@ -9,12 +9,11 @@ Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them.
- 5,000+ Requests/Sec
- JS Rendering
- Caching (Memory/Disk)
- Automatic Data Extracting (CSS Selectors)
- Automatic Data Exporting (JSON, CSV, or custom)
- Metrics (Prometheus, Expvar, or custom)
- Limit Concurrency (Global/Per Domain)
- Request Delays (Constant/Randomized)
- Cookies and Middlewares
- Cookies, Middlewares, robots.txt
- Automatic response decoding to UTF-8
See scraper [Options](https://godoc.org/github.com/geziyor/geziyor#Options) for all custom settings.
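
Since the feature list now advertises robots.txt handling, a minimal usage sketch follows. It assumes the public API around this commit: `geziyor.Options` with a `RobotsTxtDisabled` opt-out switch (the field name is an assumption based on this commit; checks are taken to be enabled by default) and a `ParseFunc` receiving `*geziyor.Geziyor` and `*client.Response`.

```go
package main

import (
	"fmt"

	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
)

func main() {
	geziyor.NewGeziyor(&geziyor.Options{
		StartURLs: []string{"http://quotes.toscrape.com/"},
		// RobotsTxtDisabled is assumed to be the opt-out flag added by this
		// commit; leaving it false keeps robots.txt checks enabled.
		RobotsTxtDisabled: false,
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			// Print the raw body of every page the crawler was allowed to fetch.
			fmt.Println(string(r.Body))
		},
	}).Start()
}
```

Enabling the check by default and exposing only an opt-out matches how most crawling frameworks treat robots.txt.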
@@ -64,6 +63,8 @@ See [tests](https://github.com/geziyor/geziyor/blob/master/geziyor_test.go) for
### Installation
Go 1.12 required
```
go get github.com/geziyor/geziyor
```
**NOTE**: macOS limits the maximum number of open file descriptors (256 by default), so raise that limit or cap the scraper's concurrency before making hundreds of concurrent requests; see the sketch below.
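
Given that descriptor ceiling, the practical alternative to raising limits is capping concurrency via the "Limit Concurrency (Global/Per Domain)" feature above. A minimal sketch, assuming `ConcurrentRequests` and `ConcurrentRequestsPerDomain` are the corresponding option names:

```go
package main

import (
	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
)

func main() {
	geziyor.NewGeziyor(&geziyor.Options{
		StartURLs: []string{"http://quotes.toscrape.com/"},
		// Assumed option names for the "Limit Concurrency (Global/Per Domain)"
		// feature; 200 stays safely under macOS's default 256-descriptor cap.
		ConcurrentRequests:          200,
		ConcurrentRequestsPerDomain: 50,
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {},
	}).Start()
}
```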
@@ -161,7 +162,6 @@ ok github.com/geziyor/geziyor 22.861s
If you're interested in helping this project, please consider these features:
- Command-line tool for pausing and resuming scrapers, etc. (like [this](https://docs.scrapy.org/en/latest/topics/commands.html))
- Deploying Scrapers to Cloud
- ~~Automatically exporting extracted data to multiple places (AWS, FTP, DB, JSON, CSV etc)~~
- Downloading media (Images, Videos etc) (like [this](https://docs.scrapy.org/en/latest/topics/media-pipeline.html))
- ~~Realtime metrics (Prometheus etc.)~~