Geziyor
Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.
Features
- 1,000+ Requests/Sec
- JS Rendering
- Caching (Memory/Disk)
- Automatic Data Exporting (JSON, CSV, or custom)
- Limit Concurrency (Global/Per Domain)
- Request Delays (Constant/Randomized)
- Automatic response decoding to UTF-8
- Middlewares
See scraper Options for all custom settings.
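Most of these features are plain fields on Options. Below is a minimal sketch of the concurrency, delay, and export settings; the field names follow the Options struct but are assumptions that should be checked against the version you install:
package main

import (
    "time"

    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/exporter"
)

func main() {
    geziyor.NewGeziyor(geziyor.Options{
        StartURLs:                   []string{"http://quotes.toscrape.com/"},
        ParseFunc:                   func(g *geziyor.Geziyor, r *geziyor.Response) {}, // handle responses here
        ConcurrentRequests:          32,                     // global concurrency limit
        ConcurrentRequestsPerDomain: 8,                      // per-domain concurrency limit
        RequestDelay:                500 * time.Millisecond, // constant delay between requests
        RequestDelayRandomize:       true,                   // randomize the delay
        Exporters:                   []geziyor.Exporter{exporter.JSONExporter{}},
    }).Start()
}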
Status
Since the project is still in development, the API may change over time. We therefore highly recommend using Geziyor with Go modules.
Examples
Simple usage
geziyor.NewGeziyor(geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *geziyor.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()
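Run as a standalone program, the same snippet needs only a main package and two imports:
package main

import (
    "fmt"

    "github.com/geziyor/geziyor"
)

func main() {
    geziyor.NewGeziyor(geziyor.Options{
        StartURLs: []string{"http://api.ipify.org"},
        ParseFunc: func(g *geziyor.Geziyor, r *geziyor.Response) {
            // api.ipify.org returns the caller's public IP as plain text.
            fmt.Println(string(r.Body))
        },
    }).Start()
}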
Advanced usage
func main() {
    geziyor.NewGeziyor(geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []geziyor.Exporter{exporter.JSONExporter{}},
    }).Start()
}

func quotesParse(g *geziyor.Geziyor, r *geziyor.Response) {
    // Extract each quote on the page and send it to the exporters.
    r.DocHTML.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    // Follow the "next page" link, reusing this parse function.
    if href, ok := r.DocHTML.Find("li.next > a").Attr("href"); ok {
        g.Get(r.JoinURL(href), quotesParse)
    }
}
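Exporters are pluggable ("JSON, CSV, or custom" in the feature list above). Here is a hypothetical custom exporter, assuming geziyor.Exporter is satisfied by an Export method that drains the exports channel the way the built-in JSONExporter does; verify the exact signature against your installed release:
// StdoutExporter is a hypothetical exporter that prints every item
// sent to g.Exports instead of writing it to a file. The Export
// signature is an assumption based on the version shown above.
type StdoutExporter struct{}

func (e StdoutExporter) Export(exports chan interface{}) {
    for item := range exports {
        fmt.Printf("%+v\n", item)
    }
}
It plugs into the same slot as the built-in exporters: Exporters: []geziyor.Exporter{StdoutExporter{}}.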
See tests for more usage examples.
Documentation
Installation
go get github.com/geziyor/geziyor
NOTE: macOS limits the maximum number of open file descriptors. If you want to make more than 256 concurrent requests, you need to raise this limit. Read this for more.
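For example, one way to raise the limit for the current shell session (pick a cap above your planned concurrency):
ulimit -n 10240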
Making Requests
Initial requests start with the StartURLs []string field in Options. Geziyor makes concurrent requests to those URLs. After a response is read, ParseFunc func(g *Geziyor, r *Response) is called.
geziyor.NewGeziyor(geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *geziyor.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()
If you want to create the first requests manually, set StartRequestsFunc. StartURLs won't be used if you create requests manually.
You can make subsequent requests using the following Geziyor methods:
- Get: Make a GET request
- GetRendered: Make a GET request and render JavaScript using a headless browser. As it opens up a real browser, it takes a couple of seconds per request.
- Head: Make a HEAD request
- Do: Make a custom request by providing a *geziyor.Request
geziyor.NewGeziyor(geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
        g.Head("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *geziyor.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()
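For Do, any *http.Request can be wrapped in a geziyor.Request. A sketch, assuming geziyor.Request embeds *http.Request as in the version shown here (the custom header is purely illustrative):
geziyor.NewGeziyor(geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        // Build a plain *http.Request, then wrap it for Geziyor.
        req, err := http.NewRequest("GET", "https://httpbin.org/anything", nil)
        if err != nil {
            return
        }
        req.Header.Set("X-Custom-Header", "value") // illustrative header
        g.Do(&geziyor.Request{Request: req}, g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *geziyor.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()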
Roadmap
If you're interested in helping this project, please consider these features:
- Command line tool for: pausing and resuming scraper etc. (like this)
- Automatic item extractors (like this)
- Deploying Scrapers to Cloud
- Automatically exporting extracted data to multiple places (AWS, FTP, DB, JSON, CSV etc.)
- Downloading media (Images, Videos etc.) (like this)
- Realtime metrics (Prometheus etc.)