Geziyor

Geziyor is a blazing fast web crawling and web scraping framework for Go. It crawls websites and extracts structured data from them, which makes it useful for a wide range of purposes such as data mining, monitoring, and automated testing.

Features

  • 1,000+ Requests/Sec
  • JS Rendering
  • Caching (Memory/Disk)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Automatic response decoding to UTF-8

See scraper Options for all custom settings.

Status

Because the project is still in development, the API may change over time. We highly recommend using Geziyor with Go modules.

Usage

Simple usage

geziyor.NewGeziyor(geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(r *geziyor.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Advanced usage

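package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    // Assumed import path: the exporter package is taken to live under the module root.
    "github.com/geziyor/geziyor/exporter"
)

func main() {
    // Start at the first page and export every scraped item as JSON.
    geziyor.NewGeziyor(geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []geziyor.Exporter{exporter.JSONExporter{}},
    }).Start()
}

// quotesParse extracts each quote on the page, then follows the "next" link.
func quotesParse(r *geziyor.Response) {
    r.DocHTML.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        // Items sent on r.Exports are handed to the configured exporters.
        r.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    // Schedule the next page, if there is one, with the same parse function.
    if href, ok := r.DocHTML.Find("li.next > a").Attr("href"); ok {
        r.Geziyor.Get(r.JoinURL(href), quotesParse)
    }
}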

See tests for more usage examples.
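
The pagination step in the advanced example passes the "next" link through r.JoinURL before requesting it, because the href on the page is relative. The sketch below shows the same idea with the standard net/url package. The joinURL function is a hypothetical stand-in written for illustration, and whether r.JoinURL behaves identically in every edge case is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// joinURL resolves href against base, the way a browser resolves a link
// found on the page at base. Hypothetical helper for illustration.
func joinURL(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(h).String(), nil
}

func main() {
	next, err := joinURL("http://quotes.toscrape.com/", "/page/2/")
	if err != nil {
		panic(err)
	}
	fmt.Println(next) // prints http://quotes.toscrape.com/page/2/
}
```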

Installation

go get github.com/geziyor/geziyor