geziyor/README.md
Ibrahim Serdar Acikgoz 7360ffa3c9
Update README.md
2019-06-14 14:57:53 +03:00

Geziyor

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.

Features

  • 1,000+ Requests/Sec
  • JS Rendering
  • Caching (Memory/Disk)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Automatic response decoding to UTF-8

See scraper Options for all custom settings.
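To make the concurrency-limit and randomized-delay features above more concrete, here is a minimal standard-library sketch of both ideas. It is not Geziyor's implementation. The helper names `jitter` and `runLimited` are hypothetical, and the 0.5x to 1.5x delay range is an assumption borrowed from a common crawler convention.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// jitter scales base by a factor in [0.5, 1.5), given a random fraction f in [0, 1).
// Hypothetical helper; the 0.5x-1.5x range is an assumption, not Geziyor's documented behavior.
func jitter(base time.Duration, f float64) time.Duration {
	return time.Duration((0.5 + f) * float64(base))
}

// runLimited runs all tasks with at most limit of them executing at once
// and returns the highest number of tasks observed running simultaneously.
func runLimited(limit int, tasks []func()) int {
	sem := make(chan struct{}, limit) // buffered channel used as a counting semaphore
	var (
		wg            sync.WaitGroup
		mu            sync.Mutex
		running, peak int
	)
	for _, task := range tasks {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks while limit tasks are running
		go func(task func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the task finishes

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			task()

			mu.Lock()
			running--
			mu.Unlock()
		}(task)
	}
	wg.Wait()
	return peak
}

func main() {
	base := 10 * time.Millisecond
	tasks := make([]func(), 6)
	for i := range tasks {
		// Each simulated request waits for a randomized delay.
		tasks[i] = func() { time.Sleep(jitter(base, rand.Float64())) }
	}
	fmt.Println("peak concurrent tasks:", runLimited(2, tasks))
}
```

A per-domain limit could follow the same pattern with one semaphore per host name. In Geziyor itself these behaviors are configured through the scraper Options rather than written by hand.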

Status

The project is still under development, so the API may change over time. We highly recommend using Geziyor with Go modules.

Usage

Simple usage

// Fetch a single URL and print the response body.
geziyor.NewGeziyor(geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(r *geziyor.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Advanced usage

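package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/exporter"
)

func main() {
    geziyor.NewGeziyor(geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []geziyor.Exporter{exporter.JSONExporter{}},
    }).Start()
}

// quotesParse exports every quote on the page, then follows the "next" link.
func quotesParse(r *geziyor.Response) {
    r.DocHTML.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        // Each value sent to r.Exports is handed to the configured exporters.
        r.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    // Request the next page, if any, and parse it with the same function.
    if href, ok := r.DocHTML.Find("li.next > a").Attr("href"); ok {
        go r.Geziyor.Get(r.JoinURL(href), quotesParse)
    }
}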

See tests for more usage examples.
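The pagination step in the advanced example passes the `href` of the "next" link through `r.JoinURL` before requesting it. That `href` is typically a relative path such as `/page/2/`, so it has to be resolved against the URL of the current page. The sketch below shows that resolution with only the standard library. `resolveHref` is a hypothetical helper used for illustration, and it is an assumption that `JoinURL` behaves in a similar way.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref resolves an href found on a page against that page's URL.
// Hypothetical helper for illustration; not part of Geziyor.
func resolveHref(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference follows the RFC 3986 rules for relative references.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	next, err := resolveHref("http://quotes.toscrape.com/", "/page/2/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(next) // http://quotes.toscrape.com/page/2/
}
```

An absolute `href` passes through unchanged, so the same helper works for links that point to other hosts.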

Installation

go get github.com/geziyor/geziyor