Geziyor

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.

Features

  • 5,000+ Requests/Sec
  • JS Rendering
  • Caching (Memory/Disk)
  • Automatic Data Extracting (CSS Selectors)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies and Middlewares
  • Automatic response decoding to UTF-8

See scraper Options for all custom settings.
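
For instance, a crawler that caches responses in memory, limits concurrency, and randomizes request delays could be configured as in the sketch below. The field names (Cache, ConcurrentRequests, ConcurrentRequestsPerDomain, RequestDelay, RequestDelayRandomize) and the use of github.com/gregjones/httpcache for the in-memory cache are assumptions; check the Options documentation for the authoritative list.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    // The option names below are assumed; see Options for the exact fields.
    Cache:                       httpcache.NewMemoryCache(), // cache responses in memory
    ConcurrentRequests:          10,                         // global concurrency limit
    ConcurrentRequestsPerDomain: 2,                          // per-domain concurrency limit
    RequestDelay:                time.Second,                // base delay between requests
    RequestDelayRandomize:       true,                       // randomize the delay
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()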

Status

The project is still in the development phase. Thus, we highly recommend using Geziyor with Go modules.

Examples

Simple usage

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Advanced usage

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []export.Exporter{&export.JSON{}},
    }).Start()
}

func quotesParse(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
        g.Get(r.JoinURL(href), quotesParse)
    }
}

See tests for more usage examples.

Documentation

Installation

go get github.com/geziyor/geziyor

NOTE: macOS limits the maximum number of open file descriptors. If you want to make concurrent requests over 256, you need to increase limits. Read this for more.
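
For example, to raise the open-file limit for the current shell session before a high-concurrency crawl (the value 10240 is only an illustration):

ulimit -n 10240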

Making Requests

Initial requests start with the StartURLs []string field in Options. Geziyor makes concurrent requests to those URLs. After a response is read, ParseFunc func(g *Geziyor, r *Response) is called.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

If you want to create the first requests manually, set StartRequestsFunc; StartURLs won't be used in that case.
You can make requests using the Geziyor methods:

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.Get("https://httpbin.org/anything", g.Opt.ParseFunc)         // plain GET request
        g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc) // GET with JS rendering in a headless browser
        g.Head("https://httpbin.org/anything", g.Opt.ParseFunc)        // HEAD request
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Extracting Data

We can extract HTML elements using response.HTMLDoc. HTMLDoc is goquery's Document type.

HTMLDoc is accessible on the Response if the response is HTML and can be parsed with Go's built-in HTML parser. If the response isn't HTML, response.HTMLDoc will be nil.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            log.Println(s.Find("span.text").Text(), s.Find("small.author").Text())
        })
    },
}).Start()
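
Since response.HTMLDoc is nil for non-HTML responses, guard against it when a URL might return something other than HTML. A minimal sketch:

ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    if r.HTMLDoc == nil {
        // Response isn't parseable HTML (e.g. JSON); fall back to the raw body.
        log.Println(string(r.Body))
        return
    }
    log.Println(r.HTMLDoc.Find("title").Text())
},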

Exporting Data

You can export data automatically using exporters: just send data to the Geziyor.Exports chan. See the export package for the available exporters.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            g.Exports <- map[string]interface{}{
                "text":   s.Find("span.text").Text(),
                "author": s.Find("small.author").Text(),
            }
        })
    },
    Exporters: []export.Exporter{&export.JSON{}},
}).Start()
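
Custom exporters implement the export.Exporter interface. Below is a minimal sketch that prints every exported item; it assumes the interface consists of a single Export(exports chan interface{}) method, as the bundled exporters implement. Verify against the export package.

// StdoutExporter is a hypothetical custom exporter.
type StdoutExporter struct{}

// Export drains the exports channel until Geziyor closes it at the end
// of the crawl, printing each exported item.
func (e *StdoutExporter) Export(exports chan interface{}) {
    for item := range exports {
        fmt.Printf("%+v\n", item)
    }
}

It can then be plugged in via Exporters: []export.Exporter{&StdoutExporter{}}.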

Benchmark

8,748 requests per second on a MacBook Pro 15" (2016)

See tests for this benchmark function:

>> go test -run none -bench Requests -benchtime 10s
goos: darwin
goarch: amd64
pkg: github.com/geziyor/geziyor
BenchmarkRequests-8   	  200000	    108710 ns/op
PASS
ok  	github.com/geziyor/geziyor	22.861s
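
A quick sanity check on the numbers: at 108,710 ns per operation, the benchmark sustains roughly 10^9 / 108,710 ≈ 9,200 requests per second, in the same range as the figure quoted above.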

Roadmap

If you're interested in helping this project, please consider these features:

  • Command-line tool for pausing and resuming scrapers, etc. (like this)
  • Deploying scrapers to the cloud
  • Automatically exporting extracted data to multiple destinations (AWS, FTP, DB, JSON, CSV, etc.)
  • Downloading media (images, videos, etc.) (like this)
  • Real-time metrics (Prometheus, etc.)