Commit Graph

126 Commits

SHA1 Message Date
f35d34bc02 chromedp library updated. 2021-05-23 23:14:47 +03:00
16265e524d Response.JoinURL simplified. 2021-05-18 13:31:23 +03:00
3c9a3849e2 Start command now waits for synchronized requests too. This fixes the case where requests are made from different goroutines using synchronized requests. It doesn't cause any issues for concurrent requests because we already wait for them. 2021-04-19 12:58:47 +03:00
d28beca57a Fix race condition on hosts semaphore 2021-04-17 14:46:45 +03:00
c527d0b885 SIGINT (interrupt) signal handling refactored; fixed it not working under some conditions 2021-04-17 14:11:17 +03:00
6a23efd175 JoinURL now returns *url.URL and error 2021-04-17 11:12:22 +03:00
9ea67b3554 Use fmt.Errorf instead of errors package. This is a good convention since Go 1.13 2021-04-17 11:11:29 +03:00
fbee722a38 Rate limiting per second implemented 2021-04-16 15:31:31 +03:00
d8252092f7 Add duplicate_requests_test.go 2021-04-16 14:43:42 +03:00
be4d13c0ef Retry checking refactored using util function. 2021-04-14 09:32:42 +03:00
46c4db6b1a Exporters now need to return an error. This was done to simplify error logging. 2021-04-14 09:30:17 +03:00
e3d79e2574 Added custom logger. It is not configurable yet. 2021-04-13 23:36:42 +03:00
129402d754 Updated chromedp 2021-01-28 20:50:25 +03:00
9b266b6cce Allocator options added 2021-01-28 20:49:01 +03:00
29c29235ae Fixed response error when retrying is disabled 2020-09-05 17:24:22 +03:00
7a76a9b95e Allocators separated for transparency. Updated chrome library. 2020-09-05 16:14:41 +03:00
cfb16fe1ee Call ErrorFunc on errors. Unexport DoRequestClient and DoRequestChrome 2019-12-13 00:03:44 +03:00
7d2fe57bab Added error logging for HTML parser. 2019-12-11 13:55:38 +03:00
cbca22fefb Updated chrome protocol library 2019-11-16 20:34:57 +03:00
6645820408 Added logging on allowed domains middleware and duplicate requests 2019-11-16 20:34:09 +03:00
9b8a3837bd Added response joinURL test and updated chromedp. 2019-09-13 14:34:29 +03:00
3264057679 Fixed issue on JoinURL 2019-08-06 17:21:41 +03:00
86d4e80596 Added user-agent test. Fixed failing test. 2019-08-05 16:18:44 +03:00
85597219e6 Refactored client options. Fixed default User-Agent string not being set. 2019-08-05 15:42:30 +03:00
0e5230eac8 Remote endpoint support added for js rendered requests. Geziyor is beta now. 2019-08-05 15:14:47 +03:00
c117d71fef Updated license 2019-08-05 15:01:48 +03:00
32077d8433 Updated docs for rendered requests 2019-07-26 16:40:42 +03:00
e07ef4d66d Fixed an important rendering bug that caused a client request to be made too. Updated chromedp dependency 2019-07-26 16:07:09 +03:00
762854e511 Go 1.10 and 1.11 support added by using different methods on reflect package. 2019-07-21 12:08:41 +03:00
df37629d4d Disabled indenting on JSON exporter as it looks ugly in exported data. JSONLine still supports indenting. 2019-07-14 03:37:52 +03:00
dfabcb84fd JSON renamed to JSONLine. JSON List support added. 2019-07-14 03:30:59 +03:00
d19465c44a Robotstxt metrics added. 2019-07-08 14:51:54 +03:00
d3c4389c46 Retrying support added for chrome. Fixed robots.txt retry issue. Fixed Meta issue 2019-07-07 19:50:15 +03:00
90d2be2210 Caching policies added. We used the httpcache library to implement this. As it could not support different policies, I mostly copied and modified it. 2019-07-07 12:18:40 +03:00
0d6c2a6864 Graceful shut down system implemented 2019-07-06 18:32:13 +03:00
42faa92ece Robots.txt support implemented 2019-07-06 16:18:03 +03:00
2cab68d2ce Middlewares refactored to multiple files in middleware package. Extractors removed as they add complexity to the scraper, both in learning and in development. 2019-07-04 21:04:29 +03:00
9adff75509 Retry requests support implemented for client. 2019-07-04 13:36:10 +03:00
da03567fae Extractors refactored to support pass by value. Documentation added for request and response. 2019-07-04 02:13:29 +03:00
71683ec6de Chardet removed as it's not good enough at detection. The built-in library is good enough. 2019-07-03 20:54:17 +03:00
33238bc875 Charset detection heuristics added with chardet lib. 2019-07-03 18:08:28 +03:00
b355a566cf Added more tests and refactored exporter tests. Added code coverage badge. 2019-07-02 14:53:06 +03:00
4ab7cfd904 Exporter and Extractor interfaces moved to their own packages for simplicity of the main Geziyor package 2019-07-02 13:22:23 +03:00
c0dd0393e6 Maximum redirection option added. Performance improvement on exports. Duplicate requests only checked on GET requests. 2019-07-01 15:44:28 +03:00
80f3500a69 Fixed Chrome response being incorrect on some sites. 2019-07-01 12:32:15 +03:00
fb5b4e3406 README updated according to new package names 2019-06-30 22:21:36 +03:00
0eda056065 Attribute extractor added. HTML extractor added. Outer HTML Extractor added. exporter package renamed to export, extractor package renamed to extract for simplicity. 2019-06-30 22:20:17 +03:00
7c383b175f Metrics Server support added for expvar. Refactored some methods. 2019-06-30 19:09:03 +03:00
ec4551a8a0 Making Requests and reading responses refactored to client package. 2019-06-30 16:21:18 +03:00
0eac5f5f40 Fixed exporters bug that caused the last exported items not to be written to disk. 2019-06-29 16:11:52 +03:00