123 Commits

Author SHA1 Message Date
Musab Gültekin
d28beca57a Fix race condition on hosts semaphore 2021-04-17 14:46:45 +03:00
Musab Gültekin
c527d0b885 SIGINT (interrupt) signal handling refactored and fixed for some conditions where it did not work 2021-04-17 14:11:17 +03:00
Musab Gültekin
6a23efd175 JoinURL now returns *url.URL and error 2021-04-17 11:12:22 +03:00
Musab Gültekin
9ea67b3554 Use fmt.Errorf instead of errors package. This is a good convention since Go 1.13 2021-04-17 11:11:29 +03:00
Musab Gültekin
fbee722a38 Rate limiting per second implemented 2021-04-16 15:31:31 +03:00
Musab Gültekin
d8252092f7 Add duplicate_requests_test.go 2021-04-16 14:43:42 +03:00
Musab Gültekin
be4d13c0ef Retry checking refactored using util function. 2021-04-14 09:32:42 +03:00
Musab Gültekin
46c4db6b1a Exporters now need to return error. This is done to simplify error logging. 2021-04-14 09:30:17 +03:00
Musab Gültekin
e3d79e2574 Added custom logger. Right now, not configurable. 2021-04-13 23:36:42 +03:00
Musab Gültekin
129402d754 Updated chromedp 2021-01-28 20:50:25 +03:00
Musab Gültekin
9b266b6cce Allocator options added 2021-01-28 20:49:01 +03:00
Musab Gültekin
29c29235ae Fixed response error if retrying disabled 2020-09-05 17:24:22 +03:00
Musab Gültekin
7a76a9b95e Allocators separated for transparency. Updated chrome library. 2020-09-05 16:14:41 +03:00
Musab Gültekin
cfb16fe1ee Call ErrorFunc on errors. Unexport DoRequestClient and DoRequestChrome 2019-12-13 00:03:44 +03:00
Musab Gültekin
7d2fe57bab Added error logging for HTML parser. 2019-12-11 13:55:38 +03:00
Musab Gültekin
cbca22fefb Updated chrome protocol library 2019-11-16 20:34:57 +03:00
Musab Gültekin
6645820408 Added logging on allowed domains middleware and duplicate requests 2019-11-16 20:34:09 +03:00
Musab Gültekin
9b8a3837bd Added response joinURL test and updated chromedp. 2019-09-13 14:34:29 +03:00
Musab Gültekin
3264057679 Fixed issue on JoinURL 2019-08-06 17:21:41 +03:00
Musab Gültekin
86d4e80596 Added user-agent test, Fixed failing test 2019-08-05 16:18:44 +03:00
Musab Gültekin
85597219e6 Refactored client options
Fixed default User-Agent string not being set.
2019-08-05 15:42:30 +03:00
Musab Gültekin
0e5230eac8 Remote endpoint support added for js rendered requests. Geziyor is beta now. 2019-08-05 15:14:47 +03:00
Musab Gültekin
c117d71fef Updated license 2019-08-05 15:01:48 +03:00
Musab Gültekin
32077d8433 Updated docs for rendered requests 2019-07-26 16:40:42 +03:00
Musab Gültekin
e07ef4d66d Fixed an important rendering bug that caused a client request to be made as well. Updated chromedp dependency 2019-07-26 16:07:09 +03:00
Musab Gültekin
762854e511 Go 1.10 and 1.11 support added by using different methods on reflect package. 2019-07-21 12:08:41 +03:00
Musab Gültekin
df37629d4d Disabled indenting on JSON exporter as it looks so ugly on exported data.
JSONLine still supports indenting.
2019-07-14 03:37:52 +03:00
Musab Gültekin
dfabcb84fd JSON renamed to JSONLine. JSON List support added. 2019-07-14 03:30:59 +03:00
Musab Gültekin
d19465c44a Robotstxt metrics added. 2019-07-08 14:51:54 +03:00
Musab Gültekin
d3c4389c46 Retrying support added for chrome. Fixed robots.txt retry issue. Fixed Meta issue 2019-07-07 19:50:15 +03:00
Musab Gültekin
90d2be2210 Caching policies added.
We used httpcache library to implement this. As it was not possible to support different policies, I mostly copied and modified it.
2019-07-07 12:18:40 +03:00
Musab Gültekin
0d6c2a6864 Graceful shut down system implemented 2019-07-06 18:32:13 +03:00
Musab Gültekin
42faa92ece Robots.txt support implemented 2019-07-06 16:18:03 +03:00
Musab Gültekin
2cab68d2ce Middlewares refactored to multiple files in middleware package.
Extractors removed as they introduce complexity to the scraper, both in learning and developing.
2019-07-04 21:04:29 +03:00
Musab Gültekin
9adff75509 Retry requests support implemented for client. 2019-07-04 13:36:10 +03:00
Musab Gültekin
da03567fae Extractors refactored to support pass by value. Documentation added for request and response. 2019-07-04 02:13:29 +03:00
Musab Gültekin
71683ec6de Chardet removed as it's not good enough at detection. Built-in library is good enough. 2019-07-03 20:54:17 +03:00
Musab Gültekin
33238bc875 Charset detection heuristics added with chardet lib. 2019-07-03 18:08:28 +03:00
Musab Gültekin
b355a566cf Added more tests and refactored exporter tests. Added code coverage badge. 2019-07-02 14:53:06 +03:00
Musab Gültekin
4ab7cfd904 Exporter and Extractor interfaces moved to their own packages for simplicity of main Geziyor package 2019-07-02 13:22:23 +03:00
Musab Gültekin
c0dd0393e6 Maximum redirection option added. Performance improvement on exports. Duplicate requests only checked on GET requests. 2019-07-01 15:44:28 +03:00
Musab Gültekin
80f3500a69 Fixed incorrect Chrome response on some sites. 2019-07-01 12:32:15 +03:00
Musab Gültekin
fb5b4e3406 README updated according to new package names 2019-06-30 22:21:36 +03:00
Musab Gültekin
0eda056065 Attribute extractor added. HTML extractor added. Outer HTML Extractor added.
exporter package renamed to export, extractor package renamed to extract for simplicity.
2019-06-30 22:20:17 +03:00
Musab Gültekin
7c383b175f Metrics Server support added for expvar. Refactored some methods. 2019-06-30 19:09:03 +03:00
Musab Gültekin
ec4551a8a0 Making Requests and reading responses refactored to client package. 2019-06-30 16:21:18 +03:00
Musab Gültekin
0eac5f5f40 Fixed exporters bug that was causing last exported items not written to disk. 2019-06-29 16:11:52 +03:00
Musab Gültekin
bd6466a5f2 http package renamed to client to reduce confusion 2019-06-29 14:18:31 +03:00
Musab Gültekin
1e109c555d Request and response moved to http package 2019-06-29 13:36:39 +03:00
Musab Gültekin
59757607eb Pretty print exporter added. Panic counter added to metrics 2019-06-29 11:20:06 +03:00