Let's say I don't want to visit pages that match some condition, but I still want to visit the root, regardless.
If `on_should_crawl_callback` stops a page from being crawled further, no links on it are visited, which saves bandwidth. However, when the page is sent to the website subscriber, there's no way to tell whether it was previously rejected by the callback. This means any different processing you want for rejected pages isn't possible unless you run the callback again to determine whether the page is allowed.
There are two ways I can think of to solve this issue.
1. Don't send rejected pages to the subscriber. This means you can specifically allow the root pages, for example, and only they will be sent to the subscriber. However, it means that links found on the page will be visited, using more bandwidth.
2. Add some attribute to the `Page` struct, indicating whether it was previously rejected by the callback. This way you can handle different processing steps in the subscriber without using more bandwidth visiting found links. This comes at the cost of a slightly more awkward/unintuitive implementation for the developer.
I personally prefer option 2, assuming it's made clear that all pages are sent to subscribers, noting the attribute to check if they were rejected. I'm open to either though, and would be happy to discuss it.
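To make option 2 concrete, here is a rough sketch of how it could look. Everything in it is hypothetical: the `Page` struct below is a simplified stand-in rather than the crate's real type, and the `rejected_by_callback` field, `should_crawl`, `process`, and `handle` are names I made up for illustration.

```rust
// Hypothetical stand-in for the crawler's page type (not the real struct).
#[derive(Debug, Clone, PartialEq)]
struct Page {
    url: String,
    links: Vec<String>,
    /// Assumed new attribute: true when the should-crawl callback rejected this page.
    rejected_by_callback: bool,
}

/// Example should-crawl predicate: always allow the root,
/// reject any other URL containing "/private".
fn should_crawl(root: &str, url: &str) -> bool {
    url == root || !url.contains("/private")
}

/// Tags the page with the callback result and returns the links to follow.
/// Rejected pages yield no links, so no bandwidth is spent on them.
fn process(root: &str, mut page: Page) -> (Page, Vec<String>) {
    page.rejected_by_callback = !should_crawl(root, &page.url);
    let to_follow = if page.rejected_by_callback {
        Vec::new()
    } else {
        page.links.clone()
    };
    (page, to_follow)
}

/// Subscriber-side handling: every page arrives, and the flag picks the branch.
fn handle(page: &Page) -> &'static str {
    if page.rejected_by_callback {
        "rejected"
    } else {
        "allowed"
    }
}
```

With something like this, the subscriber never has to re-run the callback: it receives every page and only checks the flag.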
Re: #241