Website subscriber implementation details, regarding on_should_crawl_callback #242

Closed
shroom00 opened this issue Jan 4, 2025 · 2 comments

Comments


shroom00 commented Jan 4, 2025

Re: #241

Let's say I don't want to visit pages that match some condition, but I still want to visit the root regardless.
If on_should_crawl_callback rejects a page, no links on it are visited, which saves bandwidth. However, the page is still sent to the website subscriber, and there's no way to tell at that point whether it was previously rejected by the callback. This means any different processing you want for rejected pages isn't possible unless you execute the callback again to determine whether the page is allowed.
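For illustration, the workaround described above (running the check a second time on the subscriber side) can be sketched as follows. This is a minimal, self-contained model and not the library's actual API: the `Page` struct, the `should_crawl` predicate, and the `classify` helper are hypothetical stand-ins, and a standard-library channel plays the role of the subscription.

```rust
use std::sync::mpsc;

/// Hypothetical stand-in for the crawler's page type (an assumption,
/// not the library's real `Page`).
#[derive(Debug, Clone)]
struct Page {
    url: String,
}

/// Hypothetical predicate playing the role of the should-crawl callback:
/// reject anything under /private, allow everything else (including the root).
fn should_crawl(page: &Page) -> bool {
    !page.url.contains("/private")
}

/// Subscriber side: the page carries no record of the earlier decision,
/// so the predicate has to be executed again for every received page.
fn classify(pages: mpsc::Receiver<Page>) -> Vec<(String, bool)> {
    pages
        .into_iter()
        .map(|p| {
            let allowed = should_crawl(&p);
            (p.url, allowed)
        })
        .collect()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for url in ["https://example.com/", "https://example.com/private/a"] {
        let page = Page { url: url.to_string() };
        // Crawler side: the predicate decides whether links are followed...
        let _follow_links = should_crawl(&page);
        // ...but the page is sent to the subscriber either way.
        tx.send(page).unwrap();
    }
    drop(tx); // close the channel so the receiver's iterator terminates

    for (url, allowed) in classify(rx) {
        println!("{url} allowed={allowed}");
    }
}
```

In this model the predicate runs twice per page, once on each side of the channel, which is the duplicated work the issue is about.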

There are two ways I can think of solving this issue.

  1. Don't send rejected pages to the subscriber. This means you can specifically allow the root pages, for example, and only they will be sent to the subscriber. However, links found on the allowed pages will still be visited, using more bandwidth.
  2. Add an attribute to the Page struct indicating whether it was previously rejected by the callback. This way you can handle different processing steps in the subscriber without using more bandwidth visiting found links. This comes at the cost of a slightly more awkward/unintuitive implementation for the developer.

I personally prefer option 2, provided it's made clear that all pages are sent to subscribers and which attribute to check to see whether a page was rejected. I'm open to either, though, and would be happy to discuss it.

Member

j-mendez commented Jan 4, 2025

Option 2 is ideal. Will add this soon.

j-mendez added a commit that referenced this issue Jan 5, 2025
Member

j-mendez commented Jan 5, 2025

Released in v2.24.1 as page.blocked_crawl. Thanks!
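As a rough sketch of how a subscriber might branch on this flag: the code below uses a stand-in `Page` struct and does not call the library. Only the field name `blocked_crawl` comes from the release note; its `bool` type, the `Handling` enum, and the `handle` function are assumptions made for illustration.

```rust
/// Stand-in for the crawler's page type. Only the field name `blocked_crawl`
/// is taken from the release note; its `bool` type is an assumption.
#[derive(Debug)]
struct Page {
    url: String,
    blocked_crawl: bool,
}

/// Hypothetical processing modes a subscriber might choose between.
#[derive(Debug, PartialEq)]
enum Handling {
    /// The callback allowed the page, so its links were followed.
    Full,
    /// The callback rejected the page, so its links were not followed.
    Limited,
}

/// Hypothetical subscriber-side handler: read the flag instead of
/// executing the should-crawl callback a second time.
fn handle(page: &Page) -> Handling {
    if page.blocked_crawl {
        Handling::Limited
    } else {
        Handling::Full
    }
}

fn main() {
    let pages = vec![
        Page { url: "https://example.com/".to_string(), blocked_crawl: false },
        Page { url: "https://example.com/private/a".to_string(), blocked_crawl: true },
    ];
    for page in &pages {
        println!("{} -> {:?}", page.url, handle(page));
    }
}
```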

j-mendez closed this as completed Jan 5, 2025