Add callback with access to spider::page::Page #241

Closed
shroom00 opened this issue Jan 4, 2025 · 3 comments

shroom00 commented Jan 4, 2025

Currently there is on_link_find_callback, which lets a function rewrite links as they are found. That is useful, but the callback only receives the URL as a parameter. In some scenarios you can only tell whether a page should be crawled by looking at its content.

For example, I have a web crawler that I use for search engine indexing. While crawling, I have come across alternative Reddit frontends (e.g. libreddit). With my old crawler (built on Python's Scrapy library), I had a callback that filtered pages using CSS selectors (libreddit frontends can be detected by a logo with specific characteristics). I don't want to crawl these websites because the amount of content is huge; only the website root is of use to me.

A new callback that takes the Page as a parameter would make this possible. It could be called should_crawl_callback or something similar, and return a bool that allows or disallows crawling the links etc. found on that page.
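
As a rough sketch of the idea only: the types below are stand-ins, not spider's real API. `Page`, `ShouldCrawl`, `should_crawl`, and `links_to_follow` are all hypothetical names, and the substring check is a placeholder for a proper CSS-selector test on the logo.

```rust
// Hypothetical stand-in for spider::page::Page; the real type differs.
struct Page {
    html: String,
}

// Hypothetical callback type: return false to stop following links from this page.
type ShouldCrawl = fn(&Page) -> bool;

// Placeholder heuristic (an assumption for illustration): treat any page whose
// markup mentions "libreddit" as an alternative frontend.
fn should_crawl(page: &Page) -> bool {
    !page.html.to_ascii_lowercase().contains("libreddit")
}

// The page itself has already been fetched; the callback only decides whether
// the links discovered on it are queued for crawling.
fn links_to_follow(page: &Page, links: Vec<String>, cb: Option<ShouldCrawl>) -> Vec<String> {
    match cb {
        // Callback rejected the page: drop every link found on it.
        Some(f) if !f(page) => Vec::new(),
        // No callback set, or callback approved: keep the links unchanged.
        _ => links,
    }
}
```

With this shape, the root page of a detected frontend would still be available for indexing, but none of its links would be queued.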

j-mendez (Member) commented Jan 4, 2025

released via 2.24.0 with website.on_should_crawl_callback, thank you!

j-mendez closed this as completed Jan 4, 2025

shroom00 (Author) commented Jan 4, 2025

I think you forgot to add the setter method! :)

j-mendez (Member) commented Jan 5, 2025

> I think you forgot to add the setter method! :)

released in 2.24.1 ty!
