Currently, we have `on_link_find_callback`, which allows links to be rewritten by a user-supplied function. This has its uses, but a downside is that it only receives the URL as a parameter. There are scenarios where you can only tell whether a page should be crawled from the page content.
For example, I have a web crawler that I use for search engine indexing. While crawling, I have come across pages that are alternative Reddit frontends (e.g. libreddit). With my old crawler (built on Python's scrapy library) I had a callback that filtered pages based on CSS selectors, since libreddit frontends can be detected by a logo with specific characteristics. I don't want to crawl these sites because the amount of content is huge; only the site root is of use to me. A sketch of that kind of check is below.
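For illustration, here is a minimal sketch of such a content check using the `scraper` crate; the selector for the libreddit logo is a made-up placeholder, not the real markup:

```rust
use scraper::{Html, Selector};

/// Returns true if the page body looks like a libreddit frontend.
/// The selector here is hypothetical; a real check would match the
/// specific logo markup that libreddit instances actually share.
fn looks_like_libreddit(html: &str) -> bool {
    let document = Html::parse_document(html);
    let selector = Selector::parse("a#libreddit, header .logo").unwrap();
    document.select(&selector).next().is_some()
}
```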
With a new callback that takes the `Page` as a parameter, something like this becomes possible. It could be called `should_crawl_callback` or similar, return a `bool`, and allow or disallow crawling the links found on that page.
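Roughly what I have in mind (purely hypothetical; the name, the field, and the `get_html` accessor on `Page` are assumptions, mirroring how `on_link_find_callback` is set today):

```rust
use spider::page::Page;

/// Hypothetical `should_crawl_callback`: return `false` to skip the
/// links found on this page. `Page::get_html` is assumed here to
/// expose the fetched body.
fn should_crawl(page: &Page) -> bool {
    !looks_like_libreddit(&page.get_html())
}

// Proposed wiring, analogous to the existing `on_link_find_callback`:
//
//     let mut website = Website::new("https://example.com");
//     website.should_crawl_callback = Some(should_crawl);
//     website.crawl().await;
```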