Currently, we have `on_link_find_callback`, which allows links to be rewritten by a user-supplied function. This has its uses, but a downside is that it only receives the URL as a parameter. There are scenarios where you can only tell whether a page should be crawled from the page content.
For example, I have a web crawler that I use for search engine indexing. While crawling, I have come across pages that are alternative Reddit frontends (e.g. libreddit). With my old crawler (built on Python's scrapy library) I had a callback that filtered pages based on CSS selectors, since libreddit frontends can be detected by a logo with specific characteristics. I don't want to crawl these sites because the amount of content is huge; only the site root is of use to me. A sketch of that kind of check is below.
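For illustration, here is a minimal sketch of such a content check using the `scraper` crate; the selector for the libreddit logo is a made-up placeholder, not the real markup:

```rust
use scraper::{Html, Selector};

/// Returns true if the page body looks like a libreddit frontend.
/// The selector here is hypothetical; a real check would match the
/// specific logo markup that libreddit instances actually share.
fn looks_like_libreddit(html: &str) -> bool {
    let document = Html::parse_document(html);
    let selector = Selector::parse("a#libreddit, header .logo").unwrap();
    document.select(&selector).next().is_some()
}
```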
With a new callback that takes the `Page` as a parameter, something like this becomes possible. It could be called `should_crawl_callback` or similar, return a `bool`, and allow or disallow crawling the links found on that page.
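Roughly what I have in mind (purely hypothetical; the name, the field, and the `get_html` accessor on `Page` are assumptions, mirroring how `on_link_find_callback` is set today):

```rust
use spider::page::Page;

/// Hypothetical `should_crawl_callback`: return `false` to skip the
/// links found on this page. `Page::get_html` is assumed here to
/// expose the fetched body.
fn should_crawl(page: &Page) -> bool {
    !looks_like_libreddit(&page.get_html())
}

// Proposed wiring, analogous to the existing `on_link_find_callback`:
//
//     let mut website = Website::new("https://example.com");
//     website.should_crawl_callback = Some(should_crawl);
//     website.crawl().await;
```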