We need to scrape a large number of pages. With 100 URLs the scrape percentage is about 60%, which is fine. But as the number of URLs grows, the percentage drops: with 400 URLs it is about 32%. At the start we create one browser instance, and then for each page we do the following (for context: we need a proxy for every page, and each time we pick a random proxy from the list):
1. Create the page with the `proxyServer` parameter set to the chosen proxy.
2. Block `Stylesheet`, `Image`, `Font` and `Media` requests.
3. Navigate to the URL and wait for `page.network.wait_for_idle`.
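For concreteness, here is a minimal sketch of that per-page flow, assuming Ferrum's `create_page(proxy:)` and interception APIs and a hypothetical `PROXIES` list; the option names may differ from our actual code:

```ruby
require "ferrum"

PROXIES = [{ host: "1.2.3.4", port: 8080 }].freeze # hypothetical proxy list
BLOCKED = %w[Stylesheet Image Font Media].freeze   # resource types we drop

def scrape(browser, url)
  # One page per URL in its own context, so the random proxy applies
  # only to this page (CDP's proxyServer parameter under the hood).
  page = browser.create_page(new_context: true, proxy: PROXIES.sample)

  # Block heavy resources before navigating.
  page.network.intercept
  page.on(:request) do |request|
    BLOCKED.include?(request.resource_type) ? request.abort : request.continue
  end

  page.go_to(url)
  # nil signals "not fully loaded" to the caller.
  return nil unless page.network.wait_for_idle(timeout: 20)
  page.body
ensure
  page&.close
end

browser = Ferrum::Browser.new(timeout: 30)
html = scrape(browser, "https://example.com")
```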
If any of the following errors occurs while this runs, we count the scrape as unsuccessful: `Ferrum::StatusError`, `Ferrum::PendingConnectionsError`, or `Ferrum::NoSuchTargetError`. We also count it as unsuccessful when the page can't be fully loaded, i.e. we wait too long for `page.network.wait_for_idle` to return success. This algorithm runs for every page.
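A sketch of how such a failure could be classified, reusing the hypothetical `scrape` helper from the sketch above; the wrapper and its names are illustrative only:

```ruby
# Error classes the post treats as a failed scrape.
FATAL_ERRORS = [
  Ferrum::StatusError,
  Ferrum::PendingConnectionsError,
  Ferrum::NoSuchTargetError
].freeze

def scrape_with_result(browser, url)
  html = scrape(browser, url)               # helper from the sketch above
  { url: url, ok: !html.nil?, html: html }  # nil html = wait_for_idle timed out
rescue *FATAL_ERRORS => e
  { url: url, ok: false, error: e.class.name }
end
```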
What can you suggest to improve the scrape percentage? Maybe we should create a context for each proxy in the list, then pick a random context, open a page in it, scrape, and close the page? Or should we go even further and create a browser per proxy, then pick a random browser, open a page in it, scrape, and close the page? Any suggestions? The last thing that comes to mind is to quit the browser after a certain number of scraped pages and then create a new one, so the scrape percentage stays at one level; a sketch of that idea follows below.
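A sketch of that last idea under the same assumptions, recycling the browser every N pages via the `scrape_with_result` helper above; the batch size is arbitrary and would need tuning:

```ruby
RECYCLE_EVERY = 50 # arbitrary; tune against the success-rate curve

def scrape_all(urls)
  results = []
  urls.each_slice(RECYCLE_EVERY) do |batch|
    browser = Ferrum::Browser.new(timeout: 30)
    begin
      batch.each { |url| results << scrape_with_result(browser, url) }
    ensure
      browser.quit # start the next batch with a clean Chrome
    end
  end
  results
end
```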
Also worth mentioning: after each run we have a lot of Chrome processes left hanging - maybe that is the problem and the code just spams Chrome processes? About RAM: with 100 URLs peak usage was 5.6 GB; with 400 URLs peak usage was 15.7 GB.
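If the hanging processes come from runs that exit before cleanup, one small safeguard is to register the quit on exit; a sketch that complements, rather than replaces, the per-batch `ensure`/`quit` above:

```ruby
browser = Ferrum::Browser.new(timeout: 30)

# Kill the spawned Chrome tree even if the script exits early.
at_exit { browser.quit rescue nil }
```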
Thanks.
@route
Replies: 1 comment

First off, turn off … I'd rather go with a fresh context which you dispose of at the end of the session. That's all from my POV and experience.
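A minimal sketch of that disposable-context pattern, reusing `browser`, `url`, and the hypothetical `PROXIES` list from the sketches above; `contexts.create` accepting a `proxy:` option is an assumption based on newer Ferrum versions:

```ruby
# One short-lived context per session: create, use, dispose.
context = browser.contexts.create(proxy: PROXIES.sample) # proxy option assumed
page = context.create_page
page.go_to(url)
html = page.body
context.dispose # discards the context's pages, cookies and cache in one call
```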