We need to scrape a large number of pages. With 100 URLs the scrape percentage is about 60%, which is fine. But as the number of URLs grows, the percentage drops: with 400 URLs it is about 32%. At the start we create one browser instance, and then for each page we do the following (for context: we need a proxy for every page, and each time we pick a random proxy from the list):
1. Create the page with the `proxyServer` parameter set to the chosen proxy.
2. Block `Stylesheet`, `Image`, `Font` and `Media` requests.
3. Navigate to the URL and wait for `page.network.wait_for_idle`.
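For concreteness, here is a minimal sketch of that per-page flow, assuming Ferrum's `create_page(proxy:)` and interception APIs and a hypothetical `PROXIES` list; the option names may differ from our actual code:

```ruby
require "ferrum"

PROXIES = [{ host: "1.2.3.4", port: 8080 }].freeze # hypothetical proxy list
BLOCKED = %w[Stylesheet Image Font Media].freeze   # resource types we drop

def scrape(browser, url)
  # One page per URL in its own context, so the random proxy applies
  # only to this page (CDP's proxyServer parameter under the hood).
  page = browser.create_page(new_context: true, proxy: PROXIES.sample)

  # Block heavy resources before navigating.
  page.network.intercept
  page.on(:request) do |request|
    BLOCKED.include?(request.resource_type) ? request.abort : request.continue
  end

  page.go_to(url)
  # nil signals "not fully loaded" to the caller.
  return nil unless page.network.wait_for_idle(timeout: 20)
  page.body
ensure
  page&.close
end

browser = Ferrum::Browser.new(timeout: 30)
html = scrape(browser, "https://example.com")
```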
If any of the following errors occurs while this runs, we count the scrape as unsuccessful: `Ferrum::StatusError`, `Ferrum::PendingConnectionsError`, or `Ferrum::NoSuchTargetError`. We also count it as unsuccessful when the page can't be fully loaded, i.e. we wait too long for `page.network.wait_for_idle` to return success. This algorithm runs for every page.
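A sketch of how such a failure could be classified, reusing the hypothetical `scrape` helper from the sketch above; the wrapper and its names are illustrative only:

```ruby
# Error classes the post treats as a failed scrape.
FATAL_ERRORS = [
  Ferrum::StatusError,
  Ferrum::PendingConnectionsError,
  Ferrum::NoSuchTargetError
].freeze

def scrape_with_result(browser, url)
  html = scrape(browser, url)               # helper from the sketch above
  { url: url, ok: !html.nil?, html: html }  # nil html = wait_for_idle timed out
rescue *FATAL_ERRORS => e
  { url: url, ok: false, error: e.class.name }
end
```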
What can you suggest to improve the scrape percentage? Maybe we should create a context for each proxy in the list, then pick a random context, open a page in it, scrape, and close the page? Or should we go even further and create a browser per proxy, then pick a random browser, open a page in it, scrape, and close the page? Any suggestions? The last thing that comes to mind is to quit the browser after a certain number of scraped pages and then create a new one, so the scrape percentage stays at one level; a sketch of that idea follows below.
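A sketch of that last idea under the same assumptions, recycling the browser every N pages via the `scrape_with_result` helper above; the batch size is arbitrary and would need tuning:

```ruby
RECYCLE_EVERY = 50 # arbitrary; tune against the success-rate curve

def scrape_all(urls)
  results = []
  urls.each_slice(RECYCLE_EVERY) do |batch|
    browser = Ferrum::Browser.new(timeout: 30)
    begin
      batch.each { |url| results << scrape_with_result(browser, url) }
    ensure
      browser.quit # start the next batch with a clean Chrome
    end
  end
  results
end
```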
Also worth mentioning: after each run we have a lot of Chrome processes left hanging - maybe that is the problem and the code just spams Chrome processes? About RAM: with 100 URLs peak usage was 5.6 GB; with 400 URLs peak usage was 15.7 GB.
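If the hanging processes come from runs that exit before cleanup, one small safeguard is to register the quit on exit; a sketch that complements, rather than replaces, the per-batch `ensure`/`quit` above:

```ruby
browser = Ferrum::Browser.new(timeout: 30)

# Kill the spawned Chrome tree even if the script exits early.
at_exit { browser.quit rescue nil }
```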
Thanks.
@route
Replies: 1 comment

First off, turn off … I'd rather go with a fresh context which you dispose of at the end of the session. That's all from my POV and experience.
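A minimal sketch of that disposable-context pattern, reusing `browser`, `url`, and the hypothetical `PROXIES` list from the sketches above; `contexts.create` accepting a `proxy:` option is an assumption based on newer Ferrum versions:

```ruby
# One short-lived context per session: create, use, dispose.
context = browser.contexts.create(proxy: PROXIES.sample) # proxy option assumed
page = context.create_page
page.go_to(url)
html = page.body
context.dispose # discards the context's pages, cookies and cache in one call
```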