Beginner: I'm new to scraping and being blocked
berstend edited this page Apr 26, 2021
This wiki aims to be a beginner-friendly entry point for understanding why your scraper might get blocked and how to mitigate it.
Note: This document is only relevant if you are running into issues. If your custom shell script loop using curl runs fine, that's great.
- The days when a plain HTTP client like curl was sufficient are long gone 😄
- It's easy for a site to use JS to gather or calculate some data and require it in their backend (sent as cookies, headers, or POST data)
- In addition, most sites are built with dynamic JS nowadays, so static HTML scraping won't get you far
- Solution: Switch to a scraping framework which uses a real browser (puppeteer, playwright)
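Switching to a real browser could look roughly like the following. This is a minimal sketch, assuming `puppeteer` is installed via npm; `fetchRenderedHtml` is a hypothetical helper name, and the `puppeteer` parameter exists only so a different launcher can be swapped in.

```javascript
// Sketch: load a page in a real (headless) Chromium and return the HTML
// as it looks after the site's client-side JavaScript has run.
// Assumption: `npm install puppeteer` has been run beforehand.
async function fetchRenderedHtml(url, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    // Wait until network activity has mostly settled, so JS-rendered
    // content has had a chance to appear in the DOM.
    await page.goto(url, { waitUntil: 'networkidle2' })
    return await page.content()
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close()
  }
}
```

A call such as `fetchRenderedHtml('https://example.com').then(html => console.log(html.length))` would print the size of the rendered document.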
- Selenium is the grandfather of browser-based scraping frameworks and leaks its presence in too many ways
- This applies to anything that is not a real browser as well: Scrapy's Splash, PhantomJS, Electron, CasperJS, etc.
- Solution: Don't use Selenium, use puppeteer or playwright
- By default the usage of puppeteer (in both headless and headful mode) can be detected by a site
- Solution: Use puppeteer-extra-plugin-stealth
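Wiring up the stealth plugin might look like the sketch below. It assumes `puppeteer`, `puppeteer-extra` and `puppeteer-extra-plugin-stealth` are installed via npm; `loadStealthPuppeteer` and `screenshotWithStealth` are hypothetical helper names used for illustration only.

```javascript
// Sketch: register the stealth plugin on puppeteer-extra, then use the
// result exactly like regular puppeteer.
// Assumption: npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
function loadStealthPuppeteer() {
  const puppeteer = require('puppeteer-extra')
  const StealthPlugin = require('puppeteer-extra-plugin-stealth')
  puppeteer.use(StealthPlugin())
  return puppeteer
}

// When scraping many pages, call loadStealthPuppeteer() once and pass the
// result in, instead of relying on the default for every call.
async function screenshotWithStealth(url, outputPath, puppeteer = loadStealthPuppeteer()) {
  const browser = await puppeteer.launch({ headless: true })
  try {
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle2' })
    await page.screenshot({ path: outputPath, fullPage: true })
  } finally {
    await browser.close()
  }
}
```

Pointing `screenshotWithStealth` at a bot-detection test page and comparing the screenshot with one taken by plain puppeteer is a quick way to see whether the plugin makes a difference for you.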
- Don't try to emulate another browser engine or device type (e.g. mobile) when using a desktop browser
- Don't use data that doesn't make sense (e.g. macOS platform with an Nvidia RTX 3080 GPU)
- Don't pretend to be the latest Chrome version (e.g. User-Agent) if you're not
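To make the "data that doesn't make sense" point concrete, here is a small sanity check you could run on the values your browser reports. It is only an illustration: `findFingerprintMismatches` is a hypothetical helper, and the three rules are simplified assumptions, not how any particular detection system works.

```javascript
// Sketch: flag obviously contradictory fingerprint values.
// The rules below are illustrative assumptions, not an exhaustive list.
function findFingerprintMismatches({ userAgent, platform, webglRenderer = '' }) {
  const problems = []

  // Rule 1: a mobile User-Agent reported from a desktop platform.
  const claimsMobile = /Android|iPhone|iPad|Mobile/.test(userAgent)
  const isDesktopPlatform = /^(Win32|MacIntel|Linux x86_64)$/.test(platform)
  if (claimsMobile && isDesktopPlatform) {
    problems.push('mobile User-Agent on a desktop platform')
  }

  // Rule 2: the OS named in the User-Agent should match navigator.platform.
  const osRules = [
    [/Windows NT/, /^Win/],
    [/Macintosh/, /^Mac/],
    [/X11; Linux/, /^Linux/],
  ]
  for (const [uaPattern, platformPattern] of osRules) {
    if (uaPattern.test(userAgent) && !platformPattern.test(platform)) {
      problems.push(`User-Agent OS does not match platform "${platform}"`)
    }
  }

  // Rule 3: the example from the list above, macOS with an Nvidia RTX GPU.
  if (/^Mac/.test(platform) && /NVIDIA.*RTX/i.test(webglRenderer)) {
    problems.push('macOS platform with an Nvidia RTX GPU')
  }

  return problems
}
```

In a page context the inputs would come from `navigator.userAgent`, `navigator.platform` and the WebGL renderer string. An empty result only means none of these three rules fired, not that the fingerprint is convincing.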
- Don't use free proxies from the internet, they are easily detected as such
- Don't use Tor: all exit nodes are publicly known, and the network is meant for people in need
- Don't use your home internet too often or you might experience rate-limiting or bans
- Don't use datacenter IPs or proxies, they can be detected as not being "residential"
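If you do end up routing traffic through a proxy you trust, Chromium accepts a `--proxy-server` launch flag, and puppeteer's `page.authenticate` can supply the proxy credentials (Chromium ignores credentials embedded in the flag itself). The following is a minimal sketch: the proxy URL is a placeholder, and `parseProxy` and `openPageViaProxy` are hypothetical helper names.

```javascript
// Sketch: split a proxy URL into the Chromium launch flag and the
// credentials that page.authenticate() expects.
function parseProxy(proxyUrl) {
  const { protocol, host, username, password } = new URL(proxyUrl)
  return {
    serverArg: `--proxy-server=${protocol}//${host}`,
    credentials: username
      ? { username: decodeURIComponent(username), password: decodeURIComponent(password) }
      : null,
  }
}

// Assumption: `npm install puppeteer` has been run beforehand.
// The caller is responsible for calling browser.close() when done.
async function openPageViaProxy(url, proxyUrl, puppeteer = require('puppeteer')) {
  const { serverArg, credentials } = parseProxy(proxyUrl)
  const browser = await puppeteer.launch({ args: [serverArg] })
  const page = await browser.newPage()
  if (credentials) {
    // Answers the proxy's authentication challenge for this page.
    await page.authenticate(credentials)
  }
  await page.goto(url)
  return { browser, page }
}
```

Keeping the URL parsing separate from the launch code makes it easy to check the flag and credentials without starting a browser.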
(TODO: Add more content here)