💻 (11-Aug-2023) See your Python code do web browsing on your screen with GUI.
Before you try to scrape any website, go through its robots.txt file. You can access it via domainname/robots.txt. There, you will see a list of pages allowed and disallowed for scraping. You should not violate any terms of service of any website you scrape.
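The robots.txt check can also be done in code before a crawl starts. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `is_allowed`, the sample rules, and the `example.com` URLs are illustrative assumptions, not part of this project.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/products"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))
```

Against a live site, the same parser can load the file itself with `parser.set_url("https://domainname/robots.txt")` followed by `parser.read()`. Note that robots.txt does not replace reading the site's terms of service.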
With Selenium, we're limited to a maximum of 10 concurrent sessions (reference).
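One possible way to stay under that session limit is a fixed-size worker pool. This is only a sketch: `crawl_site` is a hypothetical placeholder (a real version would open and close a Selenium session), and using threads rather than processes is an assumption, not necessarily how this project schedules its crawls.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_SESSIONS = 10  # the session limit stated above


def crawl_site(url: str) -> str:
    """Hypothetical placeholder: a real version would drive a Selenium session here."""
    return f"crawled {url}"


def crawl_all(urls, max_sessions: int = MAX_SESSIONS):
    """Crawl urls with at most max_sessions running at once; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_sessions) as pool:
        return list(pool.map(crawl_site, urls))


# Illustrative URLs only.
sites = [f"https://example.com/page/{i}" for i in range(25)]
results = crawl_all(sites)
print(len(results))
```

The pool never runs more than `max_sessions` workers at a time, so the remaining URLs wait in the queue until a session slot frees up.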
I've successfully tested 1000 site crawls in a single process (3 hours, 44 minutes, and 47 seconds).
Rounding that up to 4 hours per 1000 sites, one session covers about 2000 sites in an 8-hour night.
2000 sites * 10 parallel sessions = 20,000 sites
We're able to cover about 20,000 sites / night / machine.
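The capacity estimate can be reproduced with a few lines of arithmetic. This sketch assumes an 8-hour night and that the 10 sessions scale linearly with no slowdown, which is the same simplification the estimate makes; the function name `sites_per_night` is illustrative.

```python
def sites_per_night(sites_per_run: int, run_seconds: int,
                    night_seconds: int = 8 * 3600, sessions: int = 10) -> int:
    """Estimate sites crawled per night, assuming sessions scale linearly."""
    # Whole sites one session finishes in the night window (integer division).
    per_session = sites_per_run * night_seconds // run_seconds
    return per_session * sessions


ROUNDED_RUN = 4 * 3600                   # the measured run rounded up to 4 hours
MEASURED_RUN = 3 * 3600 + 44 * 60 + 47   # 3 h 44 min 47 s = 13487 seconds

print(sites_per_night(1000, ROUNDED_RUN))   # 20000
print(sites_per_night(1000, MEASURED_RUN))  # 21350
```

Using the measured time instead of the rounded 4 hours gives a slightly higher figure, so 20,000 is the conservative number.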
cp .env.example .env
pip3 install virtualenv && \
virtualenv env && \
source env/bin/activate
# chromedriver_mac64
# chromedriver_win32
# See https://chromedriver.storage.googleapis.com
# for drivers list.
wget https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
chromedriver --version
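The same driver check can be run from Python before a crawl starts. This is a small sketch using only the standard library; the helper name `driver_version` is an assumption for illustration and is not part of this project.

```python
import shutil
import subprocess
from typing import Optional


def driver_version(binary: str = "chromedriver") -> Optional[str]:
    """Return the output of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"],
                            capture_output=True, text=True, check=False)
    return result.stdout.strip()


print(driver_version() or "chromedriver not found on PATH")
```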
Update config.json with your real credentials.
Update the command at ./management/commands/crawl.py
alias py3="python3"
py3 manage.py crawl
# The app runs at `http://localhost:3000`.
If you still need help installing and running the app, check out the README at https://github.com/kkamara/python-react-boilerplate, which is the base system for this python-selenium app.
alias compose='docker-compose -f local.yml'
compose build
compose up
# Automated runs with Docker:
# compose up --build -d && python3 manage.py crawl
python3 manage.py shell -i ipython
python manage.py show_urls
View the API collection here.
Admin creds are set in ./compose/local/django/start
export DJANGO_SUPERUSER_PASSWORD=secret
python manage.py createsuperuser \
--username admin_user \
--email admin@django-app.com \
--no-input \
--first_name Admin \
--last_name User
py3 manage.py collectstatic
Mail environment credentials are at .env.
The MailHog Docker image runs at http://localhost:8025.
See the Amazon scraper (proven in a production environment).
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.