This project aims to develop a tool that lets a user block specific content from their feed. This content may include words or phrases, competitor ads, irritating pop-ups, etc. It also explores web scraping and its possibilities. The project started as webinar material for M.Sc. Data Analytics students.
Collecting data from websites using an automated process is known as web scraping.
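As a rough illustration of the idea, the sketch below extracts paragraph text from an HTML string with Python's standard `html.parser` module and then drops any paragraph containing a blocked phrase. The names `ParagraphCollector`, `filter_blocked`, and `SAMPLE_HTML` are hypothetical and exist only for this example; the sample markup stands in for a page that would normally be downloaded first.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self._in_paragraph:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_paragraph = False


def filter_blocked(items, blocked_phrases):
    """Keep only items containing none of the blocked phrases (case-insensitive)."""
    lowered = [phrase.lower() for phrase in blocked_phrases]
    return [item for item in items if not any(p in item.lower() for p in lowered)]


# Hypothetical feed markup, standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <p>Local team wins the championship</p>
  <p>Sponsored: buy one get one free</p>
  <p>New library release announced</p>
</body></html>
"""

parser = ParagraphCollector()
parser.feed(SAMPLE_HTML)
visible = filter_blocked(parser.paragraphs, ["sponsored"])
print(visible)
```

Matching on lowercase substrings is the simplest possible rule; a real blocker would probably need word boundaries or a classifier, which this sketch does not attempt.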
Selenium Documentation: https://selenium-python.readthedocs.io/
Installation instructions for Selenium:
1) pip install selenium
2) Download the Chrome WebDriver (ChromeDriver) from "https://sites.google.com/chromium.org/driver/downloads?authuser=0"
Download the driver version that matches your installed Chrome version. My Google Chrome version is: 'Version 99.0.4844.74 (Official Build) (arm64)'
If you are using a MacBook, you need to unquarantine chromedriver before using it for the first time. Open a terminal window in the directory where you keep chromedriver and run:
xattr -d com.apple.quarantine chromedriver
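Once Selenium and chromedriver are in place, a first scraping session might look like the minimal sketch below. It assumes Selenium 4 and a chromedriver binary at `./chromedriver`; that path, the URL `https://example.com`, and the helper names `make_driver`, `paragraph_texts`, and `main` are placeholders for illustration, not part of this project.

```python
# Locator strategy string; this is the same value as selenium's By.TAG_NAME.
TAG_NAME = "tag name"


def make_driver(driver_path="./chromedriver"):
    """Start Chrome through the chromedriver binary at driver_path (assumed path)."""
    # Imported here so the module can be loaded even before selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(driver_path))


def paragraph_texts(driver, url):
    """Open url and return the visible, non-empty text of every <p> element."""
    driver.get(url)
    elements = driver.find_elements(TAG_NAME, "p")
    return [element.text for element in elements if element.text.strip()]


def main():
    driver = make_driver()
    try:
        # Placeholder URL; replace with the page you want to scrape.
        for text in paragraph_texts(driver, "https://example.com"):
            print(text)
    finally:
        # Always close the browser, even if scraping fails.
        driver.quit()
```

Call `main()` after adjusting the driver path and URL. Keeping `paragraph_texts` separate from driver creation makes it possible to reuse one browser session across several pages.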
For accessing APIs, use Postman, which can be downloaded from: https://www.postman.com/downloads/
To access a Jupyter Notebook instance online: https://jupyter.org/try
References:
- https://realpython.com/python-web-scraping-practical-introduction/
- https://www.youtube.com/watch?v=Xjv1sY630Uc&list=PLzMcBGfZo4-n40rB1XaJ0ak1bemvlqumQ
- https://www.analyticsvidhya.com/blog/2021/12/text-classification-of-news-articles/
- https://www.kaggle.com/c/learn-ai-bbc/data
- https://github.com/DedSecInside/TorBot