This project automates the process of scraping news articles from various sources using BeautifulSoup and Selenium, integrated into a Django application. It supports multiple websites and can run scraping tasks concurrently using threading. The data is stored in an Excel file and optionally in a MySQL database.
- Scrapes news articles from Hindustan Times, Hindustan Times Bangla, Zee News, TV9 Bangla, and Ananda Bazar.
- Concurrent scraping using threading with a delay between iterations.
- Supports both on-demand scraping and scheduled scraping tasks.
- Saves scraped data to Excel files and a MySQL database.
- Creates a new folder for data storage based on the current date.
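The threading and dated-folder features above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `make_dated_folder` and `run_scrapers`, the ISO date format for the folder name, and the shape of the scraper callables are all assumptions.

```python
import threading
import time
from datetime import date
from pathlib import Path


def make_dated_folder(base_dir, today=None):
    """Create (if missing) and return a folder named after the date.

    The YYYY-MM-DD naming is an assumption; the project may use another format.
    """
    today = today or date.today()
    folder = Path(base_dir) / today.isoformat()
    folder.mkdir(parents=True, exist_ok=True)
    return folder


def run_scrapers(scrapers, iterations=1, delay_seconds=0.0):
    """Run each scraper callable in its own thread, pausing between iterations.

    `scrapers` maps a site name to a zero-argument function returning a list
    of articles (hypothetical interface for this sketch).
    """
    results = {}
    lock = threading.Lock()

    def worker(name, func):
        articles = func()
        with lock:  # protect the shared results dict
            results.setdefault(name, []).extend(articles)

    for i in range(iterations):
        threads = [
            threading.Thread(target=worker, args=(name, func))
            for name, func in scrapers.items()
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # wait for every site before the next iteration
        if i < iterations - 1:
            time.sleep(delay_seconds)
    return results
```

Joining all threads before sleeping keeps each iteration self-contained, so a slow site delays the next round rather than overlapping with it.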
To get a local copy up and running, follow these steps.
- Python 3.6+
- Django 3.0+
- Selenium
- BeautifulSoup
- MySQL (for database storage)
- Clone the repository:
git clone https://github.com/ThisIs-Developer/News-Scraping-using-BeautifulSoup-Selenium-with-Django.git
- Navigate to the project directory:
cd News-Scraping-using-BeautifulSoup-Selenium-with-Django
- Install required Python packages:
pip install -r requirements.txt
- Set up the Django project:
python manage.py migrate
python manage.py createsuperuser
- Update the database configuration in settings.py if using MySQL.
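For reference, a MySQL entry in settings.py usually looks something like the fragment below. Every value here (database name, user, password, host, port) is a placeholder, not taken from this project, and Django's MySQL backend also needs a driver such as mysqlclient installed.

```python
# settings.py (fragment) -- all values are placeholders for illustration
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",  # Django's built-in MySQL backend
        "NAME": "news_scraper",       # hypothetical database name
        "USER": "scraper_user",       # hypothetical user
        "PASSWORD": "change-me",      # replace with your own password
        "HOST": "localhost",
        "PORT": "3306",               # MySQL's default port
    }
}
```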
Web scraping is a technique for extracting information from the internet automatically using software that simulates human web surfing.
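To make the idea concrete, here is a small sketch that pulls headlines out of a page. It uses Python's standard-library `html.parser` as a stand-in so it runs without extra packages; the project itself uses BeautifulSoup. The sample HTML and the assumption that headlines live in `<h2>` tags are made up for illustration.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headlines.append("")  # start a new headline

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.headlines[-1] += data  # includes text of nested tags like <a>


def extract_headlines(html):
    """Return the stripped text of all <h2> elements in the given HTML."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headlines]


# Made-up sample page; a real scraper would download this from a news site.
SAMPLE_HTML = """
<html><body>
  <h2><a href="/a">First headline</a></h2>
  <p>Summary text</p>
  <h2>Second headline</h2>
</body></html>
"""

print(extract_headlines(SAMPLE_HTML))
```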
Selenium is a free (open-source) automated testing framework used to validate web applications across different browsers and platforms. Because it drives a real browser, it can also automate other tasks, such as web scraping.
To install Selenium:
pip install selenium # (if pip belongs to your Python 3 environment)
pip3 install selenium # (if pip points to a different interpreter)
Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which needs to be installed before the examples below can be run. Note that the webdriver must be located in your PATH, e.g., place it in /usr/bin or /usr/local/bin.
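A quick way to check whether a driver is visible on your PATH is a few lines of Python using the standard library. The helper name `find_driver` is just for this sketch.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver not found on PATH")
else:
    print(f"geckodriver found at {path}")
```

Pass another name, such as `find_driver("chromedriver")`, to check a different browser's driver.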
Other supported browsers will have their own drivers available. Links to some of the more popular browser drivers are as follows:
- Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
- Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
- Firefox: https://github.com/mozilla/geckodriver/releases
- Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
For this project, I am using Chrome's webdriver called Chromedriver. There are multiple ways to install Chromedriver:
- Using webdriver-manager (recommended)
  - Install package:
    pip install webdriver-manager
  - Load package:
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    # Selenium 4 takes the driver path through a Service object;
    # on Selenium 3, pass it as executable_path=... instead
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
- Manual download from Chrome's website
  - Load package:
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    # Selenium 4 style; on Selenium 3, use executable_path='/path/to/chromedriver'
    driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
Run the Django development server:
python manage.py runserver
Navigate to the admin panel, configure the scraping tasks, and start the scraping process. The scraped data will be saved in the specified formats and locations.
- Initial release with scraping from Hindustan Times.
- Added scraping from Hindustan Times Bangla.
- Added scraping from Zee News.
- Added scraping from TV9 Bangla.
- Added scraping from Ananda Bazar.
- Dynamic scraping based on request value.
- Appending data to existing scraped_data.xlsx.
- Concurrent scraping with threading.
- Automatic folder creation based on current date.
- Integration with MySQL database and updated Django models.
Distributed under the Apache License 2.0. See LICENSE for more information.