# Web Scraper Analyzer

This is a Python-based web scraping and data analysis application built with Streamlit. It allows users to scrape data from websites using sitemap URLs, export the scraped data to CSV or JSON, analyze the data, and reset the database.

## Features

**Scrape Data:**
- Scrape data from a sitemap URL.
- Save the scraped data to an SQLite database.
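
Under the hood, the scraping step can be sketched roughly as follows. The helper names (`scrape_sitemap`, `save_pages`), the `pages` table, and the `scraper.db` filename are illustrative assumptions, not the actual API of `scraper.py`/`database.py`:

```python
import asyncio
import sqlite3

import aiohttp
from bs4 import BeautifulSoup

async def scrape_sitemap(sitemap_url: str, max_pages: int = 10) -> list[tuple[str, str]]:
    """Fetch the sitemap, then each listed page, collecting (url, title) pairs."""
    # A real scraper might also randomize the User-Agent, e.g. via fake-useragent.
    async with aiohttp.ClientSession() as session:
        async with session.get(sitemap_url) as resp:
            sitemap = await resp.text()
        # Sitemap entries are <loc> elements.
        locs = BeautifulSoup(sitemap, "html.parser").find_all("loc")
        rows = []
        for url in [loc.get_text(strip=True) for loc in locs][:max_pages]:
            async with session.get(url) as resp:
                html = await resp.text()
            title = BeautifulSoup(html, "html.parser").title
            rows.append((url, title.get_text(strip=True) if title else ""))
        return rows

def save_pages(rows: list[tuple[str, str]], db_path: str = "scraper.db") -> None:
    """Persist scraped rows into an SQLite 'pages' table, created on first use."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
        conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", rows)

save_pages(asyncio.run(scrape_sitemap("https://example.com/sitemap.xml", max_pages=5)))
```
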

**Export Data:**
- Export the scraped data to CSV or JSON.
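
A sketch of what `exporter.py` might do, assuming the same hypothetical `pages` table and an output filename of `scraped_data.<format>`:

```python
import sqlite3

import pandas as pd

def export_data(fmt: str = "csv", db_path: str = "scraper.db") -> str:
    """Dump the 'pages' table to CSV or JSON and return the output path."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query("SELECT * FROM pages", conn)
    out_path = f"scraped_data.{fmt}"  # output filename is an assumption
    if fmt == "csv":
        df.to_csv(out_path, index=False)
    else:
        df.to_json(out_path, orient="records", indent=2)
    return out_path
```
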

**Analyze Data:**
- Analyze the scraped data and visualize the number of pages per domain.
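
The per-domain count can be derived from the stored URLs; a sketch assuming the hypothetical `pages` table:

```python
import sqlite3
from urllib.parse import urlparse

import matplotlib.pyplot as plt
import pandas as pd

def pages_per_domain(db_path: str = "scraper.db"):
    """Group scraped URLs by domain and return a bar chart of the counts."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query("SELECT url FROM pages", conn)
    counts = df["url"].map(lambda u: urlparse(u).netloc).value_counts()
    fig, ax = plt.subplots()
    counts.plot.bar(ax=ax)
    ax.set_xlabel("Domain")
    ax.set_ylabel("Pages scraped")
    fig.tight_layout()
    return fig
```
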

**Reset Database:**
- Delete all scraped data from the database.
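
The reset itself can be a single SQL statement, again assuming the hypothetical `pages` table:

```python
import sqlite3

def reset_database(db_path: str = "scraper.db") -> None:
    """Delete every scraped row; the table itself is kept."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("DELETE FROM pages")
```
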

## Installation

Clone the repository:

    git clone https://github.com/ManishPJha/web-scraper-analyzer.git
    cd web-scraper-analyzer

Install the required dependencies:

    pip install -r requirements.txt

Run the Streamlit app:

    streamlit run streamlit_app.py

## Usage

**Scrape Data:**
- Enter a sitemap URL and specify the number of pages to scrape.
- Click **Start Scraping** to begin the scraping process.
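
The Streamlit wiring for this flow might look roughly like this, reusing the illustrative `scrape_sitemap`/`save_pages` helpers sketched in the Features section:

```python
import asyncio

import streamlit as st

sitemap_url = st.text_input("Sitemap URL", placeholder="https://example.com/sitemap.xml")
max_pages = st.number_input("Number of pages to scrape", min_value=1, value=10)

if st.button("Start Scraping") and sitemap_url:
    with st.spinner("Scraping..."):
        # scrape_sitemap/save_pages are the illustrative helpers from the scraping sketch.
        rows = asyncio.run(scrape_sitemap(sitemap_url, int(max_pages)))
        save_pages(rows)
    st.success(f"Scraped {len(rows)} pages.")
```
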

**Export Data:**
- Choose the export format (CSV or JSON).
- Click **Export Data** to save the scraped data to a file.
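
A plausible Streamlit pattern for this step, reusing the illustrative `export_data` helper from the export sketch:

```python
import streamlit as st

fmt = st.selectbox("Export format", ["csv", "json"])
if st.button("Export Data"):
    path = export_data(fmt)  # illustrative helper from the export sketch above
    with open(path, "rb") as f:
        st.download_button(label=f"Download {path}", data=f, file_name=path)
```
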

**Analyze Data:**
- View a bar chart showing the number of pages per domain.
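
Rendering the chart in the app is a single call, given the illustrative `pages_per_domain` helper from the analysis sketch:

```python
import streamlit as st

# pages_per_domain is the illustrative helper from the analysis sketch.
st.pyplot(pages_per_domain())
```
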

**Reset Database:**
- Click **Reset Database** to delete all scraped data.
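
And the reset control, using the illustrative `reset_database` helper:

```python
import streamlit as st

if st.button("Reset Database"):
    reset_database()  # illustrative helper from the reset sketch
    st.success("Database cleared.")
```
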

## Project Structure

    web-scraper-analyzer/
    ├── database.py          # Database operations
    ├── scraper.py           # Web scraping logic
    ├── exporter.py          # Data export logic
    ├── analyzer.py          # Data analysis logic
    ├── streamlit_app.py     # Main Streamlit app
    ├── requirements.txt     # List of dependencies
    └── README.md            # Project documentation

## Dependencies

    streamlit
    aiohttp
    beautifulsoup4
    fake-useragent
    pandas
    matplotlib
    python-dotenv

Note: `sqlite3` ships with the Python standard library and should not be listed in requirements.txt.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

## License

This project is licensed under the MIT License. See the LICENSE file for details.

## Author

- Manish Jha
- GitHub: ManishPJha
- Email: mjha205@rku.ac.in