# Web Scraper Analyzer

This is a Python-based web scraping and data analysis application built with Streamlit. It allows users to scrape data from websites using sitemap URLs, export the scraped data to CSV or JSON, analyze the data, and reset the database.

## Features

**Scrape Data:**
- Scrape data from a sitemap URL.
- Save the scraped data to an SQLite database.
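
Under the hood, the scraping step can be sketched roughly as follows. The helper names (`scrape_sitemap`, `save_pages`), the `pages` table, and the `scraper.db` filename are illustrative assumptions, not the actual API of `scraper.py`/`database.py`:

```python
import asyncio
import sqlite3

import aiohttp
from bs4 import BeautifulSoup

async def scrape_sitemap(sitemap_url: str, max_pages: int = 10) -> list[tuple[str, str]]:
    """Fetch the sitemap, then each listed page, collecting (url, title) pairs."""
    # A real scraper might also randomize the User-Agent, e.g. via fake-useragent.
    async with aiohttp.ClientSession() as session:
        async with session.get(sitemap_url) as resp:
            sitemap = await resp.text()
        # Sitemap entries are <loc> elements.
        locs = BeautifulSoup(sitemap, "html.parser").find_all("loc")
        rows = []
        for url in [loc.get_text(strip=True) for loc in locs][:max_pages]:
            async with session.get(url) as resp:
                html = await resp.text()
            title = BeautifulSoup(html, "html.parser").title
            rows.append((url, title.get_text(strip=True) if title else ""))
        return rows

def save_pages(rows: list[tuple[str, str]], db_path: str = "scraper.db") -> None:
    """Persist scraped rows into an SQLite 'pages' table, created on first use."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
        conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", rows)

save_pages(asyncio.run(scrape_sitemap("https://example.com/sitemap.xml", max_pages=5)))
```
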

**Export Data:**
- Export the scraped data to CSV or JSON.
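
A sketch of what `exporter.py` might do, assuming the same hypothetical `pages` table and an output filename of `scraped_data.<format>`:

```python
import sqlite3

import pandas as pd

def export_data(fmt: str = "csv", db_path: str = "scraper.db") -> str:
    """Dump the 'pages' table to CSV or JSON and return the output path."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query("SELECT * FROM pages", conn)
    out_path = f"scraped_data.{fmt}"  # output filename is an assumption
    if fmt == "csv":
        df.to_csv(out_path, index=False)
    else:
        df.to_json(out_path, orient="records", indent=2)
    return out_path
```
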

**Analyze Data:**
- Analyze the scraped data and visualize the number of pages per domain.
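
The per-domain count can be derived from the stored URLs; a sketch assuming the hypothetical `pages` table:

```python
import sqlite3
from urllib.parse import urlparse

import matplotlib.pyplot as plt
import pandas as pd

def pages_per_domain(db_path: str = "scraper.db"):
    """Group scraped URLs by domain and return a bar chart of the counts."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query("SELECT url FROM pages", conn)
    counts = df["url"].map(lambda u: urlparse(u).netloc).value_counts()
    fig, ax = plt.subplots()
    counts.plot.bar(ax=ax)
    ax.set_xlabel("Domain")
    ax.set_ylabel("Pages scraped")
    fig.tight_layout()
    return fig
```
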

**Reset Database:**
- Delete all scraped data from the database.
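
The reset itself can be a single SQL statement, again assuming the hypothetical `pages` table:

```python
import sqlite3

def reset_database(db_path: str = "scraper.db") -> None:
    """Delete every scraped row; the table itself is kept."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("DELETE FROM pages")
```
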

## Installation

Clone the repository:

    git clone https://github.com/ManishPJha/web-scraper-analyzer.git
    cd web-scraper-analyzer

Install the required dependencies:

    pip install -r requirements.txt

Run the Streamlit app:

    streamlit run streamlit_app.py

## Usage

**Scrape Data:**
- Enter a sitemap URL and specify the number of pages to scrape.
- Click **Start Scraping** to begin the scraping process.
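
The Streamlit wiring for this flow might look roughly like this, reusing the illustrative `scrape_sitemap`/`save_pages` helpers sketched in the Features section:

```python
import asyncio

import streamlit as st

sitemap_url = st.text_input("Sitemap URL", placeholder="https://example.com/sitemap.xml")
max_pages = st.number_input("Number of pages to scrape", min_value=1, value=10)

if st.button("Start Scraping") and sitemap_url:
    with st.spinner("Scraping..."):
        # scrape_sitemap/save_pages are the illustrative helpers from the scraping sketch.
        rows = asyncio.run(scrape_sitemap(sitemap_url, int(max_pages)))
        save_pages(rows)
    st.success(f"Scraped {len(rows)} pages.")
```
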

**Export Data:**
- Choose the export format (CSV or JSON).
- Click **Export Data** to save the scraped data to a file.
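
A plausible Streamlit pattern for this step, reusing the illustrative `export_data` helper from the export sketch:

```python
import streamlit as st

fmt = st.selectbox("Export format", ["csv", "json"])
if st.button("Export Data"):
    path = export_data(fmt)  # illustrative helper from the export sketch above
    with open(path, "rb") as f:
        st.download_button(label=f"Download {path}", data=f, file_name=path)
```
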

**Analyze Data:**
- View a bar chart showing the number of pages per domain.
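
Rendering the chart in the app is a single call, given the illustrative `pages_per_domain` helper from the analysis sketch:

```python
import streamlit as st

# pages_per_domain is the illustrative helper from the analysis sketch.
st.pyplot(pages_per_domain())
```
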

**Reset Database:**
- Click **Reset Database** to delete all scraped data.
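
And the reset control, using the illustrative `reset_database` helper:

```python
import streamlit as st

if st.button("Reset Database"):
    reset_database()  # illustrative helper from the reset sketch
    st.success("Database cleared.")
```
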

## Project Structure

    web-scraper-analyzer/
    ├── database.py          # Database operations
    ├── scraper.py           # Web scraping logic
    ├── exporter.py          # Data export logic
    ├── analyzer.py          # Data analysis logic
    ├── streamlit_app.py     # Main Streamlit app
    ├── requirements.txt     # List of dependencies
    └── README.md            # Project documentation

## Dependencies

    streamlit
    aiohttp
    beautifulsoup4
    fake-useragent
    pandas
    matplotlib
    python-dotenv

Note: `sqlite3` ships with the Python standard library and should not be listed in requirements.txt.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

## License

This project is licensed under the MIT License. See the LICENSE file for details.

## Author

- Manish Jha
- GitHub: ManishPJha
- Email: mjha205@rku.ac.in