This Python script is a data pipeline that scrapes product data from Walmart.com for a specified keyword and saves it to an S3 bucket as a CSV file. The data is then used to create reports and visualizations in Power BI.
Before running the script, make sure the following requirements are met. Certain changes also need to be made in the settings.py file.
- Python 3.6 or later
- Scrapy Framework
- ScrapeOps Proxy Rotator: sign up for a free trial of ScrapeOps Proxy Rotator and get your API key
- Botocore
- An AWS S3 bucket (create one if you do not have it yet)
- Clone the repository using bash:
git clone https://github.com/usmananwaar-de/walmart-scraper
- Navigate to the cloned directory in the command line.
- Create a virtual environment by running the command (For Windows):
python -m venv [venv name]
- Activate the virtual environment by running the command:
[venv name]\Scripts\activate
- Use the cd command to go into the spiders folder.
- Install the required libraries by running the command:
pip install -r requirements.txt
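Once the requirements are installed, a quick check along these lines could confirm that the environment matches the prerequisites listed above. This is only a sketch: the helper name check_environment is hypothetical and not part of the repository, and it only tests that the Python version is 3.6 or later and that the scrapy and botocore packages can be imported.

```python
import importlib.util
import sys

# Packages named in the requirements list; adjust if your setup differs.
REQUIRED_MODULES = ("scrapy", "botocore")
MIN_PYTHON = (3, 6)


def check_environment(modules=REQUIRED_MODULES, min_python=MIN_PYTHON):
    """Return a list of problems found; an empty list means everything is in place."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or later is required" % min_python)
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(name) is None:
            problems.append("Missing package: " + name)
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running it inside the activated virtual environment prints one line per missing prerequisite and nothing when the setup is complete.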
Before running the script, the following changes need to be made in the settings.py file:
- Open the settings.py file and replace the YOUR_SCRAPEOPS_API_KEY variable with your ScrapeOps API key.
- Replace YOUR_S3_BUCKET_PATH with the path to your S3 bucket.
- Replace YOUR_AWS_KEY_ID and YOUR_AWS_SECRET_ACCESS_KEY with your AWS access key ID and secret access key.
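For orientation, the relevant part of settings.py might look roughly like the sketch below once the placeholders are filled in. The actual file in the repository may be laid out differently. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and FEEDS are standard Scrapy settings, while the names SCRAPEOPS_API_KEY and S3_BUCKET_PATH and the example bucket path are assumptions made for illustration. Reading the secrets from environment variables is an optional variation that keeps them out of version control.

```python
# settings.py fragment (sketch, not the repository's actual file)
import os

# Assumed setting name for the ScrapeOps key; falls back to the placeholder
SCRAPEOPS_API_KEY = os.environ.get("SCRAPEOPS_API_KEY", "YOUR_SCRAPEOPS_API_KEY")

# Standard Scrapy settings used by the S3 feed storage (requires botocore)
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "YOUR_AWS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get(
    "AWS_SECRET_ACCESS_KEY", "YOUR_AWS_SECRET_ACCESS_KEY"
)

# Hypothetical bucket path; replace it with the path to your own bucket
S3_BUCKET_PATH = "s3://your-bucket-name/walmart/products.csv"

# Scrapy feed export: write the scraped items to S3 in CSV format
FEEDS = {
    S3_BUCKET_PATH: {"format": "csv"},
}
```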
To run the script, use the following command and write the desired product keyword:
scrapy crawl walmart
**Note: rotating proxies are used because Walmart.com detects scraper bots and blocks their IP addresses.**
- Walmart.com may block your scraper, so it's important to use a proxy service like ScrapeOps to avoid this
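To illustrate how a request can be routed through the proxy service, the sketch below builds a Walmart search URL for a keyword and wraps it in a call to the ScrapeOps proxy endpoint. It is an assumption about how this could be done, not a copy of the repository's code: the function names are hypothetical, the search URL layout is assumed, and the project may use a Scrapy middleware instead of building URLs by hand.

```python
from urllib.parse import urlencode

# ScrapeOps proxy API endpoint; requests pass the key and the target URL as parameters
SCRAPEOPS_ENDPOINT = "https://proxy.scrapeops.io/v1/"


def walmart_search_url(keyword, page=1):
    """Build a Walmart search URL for a keyword (assumed URL layout)."""
    return "https://www.walmart.com/search?" + urlencode({"q": keyword, "page": page})


def proxied_url(target_url, api_key):
    """Wrap target_url so the request goes through the proxy endpoint."""
    # urlencode percent-encodes the target URL so it survives as a single parameter
    return SCRAPEOPS_ENDPOINT + "?" + urlencode({"api_key": api_key, "url": target_url})
```

A spider would then request proxied_url(walmart_search_url("laptop"), api_key) rather than the Walmart URL directly, so Walmart sees the proxy's rotating IP addresses and not your own.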
- Open Power BI Desktop and click Get Data. Search for Python script, then copy and paste the following code:
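import io
import boto3
import pandas as pd

AWS_ACCESS_KEY = "your-aws-access-key"
AWS_SECRET_ACCESS_KEY = "your-aws-secret-access-key"
AWS_DEFAULT_REGION = "your-aws-region"

# Pass the credentials explicitly; otherwise the variables above are never used
s3 = boto3.resource('s3', aws_access_key_id=AWS_ACCESS_KEY, aws_secret_access_key=AWS_SECRET_ACCESS_KEY, region_name=AWS_DEFAULT_REGION)
# Bucket() takes the bucket name only; the file is selected by its key
bucket = s3.Bucket('your-bucket-name')
for obj in bucket.objects.filter(Prefix='file.csv'):
    key = obj.key
    body = obj.get()['Body'].read()
    # Power BI imports the pandas DataFrames defined by the script
    df = pd.read_csv(io.BytesIO(body))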
- Make sure boto3 is installed in the Python environment that Power BI uses (Power BI's Python scripting also requires pandas). That's all; the CSV file will then be imported.
To dive into the interactive world of this Power BI report, simply download the "Walmart report.pbix" file and unleash the power of data visualization with Power BI.
Thank you for reading till the end!