A Serverless Webscraping Tutorial

Scrape the Amazon Alexa Voice Deals metadata daily using serverless infrastructure.

Tools

AWS CLI
pipenv
AWS DynamoDB
AWS S3
zappa
requests
lxml
boto3

Set Up

This tutorial assumes the AWS CLI is installed and credentials are set up in ~/.aws/credentials. It also assumes S3, API Gateway, DynamoDB, and Lambda are available in the region specified in ~/.aws/config. See Configuring the AWS CLI for more details.

Install pipenv

pipenv manages the virtual environment and its packages. It's awesome!

$ pip3 install pipenv

Clone the repo

$ git clone git@github.com:lmeraz/serverless-webscraper.git

Install the virtual environment

pipenv installs all the packages needed: in this case, zappa, requests, boto3, and lxml.

$ pipenv install

Provision a DynamoDB table

DynamoDB will store the data. Replace MyTableName with your desired table name.

$ aws dynamodb create-table \
    --table-name MyTableName \
    --attribute-definitions \
        AttributeName=Date,AttributeType=S \
        AttributeName=ProductID,AttributeType=S \
    --key-schema AttributeName=Date,KeyType=HASH AttributeName=ProductID,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1
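
The same table could also be created from Python with boto3, which the project already installs. The sketch below is an assumed equivalent of the CLI command above, not code from the repository; table_definition and create_table are hypothetical helper names.

```python
def table_definition(table_name):
    """Build keyword arguments that mirror the `aws dynamodb create-table` command."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "Date", "AttributeType": "S"},
            {"AttributeName": "ProductID", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "Date", "KeyType": "HASH"},        # partition key
            {"AttributeName": "ProductID", "KeyType": "RANGE"},  # sort key
        ],
        "ProvisionedThroughput": {"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    }


def create_table(table_name):
    """Create the table with boto3 (needs AWS credentials and network access)."""
    import boto3  # imported lazily so table_definition() works without boto3

    client = boto3.client("dynamodb")
    return client.create_table(**table_definition(table_name))
```

With Date as the partition key and ProductID as the sort key, the table holds at most one item per product per day; writing the same pair again replaces the earlier item.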

Create the .env file

The webscraper uses environment variables to operate locally and on Lambda. Create a .env file with the following contents. Replace MyTableName with your DynamoDB table name.

target_url = https://www.amazon.com/b?node=16924218011
proxy = {"http": "http://202.159.203.71:80"}
timeout = 5
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
container_query = //ul[@class='promotion-list']//li[@class='promotion']//span[@class='promotion-detail']
title_query = //span[@class='title']/text()
final_price_query = //span[@class='final-price-display-string']/text()
buy_price_query = //span[@class='buy-price']/text()
utterance_query = //span[@class='golden-utterance-content']/text()
href_query = //a/@href
img_query = //img/@src
table = MyTableName
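
Environment variables are always strings, so the JSON-shaped proxy and header values and the numeric timeout need decoding before use. How app.py does this is not shown here; the sketch below is one possible approach, and load_settings is a hypothetical name.

```python
import json
import os


def load_settings(env=os.environ):
    """Read scraper settings from variables named as in the .env file."""
    return {
        "target_url": env["target_url"],
        "proxy": json.loads(env["proxy"]),    # JSON object -> dict
        "timeout": int(env["timeout"]),       # seconds, stored as text
        "header": json.loads(env["header"]),  # JSON object -> dict
        "table": env["table"],
    }
```

The decoded values have the shapes that requests.get accepts for its proxies, timeout, and headers keyword arguments.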

Provision an S3 bucket

The S3 bucket stages our code for Lambda. Replace my-bucket-name with your bucket name.

$ aws s3 mb s3://my-bucket-name

In zappa_settings.json, replace my-bucket-name with your bucket name.
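
If you prefer to script this edit, a minimal sketch follows. It assumes the file has a "dev" stage (the stage name used in the deploy command) that holds Zappa's s3_bucket key; the repository's actual file contents are not shown here, and set_bucket is a hypothetical helper.

```python
import json


def set_bucket(settings, bucket, stage="dev"):
    """Return a copy of the Zappa settings with the stage's s3_bucket replaced."""
    updated = json.loads(json.dumps(settings))  # deep copy of JSON-style data
    updated[stage]["s3_bucket"] = bucket
    return updated


# Example use against the real file (assumed to be in the current directory):
# with open("zappa_settings.json") as f:
#     settings = json.load(f)
# with open("zappa_settings.json", "w") as f:
#     json.dump(set_bucket(settings, "my-real-bucket"), f, indent=4)
```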

Test locally

Start the virtualenv

$ pipenv shell

Run the module.

(virtualenv)$ python app.py

Deploy

Deploy the module

(virtualenv)$ zappa deploy dev

Finally, set up the environment variables in the AWS Lambda console.
