WEEK 9 & Mid Sem Development

TingYao3 edited this page Sep 26, 2018 · 2 revisions

Data Scraping

One way to gather ongoing and upcoming events to recommend in "In the Moment" is to scrape data from similar sites. Based on the research, data scraping comes with some legal issues; however, it appears to be acceptable as long as it is used in the public interest rather than for commercial purposes.

It is therefore essential to understand basic data scraping and start implementing it, to confirm that it can be used in the app.

A tutorial from the site ScrapeHero presents some very useful techniques for writing a data scraper in Python with its related libraries.

https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/

How does it work?

Data scraping is a three-step process: download the contents of the relevant web pages (text formatted as HTML), extract the information of interest, and store it. A diagram in the ScrapeHero tutorial visualises this process.
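As a rough illustration of the three steps, here is a minimal sketch that uses only Python's standard library. The function names (download, extract_titles, store), the choice of <h2> as the element to extract and the sample HTML are all assumptions made for this example. It is not the ScrapeHero code, and it uses the built-in html.parser module in place of a scraping library so that it has no dependencies.

```python
import json
import urllib.request
from html.parser import HTMLParser


def download(url):
    """Step 1: download the raw HTML of a page as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class HeadingParser(HTMLParser):
    """Collects the text found inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.headings[-1] += data


def extract_titles(html):
    """Step 2: extract the pieces of information we care about."""
    parser = HeadingParser()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


def store(records, path):
    """Step 3: store the extracted information (here, as JSON)."""
    with open(path, "w") as outfile:
        json.dump(records, outfile, indent=4)


# Hypothetical page content standing in for a real download() call
sample_html = "<html><body><h2>Event A</h2><p>...</p><h2> Event B </h2></body></html>"
titles = extract_titles(sample_html)
store(titles, "titles.json")
```

In a real run, the string passed to extract_titles would come from download(url) instead of the hard-coded sample.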

Build a web scraper for Eventbrite

Eventbrite is one of the main online event platforms, so a simple web scraper was built to extract information such as event title, description, time and location.

The scraper itself is written in Python with BeautifulSoup.

  • Python was chosen as the main programming language because its many libraries reduce the amount of code that has to be written
  • BeautifulSoup is a Python package used for parsing HTML and XML documents.

Step 1: Creating the connection

In this example, the scraper only collects event information from the first page of events for Brisbane City, Australia. (https://www.eventbrite.com.au/d/australia--brisbane-city/all-events/)

import urllib.request

# Add a User-Agent string to the request to prevent getting blocked while scraping
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'

# Build the request with the User-Agent header, then download the URL into the variable html
url = "https://www.eventbrite.com.au/d/australia--brisbane-city/all-events/"
request = urllib.request.Request(url, headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()

Step 2: Extracting the content from the search page

The content is extracted by locating specific HTML elements through their classes and attributes, then narrowing the extraction down from there.

from bs4 import BeautifulSoup

# Pass the HTML to BeautifulSoup for extraction
soup = BeautifulSoup(html, 'html.parser')

# Narrow down and find the content within the search results container
main_table = soup.find("main", attrs={'data-spec': 'search-results'})

# Now go into main_table and get every individual search result element
links = main_table.find_all("section", class_="eds-l-pad-all-6 eds-media-card-content eds-media-card-content--list eds-media-card-content--standard eds-media-card-content--fixed")
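One thing worth knowing about the long class string: when class_ is given a string containing several class names, BeautifulSoup matches it against the exact value of the class attribute, so a card whose classes are reordered or extended will be skipped. The sketch below compares that with a CSS selector on a single class. The markup is a hypothetical, simplified stand-in for the search page, and the example.com links are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the Eventbrite search page
sample_html = """
<main data-spec="search-results">
  <section class="eds-l-pad-all-6 eds-media-card-content eds-media-card-content--list">
    <a class="eds-media-card-content__action-link" href="https://example.com/e/1">Event One</a>
  </section>
  <section class="eds-media-card-content eds-l-pad-all-6">
    <a class="eds-media-card-content__action-link" href="https://example.com/e/2">Event Two</a>
  </section>
  <section class="unrelated">Footer</section>
</main>
"""

soup = BeautifulSoup(sample_html, "html.parser")
main_table = soup.find("main", attrs={"data-spec": "search-results"})

# Multi-class string: matched against the exact class attribute value
exact = main_table.find_all(
    "section",
    class_="eds-l-pad-all-6 eds-media-card-content eds-media-card-content--list",
)

# CSS selector: any section carrying this one class, whatever else it has
cards = main_table.select("section.eds-media-card-content")

hrefs = [card.find("a")["href"] for card in cards]
```

Whether the looser selector is appropriate depends on how stable the site's class names are; it is shown here only as an option.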

Step 3: Loop through extracting individual event information

# List to store the extracted data
extracted_data = []

# Loop through each individual element
for link in links:

    # Get the URL of the specific event page
    primaryLink = link.find("a", attrs={'class': 'eds-media-card-content__action-link'})
    title = primaryLink.text
    url = primaryLink['href']

    # Call the function parse_event_page, which extracts the information from each specific event page
    result = parse_event_page(url)

    # Some event pages are formatted differently; for those the function returns None and nothing is extracted for now
    if result is not None:
        # Add the information to the list, reusing the result instead of fetching the page a second time
        extracted_data.append(result)
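To illustrate the "return None for an unrecognised layout" behaviour, here is a hedged sketch of what such a parsing function might look like. It is not the project's parse_event_page: the name parse_event_html is hypothetical, it takes an HTML string rather than a URL so it can run offline, and the selectors (h1, div.price, div.description) are placeholders, not Eventbrite's real markup.

```python
from bs4 import BeautifulSoup


def parse_event_html(html):
    """Return a dict of event fields, or None if the layout is not recognised.

    The selectors below are placeholders, not Eventbrite's real markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    price = soup.find("div", class_="price")
    description = soup.find("div", class_="description")

    # A differently formatted page is missing at least one expected element
    if title is None or price is None or description is None:
        return None

    return {
        "title": title.get_text(strip=True),
        "price": price.get_text(strip=True),
        "description": description.get_text(" ", strip=True),
    }


# Two hypothetical pages: one with the expected layout, one without
known_layout = (
    '<h1>Sample Event</h1><div class="price">\n\t$99\n</div>'
    '<div class="description"><p>First line.</p><p>Second line.</p></div>'
)
other_layout = "<h1>Sample Event</h1><p>No price or description blocks here.</p>"

extracted_data = []
for page in (known_layout, other_layout):
    result = parse_event_html(page)
    if result is not None:
        extracted_data.append(result)
```

Passing strip=True to get_text also trims the tabs and newlines that otherwise surround values taken straight from the page.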

Step 4: Store the extraction

Currently, the data is stored in a JSON file; it can later be moved into a database.

import json

# Write the extracted data to a JSON file for now
with open('data3.json', 'w') as outfile:
    json.dump(extracted_data, outfile, indent=4)
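Moving from JSON to a database could look something like the following minimal sketch, which uses Python's built-in sqlite3 module. The table name events, the column names and the sample record are assumptions made for this example, and an in-memory database is used so that nothing is written to disk.

```python
import json
import sqlite3

# Hypothetical record shaped like the scraper's output
extracted_data = [
    {"title": "Sample Event", "price": "$99", "description": "A sample description."},
]

# Round-trip through JSON to mimic loading the saved file
records = json.loads(json.dumps(extracted_data))

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE events (title TEXT, price TEXT, description TEXT)"
)

# Named placeholders map each dictionary key to a column
connection.executemany(
    "INSERT INTO events (title, price, description) "
    "VALUES (:title, :price, :description)",
    records,
)
connection.commit()

rows = connection.execute("SELECT title, price FROM events").fetchall()
```

Swapping ":memory:" for a file path would persist the table between runs.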

Example of one event with its information extracted and stored in the JSON file: (https://www.eventbrite.com.au/e/bbe-presents-rl-grime-goldlink-brisbane-tickets-48192146006?aff=ebdssbdestsearch)

[
{
    "title": "BBE presents RL GRIME + GOLDLINK [BRISBANE]",
    "price": "\n\t$99\n",
    "description": "\n**THIS SHOW IS 16+**\nWith his hotly anticipated album NOVA out now, RL Grime now announces a huge Australian album tour set for this November. Bringing along Goldlink as a very special guest, the tour will head around the country playing at iconic venues and includes stops at This That and Spilt Milk festivals. \nRL Grime is no stranger to Australian audiences having sold out multiple venues on his biggest headline tour to date in 2017 and performing at festivals including Splendour in the Grass. You won\u2019t want to miss out when he brings Goldlink with him to Australia later this year.\n**ALL SALES ARE FINAL**\n",
    "date and time and location": "\nDate and Time\n\n\n\n\nSat., 10/11/2018, 6:00 pm AEST\n\n\nAdd to Calendar\n\n\n\nLocation\n\nBrisbane Showgrounds\n600 Gregory Terrace \nBowen Hills, QLD 4006 \n\nView Map\nView Map\n\n\nRefund Policy\n\nNo Refunds\n\n"
},