# ---
# title: "Web scraping with Python Exercise Solutions"
# always_allow_html: yes
# output:
# html_document:
# highlight: tango
# toc: true
# toc_float:
# collapsed: true
# jupyter:
# jupytext_format_version: '1.3'
# jupytext_formats: ipynb,Rmd:rmarkdown,py:light,md:markdown
# kernelspec:
# display_name: Python 3
# language: python
# name: python3
# language_info:
# codemirror_mode:
# name: ipython
# version: 3
# file_extension: .py
# mimetype: text/x-python
# name: python
# nbconvert_exporter: python
# pygments_lexer: ipython3
# version: 3.7.0
# toc:
# base_numbering: 1
# nav_menu: {}
# number_sections: true
# sideBar: true
# skip_h1_title: true
# title_cell: Table of Contents
# title_sidebar: Contents
# toc_cell: false
# toc_position: {}
# toc_section_display: true
# toc_window_display: true
# ---
#
# <style type="text/css">
# pre code {
# display: block;
# unicode-bidi: embed;
# font-family: monospace;
# white-space: pre;
# max-height: 400px;
# overflow-x: scroll;
# overflow-y: scroll;
# }
# </style>
# + {"hide_input": true, "results": "'hide'"}
## This part is optional; it sets some printing options
## that make output look nicer.
from pprint import pprint as print
import pandas as pd
pd.set_option('display.width', 133)
pd.set_option('display.max_colwidth', 30)
pd.set_option('display.max_columns', 5)
# -
# ## Exercise: Retrieve exhibits data
#
# In this exercise you will retrieve information about the art
# exhibitions at Harvard Art Museums from
# `https://www.harvardartmuseums.org/visit/exhibitions`
#
# 1. Using a web browser (Firefox or Chrome recommended) inspect the
# page at `https://www.harvardartmuseums.org/visit/exhibitions`. Examine
# the network traffic as you interact with the page. Try to find
# where the data displayed on that page comes from.
# +
## TODO
# -
# 2. Make a `get` request in Python to retrieve the data from the URL
# identified in step 1.
# +
## TODO
# -
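# A sketch of one possible solution. The feed URL and the `type`
# parameter below are assumptions about what the network inspector
# shows in step 1; substitute whatever endpoint you actually found.

```python
import requests

# Assumed endpoint discovered via the browser's network inspector (step 1).
exhibit_url = 'https://www.harvardartmuseums.org/search/load_more'
params = {'type': 'past-exhibition', 'page': 1}

try:
    exhibits = requests.get(exhibit_url, params=params, timeout=10)
    exhibits_data = exhibits.json()
except (requests.RequestException, ValueError):
    exhibits_data = None  # no network access, or the response was not JSON
```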
# 3. Write a *loop* or *list comprehension* in Python to retrieve data
# for the first 5 pages of exhibitions data.
# +
## TODO
# -
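# One way to sketch this, reusing the (assumed) endpoint from step 2; a
# small helper keeps the comprehension readable and returns `None` for
# any page that fails to load.

```python
import requests

base_url = 'https://www.harvardartmuseums.org/search/load_more'  # assumed endpoint


def get_exhibit_page(page):
    """Fetch one page of exhibitions data; return None on failure."""
    try:
        resp = requests.get(base_url,
                            params={'type': 'past-exhibition', 'page': page},
                            timeout=10)
        return resp.json()
    except (requests.RequestException, ValueError):
        return None


# Pages are numbered from 1, so the first 5 pages are 1 through 5.
first_five = [get_exhibit_page(page) for page in range(1, 6)]
```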
# 4. Bonus (optional): Arrange the data you retrieved into dict of
# lists. Convert it to a pandas `DataFrame` and save it to a `.csv`
# file.
# +
## TODO
# -
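# A minimal sketch of the bonus step. The field names (`title`,
# `begindate`, `enddate`) are assumptions about the keys in the JSON
# records; the sample records below stand in for the data retrieved in
# step 3.

```python
import pandas as pd

# Placeholder records; in practice, collect these from the pages
# retrieved in step 3.
records = [
    {'title': 'Example Exhibit A', 'begindate': '2018-01-01', 'enddate': '2018-06-01'},
    {'title': 'Example Exhibit B', 'begindate': '2018-07-01', 'enddate': '2018-12-01'},
]

# Arrange the data into a dict of lists, one list per column.
exhibit_dict = {key: [rec[key] for rec in records] for key in records[0]}

# Convert to a DataFrame and save it as a .csv file.
exhibits_df = pd.DataFrame(exhibit_dict)
exhibits_df.to_csv('exhibitions.csv', index=False)
```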
# ## Exercise: parsing HTML
# In this exercise you will retrieve information about the physical
# layout of the Harvard Art Museums. The web page at
# <https://www.harvardartmuseums.org/visit/floor-plan> contains this
# information in HTML form.
#
# 1. Using a web browser (Firefox or Chrome recommended) inspect the
# page at `https://www.harvardartmuseums.org/visit/floor-plan`. Copy
# the `XPath` to the element containing the list of facilities located on
# **level 1**. (HINT: the element of interest is a `ul`,
# i.e., an "unordered list", of class `ifp-floors__rooms`.)
#
# 
#
# 2. Make a `get` request in Python to retrieve the web page at
# <https://www.harvardartmuseums.org/visit/floor-plan>. Extract the
# content from your request object and parse it using `html.fromstring`
# from the `lxml` library.
# +
import requests
from lxml import html

floor_plan = requests.get('https://www.harvardartmuseums.org/visit/floor-plan')
floor_plan_html = html.fromstring(floor_plan.text)
# -
# 3. Use the `XPath` you identified in step one to select the HTML list item
# containing level one information.
level_one = floor_plan_html.xpath('/html/body/main/section/ul/li[5]/div[2]/ul')[0]
# 4. Use a *for loop* or *list comprehension* to iterate over the
# sub-elements of the list item you selected in the previous step and
# extract the text from each one.
print([element.text_content() for element in level_one])
# 5. Bonus (optional): Extract the list of facilities available on each level.
# +
level_html = floor_plan_html.xpath('/html/body/main/section/ul/li')
level_info = [[element.text_content()
               for element in level.xpath('div[2]/ul')[0]]
              for level in level_html]
print(level_info)