<!DOCTYPE html>
<html lang="en-us">
<head>
<title>scraping.nim</title>
<link rel="icon" href="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text y=%22.9em%22 font-size=%2280%22>🐳</text></svg>">
<meta content="text/html; charset=utf-8" http-equiv="content-type">
<meta content="width=device-width, initial-scale=1" name="viewport">
<link rel='stylesheet' href='https://unpkg.com/normalize.css/'>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/kognise/water.css@latest/dist/light.min.css">
<link rel='stylesheet' href='https://cdn.jsdelivr.net/gh/pietroppeter/nimib/assets/atom-one-light.css'>
</head>
<body>
<header>
<div id="header-box">
<span id="home"><a href=".">🏡</a></span>
<span id="header-title"><code>scraping.nim</code></span>
<span id="github"><a href="https://github.com/ajusa/binarylang-fun"><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" focusable="false" width="1.2em" height="1.2em" style="vertical-align: middle;" preserveAspectRatio="xMidYMid meet" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59c.4.07.55-.17.55-.38c0-.19-.01-.82-.01-1.49c-2.01.37-2.53-.49-2.69-.94c-.09-.23-.48-.94-.82-1.13c-.28-.15-.68-.52-.01-.53c.63-.01 1.08.58 1.23.82c.72 1.21 1.87.87 2.33.66c.07-.52.28-.87.51-1.07c-1.78-.2-3.64-.89-3.64-3.95c0-.87.31-1.59.82-2.15c-.08-.2-.36-1.02.08-2.12c0 0 .67-.21 2.2.82c.64-.18 1.32-.27 2-.27c.68 0 1.36.09 2 .27c1.53-1.04 2.2-.82 2.2-.82c.44 1.1.16 1.92.08 2.12c.51.56.82 1.27.82 2.15c0 3.07-1.87 3.75-3.65 3.95c.29.25.54.73.54 1.48c0 1.07-.01 1.93-.01 2.2c0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z" fill="#000"></path></svg></a></span>
</div>
<style>
div#header-box {
display: flex;
align-items: center;
justify-content: space-between;
}
</style>
<hr>
</header>
<main>
<h1>Webscraping with Binarylang</h1>
<p>Alright! Let's move on to the next problem, shall we?
A common real-world problem is extracting information from a webpage,
usually known as web scraping. There are a few ways to do it, such as
matching with regular expressions or parsing the DOM and using query selectors.</p>
<p>Anyway, I've got some HTML that looks like this:</p>
<pre><code class="language-html"><ul class="cardDeck cardGrid" data-type="anime"><li data-type="anime" data-id="14109" data-episode-type="episodes" data-episodes="" data-total-episodes="6" class="card ">
<a title="<h5 class='theme-font'>Bottom-tier Character Tomozaki</h5><h6 class='theme-font tooltip-alt'>Alt title: Jaku-Chara Tomozaki-kun</h6><ul class='entryBar'><li class='type'>TV (6+ eps)</li><li>Project No. 9</li><li class='iconYear'>2021 - ?</li><li><div class='ttRating'>3.6</div></li></ul><p>Expert gamer Tomozaki Fumiya doesn&rsquo;t exactly fit in, but he wishes he did. With no written rules for success and gameplay that doesn&rsquo;t work in his favor, the real world seems impossible for someone like him. But, like any noob, all he really needs are some strategies and a seasoned player like Aoi Hinami to help him. Hopefully with her guidance, Tomozaki will gain the experience he needs.</p><div class='tooltip notes'><p>Source: Funimation</p></div><div class='tags'><h4>Tags</h4><ul><li>Comedy</li><li>Drama</li><li>Romance</li><li>Shounen</li><li>School Life</li><li>Based on a Light Novel</li></ul></div> <div class='myListBar theirList sep'>
<h4>their anime:</h4>
<span class='status2'></span> Watching - 5/6 eps </div>
" href="/anime/bottom-tier-character-tomozaki" class="tooltip anime14109">
<div class="crop"><img alt="Bottom-tier Character Tomozaki" data-src="/images/anime/covers/thumbs/bottom-tier-character-tomozaki-14109.jpg?t=1610367923" src="/inc/img/card-load.svg" /></div><div class="statusArea"><span class='status2'></span> 5 eps</div> <h3 class='cardName'>Bottom-tier Character Tomozaki</h3>
</a>
</li><li data-type="anime" data-id="14295" data-episode-type="episodes" data-episodes="" data-total-episodes="5" class="card ">
<a title="<h5 class='theme-font'>Dr. Stone: Stone Wars</h5><ul class='entryBar'><li class='type'>TV (5+ eps)</li><li>TMS Entertainment</li><li class='iconYear'>2021 - ?</li><li><div class='ttRating'>4.6</div></li></ul><p>Second season of <a href=&quot;https://www.anime-planet.com/anime/dr-stone&quot; >Dr. Stone</a>.</p><div class='tags'><h4>Tags</h4><ul><li>Adventure</li><li>Comedy</li><li>Sci Fi</li><li>Shounen</li><li>Modern Knowledge</li><li>Person in a Strange World</li><li>Post-apocalyptic</li><li>Prehistoric</li><li>Survival</li><li>Based on a Manga</li></ul></div> <div class='myListBar theirList sep'>
<h4>their anime:</h4>
<span class='status2'></span> Watching - 4/5 eps </div>
" href="/anime/dr-stone-stone-wars" class="tooltip anime14295">
<div class="crop"><img alt="Dr. Stone: Stone Wars" data-src="/images/anime/covers/thumbs/dr-stone-stone-wars-14295.jpg?t=1599268423" src="/inc/img/card-load.svg" /></div><div class="statusArea"><span class='status2'></span> 4 eps</div> <h3 class='cardName'>Dr. Stone: Stone Wars</h3>
</a>
</li><li data-type="anime" data-id="15781" data-episode-type="episodes" data-episodes="" data-total-episodes="5" class="card ">
<a
...
</code></pre>
<p>Oh boy, this is a mess. The <code>title</code> attribute is doing something
non-standard: it holds an entire block of HTML markup.</p>
<p>What we want: a list of shows and the watch status of each (how many episodes have been watched).
Hm, it looks like the title sits between <code><h5 class='theme-font'></code> and <code></h5></code>, and the
watch status sits between <code>Watching - </code> and <code> eps</code>. Let's try it!</p>
<pre><code class="nim hljs">createParser(show):
  s: _ <span class="hljs-comment"># skip until we see the next field</span>
  s: _ = <span class="hljs-string">"<h5 class='theme-font'>"</span>
  s: title
  s: _ = <span class="hljs-string">"</h5>"</span>
  s: _
  s: _ = <span class="hljs-string">"Watching - "</span>
  s: seen
  s: _ = <span class="hljs-string">"/"</span>
  s: total
  s: _ = <span class="hljs-string">" eps"</span>
  s: _
  s: _ = <span class="hljs-string">"</li>"</span> <span class="hljs-comment"># Read until the end of the item</span>
print website.toShow</code></pre>
<pre><samp>toShow(website)=Show(title:"Bottom-tier Character Tomozaki", seen:"5", total:"6")</samp></pre>
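<p>For comparison, here is a rough hand-written sketch of the same idea using only <code>strutils</code>. It is not how binarylang works internally, just the same "read until the next marker" logic spelled out. The names <code>readUntil</code> and <code>parseShow</code> and the shortened <code>sample</code> string are made up for illustration.</p>
<pre><code class="nim hljs">import std/strutils

type Show = object
  title, seen, total: string

# Hypothetical helper: return the text from `pos` up to `stop`
# and move `pos` just past `stop`.
proc readUntil(s, stop: string; pos: var int): string =
  let at = s.find(stop, pos)
  if at < 0:
    raise newException(ValueError, "expected " & stop)
  result = s[pos ..< at]
  pos = at + stop.len

# Mirrors the field order of the parser above: skip to a marker,
# keep the text up to the next marker, and repeat.
proc parseShow(s: string; pos: var int): Show =
  discard readUntil(s, "<h5 class='theme-font'>", pos)
  result.title = readUntil(s, "</h5>", pos)
  discard readUntil(s, "Watching - ", pos)
  result.seen = readUntil(s, "/", pos)
  result.total = readUntil(s, " eps", pos)
  discard readUntil(s, "</li>", pos)  # consume the rest of the item

# A shortened, invented stand-in for one list item (assumption).
let sample = "<li><a title=\"<h5 class='theme-font'>Horimiya</h5><p>text</p> Watching - 5/5 eps \"></a></li>"
var pos = 0
let show = parseShow(sample, pos)
echo show</code></pre>
<p>The declarative parser says the same thing without any of the position bookkeeping.</p>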
<p>Wasn't that easy? You don't need to parse the HTML DOM or work out a regex,
and you get a normal Nim type to work with (every field comes back as a string). Now, let's
generalize this to all of the shows.</p>
<pre><code class="nim hljs">createParser(information):
  *show: {shows}
  s: _ = <span class="hljs-string">"</ul>"</span> <span class="hljs-comment"># Ends when the list ends</span>
print website.toInformation</code></pre>
<pre><samp>toInformation(website)=Information(
  shows:@[
    Show(title:"Bottom-tier Character Tomozaki", seen:"5", total:"6"),
    Show(title:"Dr. Stone: Stone Wars", seen:"4", total:"5"),
    Show(title:"Horimiya", seen:"5", total:"5"),
    Show(title:"Mushoku Tensei: Jobless Reincarnation", seen:"5", total:"5"),
    Show(title:"Re:ZERO -Starting Life in Another World- Season 2: Part II", seen:"6", total:"6"),
    Show(title:"So I\'m a Spider, So What?", seen:"5", total:"6"),
    Show(title:"Suppose a Kid from the Last Dungeon Boonies Moved to a Starter Town", seen:"6", total:"6"),
    Show(title:"That Time I Got Reincarnated as a Slime Season 2", seen:"5", total:"5")
  ]
)</samp></pre>
<p>And that's it! The only tricky part is figuring out when to stop parsing,
but as long as the website has some sort of structure, this is pretty doable.</p>
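<p>To make the stopping rule concrete, here is a hedged plain-Nim sketch (not binarylang itself) that keeps reading items until the text right after an item's <code></li></code> is <code></ul></code>. The helper names and the tiny <code>page</code> string are invented for illustration, and a real page may need whitespace handling between the tags.</p>
<pre><code class="nim hljs">import std/strutils

# Hypothetical helper: return the text from `pos` up to `stop`
# and move `pos` just past `stop`.
proc readUntil(s, stop: string; pos: var int): string =
  let at = s.find(stop, pos)
  if at < 0:
    raise newException(ValueError, "expected " & stop)
  result = s[pos ..< at]
  pos = at + stop.len

# Collect "seen/total" for every item. The loop ends once the text
# right after an item's "</li>" is the list's closing "</ul>".
# A missing marker raises ValueError instead of looping forever.
proc watchStatuses(page: string): seq[string] =
  var pos = 0
  while not page.continuesWith("</ul>", pos):
    discard readUntil(page, "Watching - ", pos)
    result.add readUntil(page, " eps", pos)
    discard readUntil(page, "</li>", pos)

# An invented, heavily shortened stand-in for the real list (assumption).
let page = "<ul><li>Watching - 5/6 eps </li><li>Watching - 4/5 eps </li></ul>"
let statuses = watchStatuses(page)
echo statuses</code></pre>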
</main>
<footer>
<hr>
<span id="made">made with <a href="https://github.com/pietroppeter/nimib">nimib 🐳</a></span>
<button id="show" onclick="toggleSourceDisplay()">Show Source</button>
<section id="source">
<pre><code class="nim hljs"><span class="hljs-keyword">import</span> binarylang, nimib, strutils, print, strformat
printColors = <span class="hljs-literal">false</span>
nbInit
nbText:<span class="hljs-string">"""
# Webscraping with Binarylang
Alright! Let's move on to the next problem, shall we?
A common real-world problem is extracting information from a webpage,
usually known as web scraping. There are a few ways to do it, such as
matching with regular expressions or parsing the DOM and using query selectors.

Anyway, I've got some HTML that looks like this:
"""</span>
<span class="hljs-keyword">var</span> website = readFile(<span class="hljs-string">"anime.html"</span>)
nbText: &<span class="hljs-string">"""
```html
{website[0..3000]}
...
```
"""</span>
nbText: <span class="hljs-string">"""
Oh boy, this is a mess. The `title` attribute is doing something
non-standard: it holds an entire block of HTML markup.

What we want: a list of shows and the watch status of each (how many episodes have been watched).
Hm, it looks like the title sits between `<h5 class='theme-font'>` and `</h5>`, and the
watch status sits between `Watching - ` and ` eps`. Let's try it!
"""</span>
nbCode:
  createParser(show):
    s: _ <span class="hljs-comment"># skip until we see the next field</span>
    s: _ = <span class="hljs-string">"<h5 class='theme-font'>"</span>
    s: title
    s: _ = <span class="hljs-string">"</h5>"</span>
    s: _
    s: _ = <span class="hljs-string">"Watching - "</span>
    s: seen
    s: _ = <span class="hljs-string">"/"</span>
    s: total
    s: _ = <span class="hljs-string">" eps"</span>
    s: _
    s: _ = <span class="hljs-string">"</li>"</span> <span class="hljs-comment"># Read until the end of the item</span>
  print website.toShow
nbText: <span class="hljs-string">"""
Wasn't that easy? You don't need to parse the HTML DOM or work out a regex,
and you get a normal Nim type to work with (every field comes back as a string). Now, let's
generalize this to all of the shows.
"""</span>
nbCode:
  createParser(information):
    *show: {shows}
    s: _ = <span class="hljs-string">"</ul>"</span> <span class="hljs-comment"># Ends when the list ends</span>
  print website.toInformation
nbText: <span class="hljs-string">"""
And that's it! The only tricky part is figuring out when to stop parsing,
but as long as the website has some sort of structure, this is pretty doable.
"""</span>
nbSave
</code></pre>
</section>
<script>
function toggleSourceDisplay() {
var btn = document.getElementById("show")
var source = document.getElementById("source");
if (btn.innerHTML=="Show Source") {
btn.innerHTML = "Hide Source";
source.style.display = "block";
} else {
btn.innerHTML = "Show Source";
source.style.display = "none";
}
}
</script>
<style>
span#made {
font-size: 0.8rem;
}
button#show {
font-size: 0.8rem;
}
button#show {
float: right;
padding: 2px;
padding-right: 5px;
padding-left: 5px;
}
section#source {
display:none
}
</style>
</footer>
</body>
</html>