---
title : Exploring Polling Data in R
description : This chapter will show you how to explore, refine, and visualize a sample of polling data from the 2016 primaries. You'll start with simple visualizations using base R commands before proceeding to more complicated plots in the ggplot2 package. The final exercises teach you how to generate maps in R using the maps package and the googleVis package.
attachments :
--- type:MultipleChoiceExercise lang:r xp:50 skills:1 key:9245d968f0
## Reading the polls
The first step in working with polling data is understanding what types of conclusions we can derive from the polls. Election polls are surveys designed to estimate voter preferences and generally rely on a large, representative sample of a given population. However, polling data can often be biased, unrepresentative of the population, or missing important variables related to election outcomes (for example, whether someone will actually vote).
Even the best polls are only an approximation of voter preferences. A poll that was taken three months, three weeks, or even three hours before the election still can't tell you exactly how someone will behave in the voting booth.
The plot on the right shows the percentage of voters supporting each candidate across all states in the 2016 Republican primaries. Which of the following is a reasonable conclusion to draw from these data?
*** =instructions
- Ted Cruz is unlikely to win a single state.
- Donald Trump is generally the most popular candidate, but may lag behind other candidates in certain states.
- Donald Trump is guaranteed to win the Republican nomination.
- Donald Trump will likely win the general election against the Democratic candidate in November.
*** =hint
Take a look at the plot. Which candidate appears to be leading? What can these data tell us about voting outcomes?
*** =pre_exercise_code
```{r}
polls <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1635/datasets/polls.csv")
polls$polldate <- as.Date(polls$polldate)
polls$electiondate <- as.Date(polls$electiondate)
library(ggplot2)
library(scales)
ggplot(data=subset(polls, type=="rep")) +
geom_jitter(aes(y=percent, x=polldate, col=candidate), alpha=0.3, na.rm=TRUE) +
geom_smooth(aes(y=percent, x=polldate, col=candidate), span=0.5, se=FALSE, alpha=0.7, na.rm=TRUE) +
scale_x_date(breaks = date_breaks("1 month"), labels = date_format("%b %Y")) +
scale_y_continuous(breaks = c(0, 20, 40, 60)) +
labs(x="Poll Date", y="") +
scale_colour_discrete(breaks=c("trump", "cruz", "kasich", "rubio"),
labels=c("Donald Trump","Ted Cruz", "John Kasich","Marco Rubio")) +
theme(legend.position="top", legend.title=element_blank())
```
*** =sct
```{r}
msg_bad <- "Not quite right! That may be true, but we can't tell from these polls."
msg_success <- "Exactly! Donald Trump has a lead on average, but that doesn't mean he'll win every state."
test_mc(correct = 2, feedback_msgs = c(msg_bad, msg_success, msg_bad, msg_bad))
```
--- type:NormalExercise lang:r xp:100 skills:1, 3 key:c03ca11be3
## Looking under the hood
In the previous exercise, you viewed polling data from the 2016 Republican primaries. In this exercise, you'll take a look at a larger dataset containing a sample of polls from both the Republican and Democratic primaries.
A dataset with these data, `polls`, is available in the workspace.
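As a rough illustration of the two steps this exercise asks for, the sketch below uses a small made-up data frame (`toy_polls`, with invented values) rather than the real `polls` data. It assumes `candidate` may be stored as character, which is what `read.csv()` returns by default on R 4.0 and later, so it wraps the column in `factor()` before passing it to `col`.

```r
# Hypothetical toy data with the same column names as polls (values are invented)
toy_polls <- data.frame(
  type      = c("dem", "dem", "rep", "dem", "rep"),
  candidate = c("clinton", "sanders", "trump", "clinton", "cruz"),
  polldate  = as.Date(c("2016-01-05", "2016-01-05", "2016-01-06",
                        "2016-02-01", "2016-02-01")),
  percent   = c(52, 40, 35, 49, 28),
  stringsAsFactors = FALSE
)

# Keep only the Democratic primary polls
toy_dem <- subset(toy_polls, type == "dem")

# A factor is read as integer codes by col, so each candidate gets one color
plot(toy_dem$polldate, toy_dem$percent, col = factor(toy_dem$candidate))
```

If `candidate` is already a factor in your session, the `factor()` call is harmless and the plot is unchanged.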
*** =instructions
- Take a look at the structure of `polls` using `str()`.
- Select polls for the Democratic primaries only (`type == "dem"`). Assign these polls to `dem_polls`.
- Use R's base plot function, `plot()`, to plot the date of the poll (`dem_polls$polldate`) on the x-axis, candidate support (`dem_polls$percent`) on the y-axis, and give each candidate (`dem_polls$candidate`) their own color.
*** =hint
- Use `str()` for the first instruction.
- For the second instruction, you should use `subset(polls, polls$type == "...")`.
- For the plot, use `plot(x = ..., y = ..., col = ...)`.
*** =pre_exercise_code
```{r}
polls <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1635/datasets/polls.csv")
polls$polldate <- as.Date(polls$polldate)
polls$electiondate <- as.Date(polls$electiondate)
```
*** =sample_code
```{r}
# polls is available in your workspace
# Check out the structure of polls
# Select polls for the Democratic primaries only: dem_polls
dem_polls <- subset(polls, polls$type == "___")
# Using dem_polls, plot polldate on the x-axis, percent (i.e. percent of voters supporting a candidate) on the y-axis, and set the color using candidate
plot(___, ___, col= ___)
```
*** =solution
```{r}
# polls is available in your workspace
# Check out the structure of polls
str(polls)
# Select polls for the Democratic primaries only: dem_polls
dem_polls <- subset(polls, polls$type == "dem")
# Using dem_polls, plot polldate on the x-axis, percent (i.e. percent of voters supporting a candidate) on the y-axis, and set the color using candidate
plot(dem_polls$polldate, dem_polls$percent, col = dem_polls$candidate)
```
*** =sct
```{r}
# SCT written with testwhat: https://github.com/datacamp/testwhat/wiki
test_function("str", args = "object",
not_called_msg = "You didn't call `str()`!",
incorrect_msg = "You didn't call `str(object = ...)` with the correct argument, `object`.")
test_object("dem_polls")
test_function("plot", args = "x")
test_function("plot", args = "y")
test_function("plot", args = "col")
test_error()
success_msg("Great work! Your polling data are accompanied by information about the poll operator (`pollster`) and sample size. That plot looks a little messy. In the next exercise, you'll use the `ggplot2` package to produce a more intuitive plot.")
```
--- type:NormalExercise lang:r xp:100 skills:1 key:b29b654048
## Visualizing polling trends
Your previous plot of Democratic primary polls was difficult to interpret because it didn't have a legend and didn't obviously show trends over time. To draw conclusions from polling data, analysts (and the general public!) often rely on crisp and clear visualizations. In this exercise, you'll use the [ggplot2](http://www.rdocumentation.org/packages/ggplot2/versions/2.1.0) package to produce a more intuitive plot of Democratic primary polls.
In this exercise, you'll return to the complete dataset of polls (`polls`), which is preloaded in your workspace.
*** =instructions
- Recreate your previous plot using [ggplot()](http://www.rdocumentation.org/packages/ggplot2/versions/2.1.0/topics/ggplot). Select only data from Democratic primaries (`polls$type == "dem"`) using `subset()`. Plot `percent` on the y-axis, `polldate` on the x-axis, and set color to `candidate`. Store your plot as `dem_plot`.
- View your new plot.
- Add a trend line to `dem_plot` using [geom_smooth()](http://www.rdocumentation.org/packages/ggplot2/versions/2.1.0/topics/geom_smooth). Set the `span` equal to 0.5 and remove confidence intervals (`se = FALSE`).
*** =hint
- Use `subset(polls, polls$type == "dem")` to select only Democratic primaries.
- Make sure you set y = `percent`, x = `polldate`, and col = `candidate`.
*** =pre_exercise_code
```{r}
polls <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1635/datasets/polls.csv")
polls$polldate <- as.Date(polls$polldate)
polls$electiondate <- as.Date(polls$electiondate)
```
*** =sample_code
```{r}
# Load the ggplot2 package
# Create a plot of Democratic candidate support (percent) over time (polldate), setting color to candidate
dem_plot <- ggplot(data = subset(polls, polls$type == "___"), aes(y = ___, x = ___, col = ___)) +
geom_point(alpha = 0.5)
# View your new plot
# Add a trend line for each candidate using geom_smooth()
dem_plot +
geom_smooth(___)
```
*** =solution
```{r}
# Load the ggplot2 package
library(ggplot2)
# Create a plot of Democratic candidate support (percent) over time (polldate), setting color to candidate
dem_plot <- ggplot(data = subset(polls, polls$type == "dem"), aes(y = percent, x = polldate, col = candidate)) +
geom_point(alpha = 0.5)
# View your new plot
dem_plot
# Add a trend line for each candidate using geom_smooth()
dem_plot +
geom_smooth(span = 0.5, se = FALSE)
```
*** =sct
```{r}
# SCT written with testwhat: https://github.com/datacamp/testwhat/wiki
test_function("ggplot", args = "data",
not_called_msg = "Make sure to call `ggplot()`",
args_not_specified_msg = "Have you specified the argument `data` in `ggplot`?",
incorrect_msg = "Have you correctly specified the argument `data` in `ggplot`?")
test_function("ggplot", args = "mapping")
test_function("geom_smooth", args = "span")
test_function("geom_smooth", args = "se")
test_error()
success_msg("Excellent! Your new plot of Democratic primary polls is crisp and easy to understand.")
```
--- type:NormalExercise lang:r xp:100 skills:1, 3, 8 key:a85763af21
## Improving poll data quality
You just made a great plot showing support for Hillary Clinton and Bernie Sanders over the course of the 2016 Democratic primaries. Before you draw any conclusions from this plot, you'll want to address issues of data quality. Even the best polls can suffer from inaccuracy caused by low sample size or pollster bias.
In this exercise, you'll search your polling data for unrepresentative polls and refine your dataset to contain only the most accurate polls.
`polls` is available in your workspace.
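To see why small samples are a concern, recall the standard approximation for the 95% margin of error of a proportion from a simple random sample: 1.96 * sqrt(p * (1 - p) / n). The sketch below applies it through `margin_of_error()`, a hypothetical helper introduced here for illustration, and assumes the worst case of p = 0.5. Real polls use weighting and more complex designs, so treat the results as rough guides only.

```r
# Hypothetical helper: approximate 95% margin of error for a proportion,
# assuming a simple random sample of size n
margin_of_error <- function(n, p = 0.5, z = 1.96) {
  z * sqrt(p * (1 - p) / n)
}

# Compare a few sample sizes
sizes <- c(100, 400, 1000)
moe <- margin_of_error(sizes)

# Express the result in percentage points
round(100 * moe, 1)
```

Under these assumptions the margins are about 9.8, 4.9, and 3.1 percentage points, so a poll of 400 respondents is roughly twice as precise as a poll of 100.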
*** =instructions
- One common source of inaccuracy in polls is low sample size. Take a look at sample sizes across your data using [summary()](http://www.rdocumentation.org/packages/base/versions/3.3.1/topics/summary).
- Generate a histogram showing the distribution of sample sizes for polls where `samplesize` is less than 1000. A vector of breaks for your histogram (`breaks`) has been prepared for you.
- Create a new data frame that contains only polls with a sample size greater than 400. Save this as `polls2`.
*** =hint
- Specify the sample size variable within your `polls` data frame using `polls$samplesize`.
- Include only a subset of your data using `polls[polls$samplesize > ..., ]`.
*** =pre_exercise_code
```{r}
polls <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1635/datasets/polls.csv")
polls$polldate <- as.Date(polls$polldate)
polls$electiondate <- as.Date(polls$electiondate)
breaks <- seq.int(0, 1300, 100)
```
*** =sample_code
```{r}
# Summarize samplesize (contained in the polls data frame)
summary(___)
# Create a histogram showing the distribution of polls with a sample size below 1000. Use the vector `breaks` for your histogram breaks.
hist(___[polls$samplesize < 1000], breaks)
# Create a new data frame that contains only polls with a sample size greater than 400: polls2
polls2 <-
```
*** =solution
```{r}
# Summarize samplesize (contained in the polls data frame)
summary(polls$samplesize)
# Create a histogram showing the distribution of polls with a sample size below 1000. Use the vector `breaks` for your histogram breaks.
hist(polls$samplesize[polls$samplesize < 1000], breaks)
# Create a new data frame that contains only polls with a sample size greater than 400: polls2
polls2 <- polls[polls$samplesize > 400, ]
```
*** =sct
```{r}
# SCT written with testwhat: https://github.com/datacamp/testwhat/wiki
test_function("hist", args = "x")
test_object("polls2")
test_error()
success_msg("Well done! You've removed a key source of inaccuracy in your data. With these refined data, you are ready to produce more accurate plots.")
```
--- type:NormalExercise lang:r xp:100 skills:1 key:ede1dd57e1
## Visualizing the refined data
Now that you've refined your polling data, you're ready to produce more accurate plots. In this exercise, you'll create new plots for the Democratic and Republican primaries using the `ggplot2` package.
To view both plots at once, you'll need to use the [grid.arrange()](http://www.rdocumentation.org/packages/gridExtra/versions/2.2.1/topics/arrangeGrob) command from the [gridExtra](http://www.rdocumentation.org/packages/gridExtra/versions/2.2.1) package. This command creates a grid which you can fill with plot objects.
`polls2` is available in your workspace, and `ggplot2` and `gridExtra` have been preloaded.
*** =instructions
- Recreate your plot of the Democratic polling data using `polls2`. Remember to use `subset` with `type == "dem"`, set the y-axis to `percent`, set the x-axis to `polldate`, and set color to `candidate`. Save this plot as `dem_plot2`.
- Create a similar plot using the Republican polling data contained in `polls2`. Save this plot as `rep_plot2`.
- View both plots together using `grid.arrange()`.
*** =hint
- Use `subset(polls2, polls2$type == "dem")` to select only Democratic primaries. Use a similar command for Republican primaries (`type == "rep"`).
- Make sure you set y = `percent`, x = `polldate`, and col = `candidate` in each plot.
*** =pre_exercise_code
```{r}
polls <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1635/datasets/polls.csv")
polls$polldate <- as.Date(polls$polldate)
polls$electiondate <- as.Date(polls$electiondate)
polls2 <- polls[polls$samplesize > 400, ]
library(ggplot2)
library(gridExtra)
```
*** =sample_code
```{r}
# Recreate your plot of the Democratic polling data using polls2: dem_plot2
dem_plot2 <- ggplot(data = subset(___), aes(y = ___, x = ___, col = ___)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.5, se = FALSE, na.rm = TRUE)
# Create a similar plot for the Republican primaries: rep_plot2
rep_plot2 <-
# View both plots together (do not modify this command)
grid.arrange(dem_plot2, rep_plot2, nrow = 2)
```
*** =solution
```{r}
# Recreate your plot of the Democratic polling data using polls2: dem_plot2
dem_plot2 <- ggplot(data = subset(polls2, polls2$type == "dem"), aes(y = percent, x = polldate, col = candidate)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.5, se = FALSE, na.rm = TRUE)
# Create a similar plot for the Republican primaries: rep_plot2
rep_plot2 <- ggplot(data = subset(polls2, polls2$type == "rep"), aes(y = percent, x = polldate, col = candidate)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.5, se = FALSE, na.rm = TRUE)
# View both plots together (do not modify this command)
grid.arrange(dem_plot2, rep_plot2, nrow = 2)
```
*** =sct
```{r}
# SCT written with testwhat: https://github.com/datacamp/testwhat/wiki
test_function("ggplot", args = "data", index = 1,
not_called_msg = "You didn't call `ggplot()`!",
incorrect_msg = "You didn't call `ggplot(data=...)` with the correct argument")
test_function("ggplot", args = "mapping", index = 1)
test_function("ggplot", args = "data", index = 2,
not_called_msg = "You didn't call `ggplot()`!",
incorrect_msg = "You didn't call `ggplot(data=...)` with the correct argument")
test_function("ggplot", args = "mapping", index = 2)
test_error()
success_msg("Excellent! Those plots provide a much better basis for drawing conclusions about the popularity of each candidate.")
```
--- type:NormalExercise lang:r xp:100 skills:1 key:dfbe1e6b60
## The fifty-state strategy
Your plots look great! Now it's time to take a closer look at the polling data. Did you notice these polls are done at the state level? Especially during the primaries, campaigns tend to focus on winning individual states, rather than maintaining popularity across the country. Instead of averaging across all states, it might make sense to view polling data state-by-state.
In this exercise, you'll compare polls in the Republican primaries across three important early states: Iowa, New Hampshire, and South Carolina. This time, you'll keep the confidence intervals to help see how the candidates compare.
`polls2` and the `ggplot2` and `gridExtra` packages are preloaded into your workspace.
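Selecting one party in one state means combining two conditions. The sketch below shows the idea on a small invented data frame (`toy_polls2`, a hypothetical stand-in for `polls2`), using the element-wise `&` operator inside `subset()`.

```r
# Hypothetical toy data with the same column names as polls2 (values are invented)
toy_polls2 <- data.frame(
  type     = c("rep", "rep", "dem", "rep"),
  location = c("IA", "NH", "IA", "IA"),
  percent  = c(28, 33, 48, 25),
  stringsAsFactors = FALSE
)

# Keep rows where both conditions hold: Republican polls in Iowa
ia_rep <- subset(toy_polls2, type == "rep" & location == "IA")

# The same selection written with bracket indexing
ia_rep_alt <- toy_polls2[toy_polls2$type == "rep" & toy_polls2$location == "IA", ]

nrow(ia_rep)
```

Use `&` rather than `&&` here: `&` compares the columns row by row, which is what a row filter needs.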
*** =instructions
- Create a plot using `ggplot()` that includes only polling data for Republicans (`type == "rep"`) in the state of Iowa (`location == "IA"`) drawn from `polls2`. Save this plot to `ia_plot`.
- Create a second plot that includes only polling data for Republicans in New Hampshire (`location == "NH"`). Save this plot to `nh_plot`.
- Create a third plot that includes only polling data for Republicans in South Carolina (`location == "SC"`). Save this plot to `sc_plot`.
- Take a look at all three plots together using `grid.arrange()`.
*** =hint
- Use `subset(polls2, polls2$type == "rep" & location == "IA")` to select only Republican primary polls in Iowa.
- Make sure you set y = `percent`, x = `polldate`, and col = `candidate` in each plot.
*** =pre_exercise_code
```{r}
polls <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1635/datasets/polls.csv")
polls$polldate <- as.Date(polls$polldate)
polls$electiondate <- as.Date(polls$electiondate)
polls2 <- polls[polls$samplesize > 400, ]
library(ggplot2)
library(gridExtra)
```
*** =sample_code
```{r}
# Create a plot that includes only polling data for Republicans in Iowa: ia_plot
ia_plot <- ggplot(data = subset(___), aes(y = ___, x = ___, col = ___)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.7, na.rm = TRUE) +
labs(title="Iowa")
# Create another plot that includes only polling data for Republicans in New Hampshire: nh_plot
nh_plot <- ggplot(data = subset(___), aes(y = ___, x = ___, col = ___)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.7, na.rm = TRUE) +
labs(title="New Hampshire")
# Create another plot that includes only polling data for Republicans in South Carolina: sc_plot
sc_plot <- ggplot(data = subset(___), aes(y = ___, x = ___, col = ___)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.7, na.rm = TRUE) +
labs(title="South Carolina")
# Take a look at all three plots together (do not modify this command)
grid.arrange(ia_plot, nh_plot, sc_plot, nrow=3)
```
*** =solution
```{r}
# Create a plot that includes only polling data for Republicans in Iowa: ia_plot
ia_plot <- ggplot(data = subset(polls2, polls2$type == "rep" & location == "IA"), aes(y = percent, x = polldate, col = candidate)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.7, na.rm = TRUE) +
labs(title="Iowa")
# Create another plot that includes only polling data for Republicans in New Hampshire: nh_plot
nh_plot <- ggplot(data = subset(polls2, polls2$type == "rep" & location == "NH"), aes(y = percent, x = polldate, col = candidate)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.7, na.rm = TRUE) +
labs(title="New Hampshire")
# Create another plot that includes only polling data for Republicans in South Carolina: sc_plot
sc_plot <- ggplot(data = subset(polls2, polls2$type == "rep" & location == "SC"), aes(y = percent, x = polldate, col = candidate)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.7, na.rm = TRUE) +
labs(title="South Carolina")
# Take a look at all three plots together (do not modify this command)
grid.arrange(ia_plot, nh_plot, sc_plot, nrow=3)
```
*** =sct
```{r}
# SCT written with testwhat: https://github.com/datacamp/testwhat/wiki
test_function("ggplot", args = "data", index = 1,
not_called_msg = "You didn't call `ggplot()`!",
incorrect_msg = "You didn't call `ggplot(data=...)` with the correct argument")
test_function("ggplot", args = "mapping", index = 1)
test_function("ggplot", args = "data", index = 2,
not_called_msg = "You didn't call `ggplot()`!",
incorrect_msg = "You didn't call `ggplot(data=...)` with the correct argument")
test_function("ggplot", args = "mapping", index = 2)
test_function("ggplot", args = "data", index = 3,
not_called_msg = "You didn't call `ggplot()`!",
incorrect_msg = "You didn't call `ggplot(data=...)` with the correct argument")
test_function("ggplot", args = "mapping", index = 3)
test_error()
success_msg("Great job! It looks like Iowa was a tight race, but Donald Trump had large leads in both New Hampshire and South Carolina.")
```
--- type:MultipleChoiceExercise lang:r xp:50 skills:1 key:0d50c40c7f
## Stronger predictions
Now that you've refined and visualized your polling data, you should have an easier time drawing conclusions from polls across different units (in this case, different parties and states). But be wary - polls are still only an approximation of voter preferences.
Which of the following conclusions can you make from the plots you constructed in the previous exercise?
*** =instructions
- There is a 100% chance that Donald Trump will win New Hampshire.
- There was never a chance of Marco Rubio winning in South Carolina.
- John Kasich will not win in New Hampshire.
- We can't be sure who will win in Iowa.
*** =hint
Have a look at the plots. Which candidate appears to be leading in each state? What can these data tell us about voting outcomes?
*** =pre_exercise_code
```{r}
library(ggplot2)
library(gridExtra)
polls <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1635/datasets/polls.csv")
polls$polldate <- as.Date(polls$polldate)
polls$electiondate <- as.Date(polls$electiondate)
polls2 <- polls[polls$samplesize > 400, ]
ia_plot <- ggplot(data = subset(polls2, polls2$type == "rep" & location == "IA"), aes(y = percent, x = polldate, col = candidate)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.7, na.rm = TRUE) +
labs(title="Iowa")
nh_plot <- ggplot(data = subset(polls2, polls2$type == "rep" & location == "NH"), aes(y = percent, x = polldate, col = candidate)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.7, na.rm = TRUE) +
labs(title="New Hampshire")
sc_plot <- ggplot(data = subset(polls2, polls2$type == "rep" & location == "SC"), aes(y = percent, x = polldate, col = candidate)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
geom_smooth(span = 0.7, na.rm = TRUE) +
labs(title="South Carolina")
grid.arrange(ia_plot, nh_plot, sc_plot, nrow=3)
```
*** =sct
```{r}
# SCT written with testwhat: https://github.com/datacamp/testwhat/wiki
msg_bad <- "That's not quite right. Even the best polls are just estimates based on public opinion."
msg_success <- "Great job! Remember, even the strongest polls are limited."
test_mc(correct = 4, feedback_msgs = c(msg_bad, msg_bad, msg_bad, msg_success))
```
--- type:NormalExercise lang:r xp:100 skills:1 key:e7163b7798
## Mapping polling data
Now that you've looked at trends in a few important states, you have a good idea of the variation in polling from one state to another. A valuable way to visualize this variation is to attach polling data to a map of the United States.
In this exercise, you'll use the [maps](http://www.rdocumentation.org/packages/maps/versions/3.1.1) package to generate a map of candidate support across each state in your sample. The `maps` package contains coordinates for important geographic and political units worldwide which R can use to generate maps. Before you can generate a map from your data, you'll need to merge your polling data with geographic information for each state.
A new dataset containing only the final polling data for Bernie Sanders before each state's primary (`sanders`) has been preloaded into your workspace. For plotting purposes, you'll use the `ggplot2` package and the [ggthemes](http://www.rdocumentation.org/packages/ggthemes/versions/3.2.0) package, the latter of which provides some useful pre-defined themes for plots made with `ggplot()`. Both packages are preloaded for you.
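The merge-then-reorder step matters because `merge()` sorts its result by the key column by default, while a polygon layer draws vertices in row order. The sketch below illustrates this with two small invented data frames (`toy_states` and `toy_sanders`, hypothetical stand-ins for `states` and `sanders`; the coordinates and percentages are made up).

```r
# Hypothetical outline data: each row is one polygon vertex, drawn in `order`
toy_states <- data.frame(
  region = c("vermont", "vermont", "iowa", "iowa"),
  order  = c(1, 2, 3, 4),
  long   = c(-72.5, -72.6, -91.2, -91.1),
  lat    = c(44.0, 44.1, 43.5, 43.4)
)

# Hypothetical polling data: one row per state
toy_sanders <- data.frame(
  region  = c("vermont", "iowa"),
  percent = c(86, 47)
)

# Merge on region; all = TRUE keeps states that have no polling data
toy_map <- merge(toy_states, toy_sanders, by = "region", all = TRUE)
toy_map$order  # no longer 1, 2, 3, 4: merge() sorted the rows by region

# Restore the original vertex order before plotting
toy_map <- toy_map[order(toy_map$order), ]
toy_map$order
```

Skipping the reorder step typically produces a map with jagged lines crossing the states, because the vertices are connected in the wrong sequence.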
*** =instructions
- Load the `maps` package to gain access to geographic datasets and save the `states` data as an object.
- Use [merge()](http://www.rdocumentation.org/packages/base/versions/3.3.1/topics/merge) to combine `states` and `sanders` on the column `region`. Save this new data frame as `sanders_map`.
- Reorder `sanders_map` according to `order` to make sure R knows how to plot your data.
- Use `ggplot()` to project Bernie Sanders' polling data onto a map of the United States. The `theme_map()` option (from the `ggthemes` package) is a convenient way to produce a crisp and clean map.
*** =hint
- Be sure to specify `states` and `sanders` as the two data frames for your `merge()` command (in that order).
- Order the `sanders_map` data according to the `order` column.
*** =pre_exercise_code
```{r}
sanders <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1635/datasets/sanders.csv")
library(ggplot2)
library(ggthemes)
```
*** =sample_code
```{r}
# Load the maps package and state data
library(maps)
states <- map_data("state")
# Merge states and sanders by region: sanders_map
sanders_map <- merge(___, ___, by = ___, all = TRUE)
# Reorder Sanders map according to the maps package order column
sanders_map <- ___[order(sanders_map$___ ), ]
# Use ggplot() to produce a map of Bernie Sanders polling data (do not modify this command)
ggplot() +
geom_polygon(data = sanders_map, aes(x = long, y = lat, group = group, fill = percent)) +
labs(title = "Support for Bernie Sanders at Last Poll Before Primary") +
theme_map()
```
*** =solution
```{r}
# Load the maps package and state data
library(maps)
states <- map_data("state")
# Merge states and sanders by region: sanders_map
sanders_map <- merge(states, sanders, by = "region", all = TRUE)
# Reorder Sanders map according to the maps package order column
sanders_map <- sanders_map[order(sanders_map$order), ]
# Use ggplot() to produce a map of Bernie Sanders polling data (do not modify this command)
ggplot() +
geom_polygon(data = sanders_map, aes(x = long, y = lat, group = group, fill = percent)) +
labs(title = "Support for Bernie Sanders at Last Poll Before Primary") +
theme_map()
```
*** =sct
```{r}
# SCT written with testwhat: https://github.com/datacamp/testwhat/wiki
test_function("merge", args = c("x", "y"),
not_called_msg = "You didn't call `merge()`!",
incorrect_msg = "You didn't call `merge( x = ..., y = ...)` with the correct arguments")
test_object("states")
test_object("sanders_map")
test_error()
success_msg("Excellent! It looks like Bernie Sanders was polling very well before the Vermont primary. No surprise there. Sanders' polling numbers across the South are much lower.")
```
--- type:NormalExercise lang:r xp:100 skills:1 key:341d0fbe4d
## Interactive maps using googleVis
Your map looks excellent! Attaching polling data to a map allows for easy and intuitive identification of regional trends for each candidate.
However, a color gradient alone makes it difficult to identify specific polling numbers in each state. Ideally, you want your map to display general trends and specific information without being too cluttered. To accomplish this, you'll use the [googleVis](http://www.rdocumentation.org/packages/googleVis/versions/0.6.0) package to generate an interactive map from your Sanders polling data.
The `googleVis` package allows you to create interactive charts and maps in R by providing a direct interface to the Google Charts API. Unlike the `maps` package, maps in `googleVis` do not require you to attach geographic coordinates to your data as long as they contain relevant geographic names (in this case, states). The [gvisGeoChart()](http://www.rdocumentation.org/packages/googleVis/versions/0.6.0/topics/gvisGeoChart) command requires you to specify your data, a location variable, and a color variable.
The Bernie Sanders polling data (`sanders`) is preloaded into your environment.
*** =instructions
- Load the `googleVis` package.
- Generate a googleVis object (`sanders_gvis`) by specifying `sanders` as the data, setting the location variable to `location`, and setting the color variable to `percent`.
- Plot `sanders_gvis` using the base R plot command (`plot()`).
*** =hint
- The `gvisGeoChart()` command requires you to specify data, a location variable, and a color variable. Be sure to use quotes when specifying variables.
*** =pre_exercise_code
```{r}
sanders <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1635/datasets/sanders.csv")
suppressPackageStartupMessages(library(googleVis))
options(gvis.plot.tag = 'chart')
```
*** =sample_code
```{r}
# Load the googleVis package
# Generate a googleVis map object: sanders_gvis
sanders_gvis <- gvisGeoChart(data = ___, locationvar = ___, colorvar = ___,
options=list(region="US",
displayMode="regions",
resolution="provinces"))
# Plot your googleVis object
```
*** =solution
```{r}
# Load the googleVis package
library(googleVis)
# Generate a googleVis map object: sanders_gvis
sanders_gvis <- gvisGeoChart(data = sanders, locationvar = "location", colorvar = "percent",
options=list(region="US",
displayMode="regions",
resolution="provinces"))
# Plot your googleVis object
plot(sanders_gvis)
```
*** =sct
```{r}
# SCT written with testwhat: https://github.com/datacamp/testwhat/wiki
test_function("gvisGeoChart", args = "data")
test_function("gvisGeoChart", args = "locationvar")
test_function("gvisGeoChart", args = "colorvar")
test_error()
success_msg("Great job! Now that you have an interactive map, you can get a general overview of trends across the United States as well as specific polling numbers for each state.")
```