-
Notifications
You must be signed in to change notification settings - Fork 230
/
import.Rmd
368 lines (264 loc) · 11 KB
/
import.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
---
output: html_document
editor_options:
chunk_output_type: console
---
# Data import {#data-import .r4ds-section}
## Introduction {#introduction-5 .r4ds-section}
```{r results='hide',message=FALSE,cache=FALSE}
library("tidyverse")
```
## Getting started {#getting-started .r4ds-section}
### Exercise 11.2.1 {.unnumbered .exercise data-number="11.2.1"}
<div class="question">
What function would you use to read a file where fields were separated with “|”?
</div>
<div class="answer">
Use the `read_delim()` function with the argument `delim="|"`.
```{r eval=FALSE}
read_delim(file, delim = "|")
```
</div>
### Exercise 11.2.2 {.unnumbered .exercise data-number="11.2.2"}
<div class="question">
Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?
</div>
<div class="answer">
They have the following arguments in common:
```{r}
intersect(names(formals(read_csv)), names(formals(read_tsv)))
```
- `col_names` and `col_types` are used to specify the column names and how to parse the columns
- `locale` is important for determining things like the encoding and whether "." or "," is used as a decimal mark.
- `na` and `quoted_na` control which strings are treated as missing values when parsing vectors
- `trim_ws` trims whitespace before and after cells before parsing
- `n_max` sets how many rows to read
- `guess_max` sets how many rows to use when guessing the column type
- `progress` determines whether a progress bar is shown.
In fact, the two functions have the exact same arguments:
```{r}
identical(names(formals(read_csv)), names(formals(read_tsv)))
```
</div>
### Exercise 11.2.3 {.unnumbered .exercise data-number="11.2.3"}
<div class="question">
What are the most important arguments to `read_fwf()`?
</div>
<div class="answer">
The most important argument to `read_fwf()` which reads "fixed-width formats", is `col_positions` which tells the function where data columns begin and end.
</div>
### Exercise 11.2.4 {.unnumbered .exercise data-number="11.2.4"}
<div class="question">
Sometimes strings in a CSV file contain commas.
To prevent them from causing problems they need to be surrounded by a quoting character, like `"` or `'`.
By convention, `read_csv()` assumes that the quoting character will be `"`, and if you want to change it you’ll need to use `read_delim()` instead.
What arguments do you need to specify to read the following text into a data frame?
```
"x,y\n1,'a,b'"
```
</div>
<div class="answer">
For `read_delim()`, we will will need to specify a delimiter, in this case `","`, and a quote argument.
```{r}
x <- "x,y\n1,'a,b'"
read_delim(x, ",", quote = "'")
```
However, this question is out of date. `read_csv()` now supports a quote argument, so the following code works.
```{r}
read_csv(x, quote = "'")
```
</div>
### Exercise 11.2.5 {.unnumbered .exercise data-number="11.2.5"}
<div class="question">
Identify what is wrong with each of the following inline CSV files.
What happens when you run the code?
</div>
<div class="answer">
```{r}
read_csv("a,b\n1,2,3\n4,5,6")
```
Only two columns are specified in the header "a" and "b", but the rows have three columns, so the last column is dropped.
```{r}
read_csv("a,b,c\n1,2\n1,2,3,4")
```
The numbers of columns in the data do not match the number of columns in the header (three).
In row one, there are only two values, so column `c` is set to missing.
In row two, there is an extra value, and that value is dropped.
```{r}
read_csv("a,b\n\"1")
```
It's not clear what the intent was here.
The opening quote `"1` is dropped because it is not closed, and `a` is treated as an integer.
```{r}
read_csv("a,b\n1,2\na,b")
```
Both "a" and "b" are treated as character vectors since they contain non-numeric strings.
This may have been intentional, or the author may have intended the values of the columns to be "1,2" and "a,b".
```{r}
read_csv("a;b\n1;3")
```
The values are separated by ";" rather than ",". Use `read_csv2()` instead:
```{r}
read_csv2("a;b\n1;3")
```
</div>
## Parsing a vector {#parsing-a-vector .r4ds-section}
### Exercise 11.3.1 {.unnumbered .exercise data-number="11.3.1"}
<div class="question">
What are the most important arguments to `locale()`?
</div>
<div class="answer">
The locale object has arguments to set the following:
- date and time formats: `date_names`, `date_format`, and `time_format`
- time zone: `tz`
- numbers: `decimal_mark`, `grouping_mark`
- encoding: `encoding`
</div>
### Exercise 11.3.2 {.unnumbered .exercise data-number="11.3.2"}
<div class="question">
What happens if you try and set `decimal_mark` and `grouping_mark` to the same character?
What happens to the default value of `grouping_mark` when you set `decimal_mark` to `","`?
What happens to the default value of `decimal_mark` when you set the `grouping_mark` to `"."`?
</div>
<div class="answer">
If the decimal and grouping marks are set to the same character, `locale` throws an error:
```{r error=TRUE}
locale(decimal_mark = ".", grouping_mark = ".")
```
If the `decimal_mark` is set to the comma "`,"`, then the grouping mark is set to the period `"."`:
```{r}
locale(decimal_mark = ",")
```
If the grouping mark is set to a period, then the decimal mark is set to a comma
```{r}
locale(grouping_mark = ".")
```
</div>
### Exercise 11.3.3 {.unnumbered .exercise data-number="11.3.3"}
<div class="question">
I didn’t discuss the `date_format` and `time_format` options to `locale()`.
What do they do?
Construct an example that shows when they might be useful.
</div>
<div class="answer">
They provide default date and time formats.
The [readr vignette](https://cran.r-project.org/web/packages/readr/vignettes/locales.html) discusses using these to parse dates: since dates can include languages specific weekday and month names, and different conventions for specifying AM/PM
```{r}
locale()
```
Examples from the readr vignette of parsing French dates
```{r}
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr"))
```
Both the date format and time format are used for guessing column types.
Thus if you were often parsing data that had non-standard formats for the date and time, you could specify custom values for `date_format` and `time_format`.
```{r}
locale_custom <- locale(date_format = "Day %d Mon %M Year %y",
time_format = "Sec %S Min %M Hour %H")
date_custom <- c("Day 01 Mon 02 Year 03", "Day 03 Mon 01 Year 01")
parse_date(date_custom)
parse_date(date_custom, locale = locale_custom)
time_custom <- c("Sec 01 Min 02 Hour 03", "Sec 03 Min 02 Hour 01")
parse_time(time_custom)
parse_time(time_custom, locale = locale_custom)
```
</div>
### Exercise 11.3.4 {.unnumbered .exercise data-number="11.3.4"}
<div class="question">
If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
</div>
<div class="answer">
Read the help page for `locale()` using `?locale` to learn about the different variables that can be set.
As an example, consider Australia.
Most of the defaults values are valid, except that the date format is "(d)d/mm/yyyy", meaning that January 2, 2006 is written as `02/01/2006`.
However, default locale will parse that date as February 1, 2006.
```{r}
parse_date("02/01/2006")
```
To correctly parse Australian dates, define a new `locale` object.
```{r}
au_locale <- locale(date_format = "%d/%m/%Y")
```
Using `parse_date()` with the `au_locale` as its locale will correctly parse our example date.
```{r}
parse_date("02/01/2006", locale = au_locale)
```
</div>
### Exercise 11.3.5 {.unnumbered .exercise data-number="11.3.5"}
<div class="question">
What’s the difference between `read_csv()` and `read_csv2()`?
</div>
<div class="answer">
The delimiter. The function `read_csv()` uses a comma, while `read_csv2()` uses a semi-colon (`;`). Using a semi-colon is useful when commas are used as the decimal point (as in Europe).
</div>
### Exercise 11.3.6 {.unnumbered .exercise data-number="11.3.6"}
<div class="question">
What are the most common encodings used in Europe?
What are the most common encodings used in Asia?
Do some googling to find out.
</div>
<div class="answer">
UTF-8 is standard now, and ASCII has been around forever.
For the European languages, there are separate encodings for Romance languages and Eastern European languages using Latin script, Cyrillic, Greek, Hebrew, Turkish: usually with separate ISO and Windows encoding standards.
There is also Mac OS Roman.
For Asian languages Arabic and Vietnamese have ISO and Windows standards. The other major Asian scripts have their own:
- Japanese: JIS X 0208, Shift JIS, ISO-2022-JP
- Chinese: GB 2312, GBK, GB 18030
- Korean: KS X 1001, EUC-KR, ISO-2022-KR
The list in the documentation for `stringi::stri_enc_detect()` is a good list of encodings since it supports the most common encodings.
- Western European Latin script languages: ISO-8859-1, Windows-1250 (also CP-1250 for code-point)
- Eastern European Latin script languages: ISO-8859-2, Windows-1252
- Greek: ISO-8859-7
- Turkish: ISO-8859-9, Windows-1254
- Hebrew: ISO-8859-8, IBM424, Windows 1255
- Russian: Windows 1251
- Japanese: Shift JIS, ISO-2022-JP, EUC-JP
- Korean: ISO-2022-KR, EUC-KR
- Chinese: GB18030, ISO-2022-CN (Simplified), Big5 (Traditional)
- Arabic: ISO-8859-6, IBM420, Windows 1256
For more information on character encodings see the following sources.
- The Wikipedia page [Character encoding](https://en.wikipedia.org/wiki/Character_encoding), has a good list of encodings.
- Unicode [CLDR](http://cldr.unicode.org/) project
- [What is the most common encoding of each language](https://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language) (Stack Overflow)
- "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text", <http://kunststube.net/encoding/>.
Programs that identify the encoding of text include:
- `readr::guess_encoding()`
- `stringi::str_enc_detect()`
- [iconv](https://en.wikipedia.org/wiki/Iconv)
- [chardet](https://github.com/chardet/chardet) (Python)
</div>
### Exercise 11.3.7 {.unnumbered .exercise data-number="11.3.7"}
<div class="question">
Generate the correct format string to parse each of the following dates and times:
</div>
<div class="answer">
```{r}
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
```
The correct formats are:
```{r}
parse_date(d1, "%B %d, %Y")
parse_date(d2, "%Y-%b-%d")
parse_date(d3, "%d-%b-%Y")
parse_date(d4, "%B %d (%Y)")
parse_date(d5, "%m/%d/%y")
parse_time(t1, "%H%M")
```
The time `t2` uses real seconds,
```{r}
parse_time(t2, "%H:%M:%OS %p")
```
</div>
## Parsing a file {#parsing-a-file .r4ds-section}
`r no_exercises()`
## Writing to a file {#writing-to-a-file .r4ds-section}
`r no_exercises()`
## Other types of data {#other-types-of-data .r4ds-section}
`r no_exercises()`