-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
148 lines (110 loc) · 6.24 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# hildar
<!-- badges: start -->
[![R-CMD-check](https://github.com/asiripanich/hildar/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/asiripanich/hildar/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
<p align="center">
<img src="https://fbe.unimelb.edu.au/_nocache?a=3881339" title="source: https://fbe.unimelb.edu.au/_nocache?a=3881339" width="400"/>
</p>
[HILDA survey data](https://melbourneinstitute.unimelb.edu.au/hilda) is a large panel survey 20 waves (2001 - 2020) and counting!
Some waves have more than 5000 variables, which means reading them into R is a little challenging (personally I think it is wayyyyyyyy too slow).
The goal of this package is to provide a quick and easy way to query HILDA data into R.
This is possible by converting each wave of HILDA from its STATA file (`.dta`), one of the three formats HILDA provides, to `fst` format. [fst](https://github.com/fstpackage/fst) is a binary format and can be read much much quicker than `.dta` in R.
| Function name | Description |
|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `hil_setup()` | Setup HILDA fst files for `hil_fetch() to use`. |
| `hil_fetch()` | Fetches HILDA records based on query options. |
| `hil_dict()` | Shows HILDA data glossary and waves each variable is available in. This provides a convenient way to select multiple variables based on their description by passing it to `hil_fetch()`. |
| `hil_vars()` | Returns all variables where their variable names match a regular expression. |
| `hil_labs()` | Returns all variables where their labels match a regular expression. |
| `hil_browse()`| Opens up the HILDA data dictionary page on your default web browser. |
| `hil_crosswave_info()`| Takes a variable name and search for its cross wave information. |
| `hil_var_details()`| Similar to `hil_crosswave_info()` but it searches for a variable's details. |
## Installation
The development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("asiripanich/hildar")
```
## Setup
### 1) Store HILDA as `.fst` files
Use `hil_setup()` to read HILDA STATA (.dta) files and save them as `.fst` files.
`.fst` is a binary data format that can be read very quickly, a lot faster than `.dta`.
An additional benefit of `hil_setup()` is that it creates a HILDA dictionary file that you can later call using `hil_dict()`.
Having a functional `hil_dict()` allows the user to use `hil_vars()` and `hil_labs()` for searching variable names using a regular expression.
This will allow you to fast query HILDA data from all waves using `hil_fetch()`.
``` r
hil_setup(
read_dir = "/path/to/your/hilda-stata-files",
save_dir = "/path/to/save/hilda-fst-files"
)
```
To speed up the setup process, you can select a `future` parallel backend and call it before running your `hil_setup()`.
``` r
library(future)
plan(multisession, workers = 2)
# `hil_setup()` can take several minutes to finish.
# To monitor its progress, you can wrap the function in
# `progressr::with_progress({...}}` like below.
progressr::with_progress({
hil_setup(read_dir = "...", save_dir = "...")
})
```
### 2) Tell `hildar` where the HILDA `.fst` files are stored at.
`hil_fetch()` requires the user to specify where the HILDA fst files generated in the previous step are stored.
You can either set this `HILDA_FST` as a global option or an R environment variable.
Setting this as a persistent option for all your R sessions will make `hil_fetch()` more convinient to use.
Alternatively, you can manually set it in each call using `hilda_fst_dir` argument in `hil_fetch()`.
## Example
Once the setup is completed, you can now start fetching HILDA data with `hildar`!
```{r}
library(hildar)
# fetch removes the HILDA year prefix from all the selected variable
# (e.g. axxx = 2001, bxxx = 2002).
hil_fetch(years = 2001:2003, add_geography = T) %>%
summary()
```
There is a quick option to add basic demographic variables to the data, which is set to TRUE by default.
```{r}
hil_fetch(years = 2001, add_basic_vars = T) %>%
names()
```
How about doing a quick search to find variables that you want? Use `hil_dict` which is a data.table that you can search or view HILDA variables without going to their documentation webpage.
```{r}
hilda_dictionary <- hil_dict()
head(hilda_dictionary)
```
Let say we want to select all variables that are related to 'employment'. Here is how we can easily use the selected employment variables in `hil_fetch()`.
```{r}
hilda_data <- hil_fetch(years = 2001:2003, vars = hil_labs("employment"))
dim(hilda_data)
```
Or if you know the prefix of a subject area that you like to query, you can use `hil_vars(pattern)` to query all variable names that match the pattern.
For example, `hil_vars("^ff")` will get all variables in subject area 'Health' and nested area 'Heath - diet'.
```{r}
hilda_data <- hil_fetch(years = 2001:2003, vars = hil_vars("^ff"))
dim(hilda_data)
```
To set default variables to be loaded every time you call `hil_fetch()`, see `hil_user_default_vars()`.
Here is a summary of the dimensions of our HILDA data files.
```{r}
# the number of variables and rows in each wave
nrows_by_wave <-
hil_fetch(years = 2001:2020, add_basic_vars = F) %>%
.[, .(number_of_rows = .N), by = wave]
hilda_dictionary[, unlist(wave), by = .(var, label)] %>%
data.table::setnames("V1", "wave") %>%
.[!is.na(wave), .(number_of_variables = .N), by = wave] %>%
merge(nrows_by_wave, by = "wave")
```