written by Junvie Pailden
For this week, we need to use the package mosaic
covered in last week's lesson. We also need to use another package called dplyr
. The dplyr
package contains functions that are often used in combination to manage and manipulate data sets in any manner. The data set we need in this lesson is found in the package gapminder
.
# install the necessary packages if it doesn't exist
if (!require(mosaic)) install.packages(`mosaic`)
if (!require(dplyr)) install.packages(`dplyr`)
if (!require(gapminder)) install.packages(`gapminder`)
# load the packages
library(mosaic)
library(dplyr)
library(gapminder)
The dplyr package provide a collection of convenient "verbs" for manipulating data frames. It aims to provide a function for each basic verb of data manipulation:
filter()
to select cases based on their values.arrange()
to reorder the cases.select()
andrename()
to select variables based on their names.mutate()
andtransmute()
to add new variables that are functions of existing variables.summarise()
to condense multiple values to a single value.sample_n()
andsample_frac()
to take random samples.
Each dplyr verb takes a data frame as input and returns a modified version of it.
The Gapminder package contains data on life expectancy, GDP per capita, and population by country. The main data frame gapminder
has 1704 rows and 6 variables. Check ?gapminder
for more information on the data set.
The gapminder
data frame is a special kind of data frame: a tibble. Compared to a data.frame
, tibbles have a refined print method that shows the first 10 rows and columns that fit on the screen - useful when working with moderate to large data sets.
dim(gapminder)
# [1] 1704 6
gapminder
# # A tibble: 1,704 x 6
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453
# 2 Afghanistan Asia 1957 30.332 9240934 820.8530
# 3 Afghanistan Asia 1962 31.997 10267083 853.1007
# 4 Afghanistan Asia 1967 34.020 11537966 836.1971
# 5 Afghanistan Asia 1972 36.088 13079460 739.9811
# 6 Afghanistan Asia 1977 38.438 14880372 786.1134
# 7 Afghanistan Asia 1982 39.854 12881816 978.0114
# 8 Afghanistan Asia 1987 40.822 13867957 852.3959
# 9 Afghanistan Asia 1992 41.674 16317921 649.3414
# 10 Afghanistan Asia 1997 41.763 22227415 635.3414
# # ... with 1,694 more rows
A quick visualization of the data confirms a widely accepted observation that GDP per capital is positive correlated to life expectancy.
xyplot( lifeExp ~ log10(gdpPercap) | continent, # separate panels for continent
data = gapminder,
cex = 0.4, # shrink the points
type = c("p", "r")) # fit a regression line on top of the points
lwd = 3 # set line width
filter()
takes logical expressions and returns the rows for which all are TRUE. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
For example, we can select all countries with the following arguments
- life expectancy below 30 yrs
- United States for years > 1990
- selected countries and years > 2000
filter(gapminder, lifeExp < 30)
# # A tibble: 2 x 6
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453
# 2 Rwanda Africa 1992 23.599 7290203 737.0686
filter(gapminder, country == "United States", year > 1990)
# # A tibble: 4 x 6
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 United States Americas 1992 76.090 256894189 32003.93
# 2 United States Americas 1997 76.810 272911760 35767.43
# 3 United States Americas 2002 77.310 287675526 39097.10
# 4 United States Americas 2007 78.242 301139947 42951.65
filter(gapminder, country %in% c("Rwanda", "United States", "Japan"), year > 2000)
# # A tibble: 6 x 6
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Japan Asia 2002 82.000 127065841 28604.5919
# 2 Japan Asia 2007 82.603 127467972 31656.0681
# 3 Rwanda Africa 2002 43.413 7852401 785.6538
# 4 Rwanda Africa 2007 46.242 8860588 863.0885
# 5 United States Americas 2002 77.310 287675526 39097.0995
# 6 United States Americas 2007 78.242 301139947 42951.6531
slice(gapminder, 1:8)
# # A tibble: 8 x 6
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453
# 2 Afghanistan Asia 1957 30.332 9240934 820.8530
# 3 Afghanistan Asia 1962 31.997 10267083 853.1007
# 4 Afghanistan Asia 1967 34.020 11537966 836.1971
# 5 Afghanistan Asia 1972 36.088 13079460 739.9811
# 6 Afghanistan Asia 1977 38.438 14880372 786.1134
# 7 Afghanistan Asia 1982 39.854 12881816 978.0114
# 8 Afghanistan Asia 1987 40.822 13867957 852.3959
arrange()
works similarly to filter()
except that instead of filtering or selecting rows, it reorders them.
To arrange gapminder
in ascending order of life expectancy.
arrange(gapminder, lifeExp)
# # A tibble: 1,704 x 6
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Rwanda Africa 1992 23.599 7290203 737.0686
# 2 Afghanistan Asia 1952 28.801 8425333 779.4453
# 3 Gambia Africa 1952 30.000 284320 485.2307
# 4 Angola Africa 1952 30.015 4232095 3520.6103
# 5 Sierra Leone Africa 1952 30.331 2143249 879.7877
# 6 Afghanistan Asia 1957 30.332 9240934 820.8530
# 7 Cambodia Asia 1977 31.220 6978607 524.9722
# 8 Mozambique Africa 1952 31.286 6446316 468.5260
# 9 Sierra Leone Africa 1957 31.570 2295678 1004.4844
# 10 Burkina Faso Africa 1952 31.975 4469979 543.2552
# # ... with 1,694 more rows
# Use `desc()` to order a column in descending order:
arrange(gapminder, desc(lifeExp))
# # A tibble: 1,704 x 6
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Japan Asia 2007 82.603 127467972 31656.07
# 2 Hong Kong, China Asia 2007 82.208 6980412 39724.98
# 3 Japan Asia 2002 82.000 127065841 28604.59
# 4 Iceland Europe 2007 81.757 301931 36180.79
# 5 Switzerland Europe 2007 81.701 7554661 37506.42
# 6 Hong Kong, China Asia 2002 81.495 6762476 30209.02
# 7 Australia Oceania 2007 81.235 20434176 34435.37
# 8 Spain Europe 2007 80.941 40448191 28821.06
# 9 Sweden Europe 2007 80.884 9031088 33859.75
# 10 Israel Asia 2007 80.745 6426679 25523.28
# # ... with 1,694 more rows
When you are only interested in a few columns, use select()
to quickly narrow down a subset using operations that only work on numeric variable positions.
# Select columns population and life expectancy by name
select(gapminder, lifeExp, pop)
# # A tibble: 1,704 x 2
# lifeExp pop
# <dbl> <int>
# 1 28.801 8425333
# 2 30.332 9240934
# 3 31.997 10267083
# 4 34.020 11537966
# 5 36.088 13079460
# 6 38.438 14880372
# 7 39.854 12881816
# 8 40.822 13867957
# 9 41.674 16317921
# 10 41.763 22227415
# # ... with 1,694 more rows
# Select all columns from year to population
select(gapminder, year:pop)
# # A tibble: 1,704 x 3
# year lifeExp pop
# <int> <dbl> <int>
# 1 1952 28.801 8425333
# 2 1957 30.332 9240934
# 3 1962 31.997 10267083
# 4 1967 34.020 11537966
# 5 1972 36.088 13079460
# 6 1977 38.438 14880372
# 7 1982 39.854 12881816
# 8 1987 40.822 13867957
# 9 1992 41.674 16317921
# 10 1997 41.763 22227415
# # ... with 1,694 more rows
# Select all columns except those from year to population
select(gapminder, -(year:pop))
# # A tibble: 1,704 x 3
# country continent gdpPercap
# <fctr> <fctr> <dbl>
# 1 Afghanistan Asia 779.4453
# 2 Afghanistan Asia 820.8530
# 3 Afghanistan Asia 853.1007
# 4 Afghanistan Asia 836.1971
# 5 Afghanistan Asia 739.9811
# 6 Afghanistan Asia 786.1134
# 7 Afghanistan Asia 978.0114
# 8 Afghanistan Asia 852.3959
# 9 Afghanistan Asia 649.3414
# 10 Afghanistan Asia 635.3414
# # ... with 1,694 more rows
You can also rename variables with select()
by using named arguments.
select(gapminder, life_expectancy = lifeExp)
# # A tibble: 1,704 x 1
# life_expectancy
# <dbl>
# 1 28.801
# 2 30.332
# 3 31.997
# 4 34.020
# 5 36.088
# 6 38.438
# 7 39.854
# 8 40.822
# 9 41.674
# 10 41.763
# # ... with 1,694 more rows
But because select()
drops all the variables not explicitly mentioned, it's not often useful. Instead, use rename()
rename(gapminder, life_expectancy = lifeExp)
# # A tibble: 1,704 x 6
# country continent year life_expectancy pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453
# 2 Afghanistan Asia 1957 30.332 9240934 820.8530
# 3 Afghanistan Asia 1962 31.997 10267083 853.1007
# 4 Afghanistan Asia 1967 34.020 11537966 836.1971
# 5 Afghanistan Asia 1972 36.088 13079460 739.9811
# 6 Afghanistan Asia 1977 38.438 14880372 786.1134
# 7 Afghanistan Asia 1982 39.854 12881816 978.0114
# 8 Afghanistan Asia 1987 40.822 13867957 852.3959
# 9 Afghanistan Asia 1992 41.674 16317921 649.3414
# 10 Afghanistan Asia 1997 41.763 22227415 635.3414
# # ... with 1,694 more rows
distinct(gapminder, continent)
# # A tibble: 5 x 1
# continent
# <fctr>
# 1 Asia
# 2 Europe
# 3 Africa
# 4 Americas
# 5 Oceania
distinct(gapminder, country, continent)
# # A tibble: 142 x 2
# country continent
# <fctr> <fctr>
# 1 Afghanistan Asia
# 2 Albania Europe
# 3 Algeria Africa
# 4 Angola Africa
# 5 Argentina Americas
# 6 Australia Oceania
# 7 Austria Europe
# 8 Bahrain Asia
# 9 Bangladesh Asia
# 10 Belgium Europe
# # ... with 132 more rows
We often want to string together a series of dplyr functions. This is achieved using dplyr's pipe operator, %>%
. This takes the value on the left, and passes it as the first argument to the function call on the right.
# input +--------+ +--------+ +--------+ result
# data %>% | verb | %>% | verb | %>% | verb | -> data
# frame +--------+ +--------+ +--------+ frame
Say we want to extract the data for Rwanda, but only for selected variables.
gapminder %>%
filter(country == "Rwanda", year > 1980) %>%
select(year, lifeExp)
# # A tibble: 6 x 2
# year lifeExp
# <int> <dbl>
# 1 1982 46.218
# 2 1987 44.020
# 3 1992 23.599
# 4 1997 36.087
# 5 2002 43.413
# 6 2007 46.242
Suppose we wanted to recover each country's GDP. The Gapminder data has a variable for population and GDP per capita so we only need to multiply them together.
gapminder %>%
mutate(gdp = pop * gdpPercap) %>%
select(country, year, gdp)
# # A tibble: 1,704 x 3
# country year gdp
# <fctr> <int> <dbl>
# 1 Afghanistan 1952 6567086330
# 2 Afghanistan 1957 7585448670
# 3 Afghanistan 1962 8758855797
# 4 Afghanistan 1967 9648014150
# 5 Afghanistan 1972 9678553274
# 6 Afghanistan 1977 11697659231
# 7 Afghanistan 1982 12598563401
# 8 Afghanistan 1987 11820990309
# 9 Afghanistan 1992 10595901589
# 10 Afghanistan 1997 14121995875
# # ... with 1,694 more rows
summarise()
collapses a data frame to a single row according to some summary measure (e.g. mean
).
If we are interested to compute the average life expectancy in the data set.
gapminder %>%
summarise(ave.lifeExp = mean(lifeExp))
# # A tibble: 1 x 1
# ave.lifeExp
# <dbl>
# 1 59.47444
We can use group_by()
to get the life expectancy of selected (filter()
) countries across time.
countryS <- c("United States", "Sweden", "Thailand", "China", "Iraq", "Afghanistan")
compare.lifeExp <- gapminder %>%
filter(country %in% countryS) %>%
group_by(year, country)
compare.lifeExp
# # A tibble: 72 x 6
# # Groups: year, country [72]
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453
# 2 Afghanistan Asia 1957 30.332 9240934 820.8530
# 3 Afghanistan Asia 1962 31.997 10267083 853.1007
# 4 Afghanistan Asia 1967 34.020 11537966 836.1971
# 5 Afghanistan Asia 1972 36.088 13079460 739.9811
# 6 Afghanistan Asia 1977 38.438 14880372 786.1134
# 7 Afghanistan Asia 1982 39.854 12881816 978.0114
# 8 Afghanistan Asia 1987 40.822 13867957 852.3959
# 9 Afghanistan Asia 1992 41.674 16317921 649.3414
# 10 Afghanistan Asia 1997 41.763 22227415 635.3414
# # ... with 62 more rows
xyplot(lifeExp ~ year | country, # separate panels the selected countries
data = compare.lifeExp,
type = "l") # line type)
We can also pipe group_by()
and summarise()
to get the average life expectancy grouped by year and continent.
continent.lifeExp <- gapminder %>%
group_by(year, continent) %>%
summarise(ave.lifeExp = mean(lifeExp))
continent.lifeExp
# # A tibble: 60 x 3
# # Groups: year [?]
# year continent ave.lifeExp
# <int> <fctr> <dbl>
# 1 1952 Africa 39.13550
# 2 1952 Americas 53.27984
# 3 1952 Asia 46.31439
# 4 1952 Europe 64.40850
# 5 1952 Oceania 69.25500
# 6 1957 Africa 41.26635
# 7 1957 Americas 55.96028
# 8 1957 Asia 49.31854
# 9 1957 Europe 66.70307
# 10 1957 Oceania 70.29500
# # ... with 50 more rows
xyplot(ave.lifeExp ~ year,
groups = continent, # separate lines for the continents
data = continent.lifeExp,
type = "l", # line type
lwd = 2, # set line width
auto.key = list(space = "right", # place legend on the right
points = FALSE, # omit point in legend
lines = TRUE))
You can use sample_n()
and sample_frac()
to take a random sample of rows: use sample_n()
for a fixed number and sample_frac()
for a fixed fraction.
sample_n(gapminder, 10) # randomly sample 10 rows
# # A tibble: 10 x 6
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Guinea Africa 1982 42.891 4710497 857.2504
# 2 Comoros Africa 1972 48.944 250027 1937.5777
# 3 Belgium Europe 1997 77.530 10199787 27561.1966
# 4 Nicaragua Americas 1952 42.314 1165790 3112.3639
# 5 Jordan Asia 1997 69.772 4526235 3645.3796
# 6 Singapore Asia 1977 70.795 2325300 11210.0895
# 7 Iraq Asia 1952 45.320 5441766 4129.7661
# 8 Jordan Asia 2007 72.535 6053193 4519.4612
# 9 Swaziland Africa 1972 49.552 480105 3364.8366
# 10 Equatorial Guinea Africa 1967 38.987 259864 915.5960
sample_frac(gapminder, 0.01) # randomly sample 1% of the data
# # A tibble: 17 x 6
# country continent year lifeExp pop gdpPercap
# <fctr> <fctr> <int> <dbl> <int> <dbl>
# 1 Colombia Americas 1997 70.313 37657830 6117.3617
# 2 Singapore Asia 1997 77.158 3802309 33519.4766
# 3 Uruguay Americas 1962 68.253 2598466 5603.3577
# 4 Brazil Americas 1972 59.504 100840058 4985.7115
# 5 Botswana Africa 1952 47.622 442308 851.2411
# 6 Poland Europe 1982 71.320 36227381 8451.5310
# 7 Bulgaria Europe 1987 71.340 8971958 8239.8548
# 8 Kuwait Asia 1952 55.565 160000 108382.3529
# 9 Namibia Africa 1987 60.835 1278184 3693.7313
# 10 Gambia Africa 2002 58.041 1457766 660.5856
# 11 Portugal Europe 2002 77.290 10433867 19970.9079
# 12 Afghanistan Asia 2002 42.129 25268405 726.7341
# 13 Thailand Asia 1977 62.494 44148285 1961.2246
# 14 Taiwan Asia 1987 73.400 19757799 11054.5618
# 15 Nepal Asia 1992 55.727 20326209 897.7404
# 16 Norway Europe 1972 74.340 3933004 18965.0555
# 17 Germany Europe 1977 72.500 78160773 20512.9212
Use replace = TRUE
to perform random selection with replacement. More on this next week.