written by Junvie Pailden
The mosaic package was written to simplify the use of R in introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic).
# install mosaic package if it doesn't exist
if (!require(mosaic)) install.packages("mosaic")
# load the package in R
library(mosaic)
Recall the data set chickwts in the package datasets, which is included with every R installation. We can compute the mean of the weight variable in chickwts. The tilde (~) is used to select the variable, and the argument data = chickwts says that the variable weight is in chickwts.
mean(~ weight, data = chickwts)
# [1] 261
The mean function in the mosaic package supports the formula interface common to regression and ANOVA models (more on this later). The same output can be obtained using $ notation.
mean(chickwts$weight)
# [1] 261
We can also tally the count or frequency of various feed supplements.
tally(~ feed, data = chickwts)
# feed
# casein horsebean linseed meatmeal soybean sunflower
# 12 10 12 11 14 12
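Counts are often easier to compare as proportions. The following is a minimal sketch that uses base R's table() and prop.table() instead of mosaic, so it runs without extra packages; the names feed_counts and feed_props are illustrative only. tally() should also be able to report proportions through its format argument, but check the mosaic documentation for your installed version before relying on that.

```r
# Base R sketch (not mosaic): counts and proportions of each feed type
feed_counts <- table(chickwts$feed)      # same counts as tally(~ feed, data = chickwts)
feed_props  <- prop.table(feed_counts)   # each count divided by the total count

print(feed_counts)
print(round(feed_props, 3))
```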
The formula interface allows us to compute the mean weight for each feed supplement. The formula weight ~ feed says that you want the weights grouped by type of feed.
mean(weight ~ feed, data = chickwts)
# casein horsebean linseed meatmeal soybean sunflower
# 324 160 219 277 246 329
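For comparison, the same group means can be obtained without mosaic. This is a sketch of two base R alternatives, not the way mosaic computes them internally; group_means and group_means_df are illustrative names.

```r
# Base R sketch (not mosaic): mean weight for each feed type
group_means <- tapply(chickwts$weight, chickwts$feed, mean)
print(round(group_means))

# aggregate() accepts the same weight ~ feed formula and returns a data frame
group_means_df <- aggregate(weight ~ feed, data = chickwts, FUN = mean)
print(group_means_df)
```

The tapply() version returns a named vector, while aggregate() returns a data frame with one row per feed, which can be more convenient for further processing.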
We can also compute other numerical summaries such as the median (median()), variance (var()), standard deviation (sd()), etc.
median(~ weight, data = chickwts) # median for all weights
# [1] 258
median(weight ~ feed, data = chickwts) # median weight for every type of feed
# casein horsebean linseed meatmeal soybean sunflower
# 342 152 221 263 248 328
sd(~ weight, data = chickwts) # std. deviation for all weights
# [1] 78.1
sd(weight ~ feed, data = chickwts) # std. deviation of weight for every type of feed
# casein horsebean linseed meatmeal soybean sunflower
# 64.4 38.6 52.2 64.9 54.1 48.8
Another handy function in the mosaic package is favstats(), which outputs the
- five-number summary
- mean
- standard deviation
- count
- number of missing values
favstats(~ weight, data = chickwts) # for all weights
# min Q1 median Q3 max mean sd n missing
# 108 204 258 324 423 261 78.1 71 0
favstats(weight ~ feed, data = chickwts) # for every type of feed
# feed min Q1 median Q3 max mean sd n missing
# 1 casein 216 277 342 371 404 324 64.4 12 0
# 2 horsebean 108 137 152 176 227 160 38.6 10 0
# 3 linseed 141 178 221 258 309 219 52.2 12 0
# 4 meatmeal 153 250 263 320 380 277 64.9 11 0
# 5 soybean 158 207 248 270 329 246 54.1 14 0
# 6 sunflower 226 313 328 340 423 329 48.8 12 0
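To see what goes into each column of the favstats() output, the sketch below rebuilds the same quantities with base R functions. basic_stats() is a hypothetical helper written for this illustration and is not part of mosaic; it assumes the quartiles are computed with quantile() at its default settings.

```r
# Hypothetical helper (not part of mosaic): the quantities favstats() reports,
# computed with base R functions only
basic_stats <- function(x) {
  n_missing <- sum(is.na(x))
  x <- x[!is.na(x)]                       # summarize the observed values only
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  c(min = q[1], Q1 = q[2], median = q[3], Q3 = q[4], max = q[5],
    mean = mean(x), sd = sd(x), n = length(x), missing = n_missing)
}

# Summary for all weights
print(round(basic_stats(chickwts$weight), 1))

# Summary for every type of feed: one row per feed after transposing
by_feed <- t(sapply(split(chickwts$weight, chickwts$feed), basic_stats))
print(round(by_feed, 1))
```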
The mosaic package also includes commands for common graphical summaries.
Bargraph for Categorical Variables
bargraph(~ feed, data = chickwts)
Dot plots are often used to describe small numerical data sets.
dotPlot(~ weight, data = chickwts)
Similarly, stem plots are also used to describe small numerical data sets.
stem(chickwts$weight)
#
# The decimal point is 2 digit(s) to the right of the |
#
# 1 | 124444
# 1 | 5566777889
# 2 | 00112223333444
# 2 | 5556666667778
# 3 | 0001222222333334444
# 3 | 5678899
# 4 | 02
Histograms are often used to describe moderate to large numerical data sets.
histogram(~ weight, data = chickwts)
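A histogram is just a count of observations per bin, and it can help to look at those counts directly. The sketch below uses base R's hist() with plot = FALSE, which is a different function from the histogram() call above and may choose different bins; the data frame name bins is illustrative.

```r
# Base R sketch: compute histogram bins without drawing anything
h <- hist(chickwts$weight, plot = FALSE)

bins <- data.frame(
  lower = head(h$breaks, -1),   # left edge of each bin
  upper = tail(h$breaks, -1),   # right edge of each bin
  count = h$counts              # number of weights falling in the bin
)
print(bins)
```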
The formula interface syntax is used for graphical summaries, numerical summaries, and inference procedures.
goal(y ~ x | z, data = ..., groups = ...)
For plots,
- y: y-axis variable
- x: x-axis variable
- z: conditioning variable (separate panels)
- groups: conditioning variable (overlaid graphs)
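As a sketch of the full template in use, the example below fills in all four parts with lattice's xyplot(), the same function used for the scatterplot later in this lesson. Because chickwts has only two variables, this example assumes the built-in ChickWeight data set instead (weight, Time, Chick, Diet), which is a different data set from chickwts.

```r
library(lattice)  # a recommended package included with standard R installations

# y ~ x | z with groups:
#   weight (y) against Time (x), one panel per Diet (the variable after |),
#   and one overlaid line per Chick (groups)
p <- xyplot(weight ~ Time | Diet, data = ChickWeight,
            groups = Chick, type = "l")
print(p)  # lattice plots are drawn when the object is printed
```

Dropping the | Diet part puts every chick in a single panel, and dropping groups = Chick removes the per-chick overlay.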
Dot plots of weight, with one panel for each feed.
dotPlot(~ weight | feed, data = chickwts, cex = 0.8) # reduce the size of the dots by 20%
Boxplots of weight by feed in the same panel.
bwplot(weight ~ feed, data = chickwts)
Consider the Going Wireless Data from the previous week.
wireless.data <- read.csv("https://goo.gl/72BKSf", header = TRUE)
Descriptive summaries for the Going Wireless data
favstats(~ Wireless, data = wireless.data)
# min Q1 median Q3 max mean sd n missing
# 5.1 10.8 15.2 19 25.5 14.8 5.34 51 0
favstats(Wireless ~ Region, data = wireless.data)
# Region min Q1 median Q3 max mean sd n missing
# 1 E 5.1 8.65 11.4 15.2 20.6 11.9 4.59 19 0
# 2 M 6.4 15.10 16.9 21.1 23.2 17.4 4.55 19 0
# 3 W 8.0 10.80 16.3 18.9 25.5 15.3 5.66 13 0
Graphical summaries for the Going Wireless data
dotPlot(~ Wireless | Region, data = wireless.data) # dotplot
bwplot(Wireless ~ Region, data = wireless.data) # boxplots
Recall the flight delays data from last week's session. Load this data set using the function read.csv() we learned last week. Be sure to check that your working directory contains the data set (flight.delay.csv); otherwise, you need to change the working directory (see last week's lesson on how to do this).
getwd() # location of current working directory
# [1] "/Users/JPMac/Google Drive/SIUE_Class/rstatlab/rstatlab/data"
delay <- read.csv("flight.delay.csv", header = TRUE) # load the data
favstats(~ Rate.per.10K.Flights, data = delay)
# min Q1 median Q3 max mean sd n missing
# 0.1 0.8 1.2 1.6 4.9 1.61 1.3 17 0
histogram(~ Rate.per.10K.Flights, data = delay)
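If read.csv() stops with a "cannot open file" error, the file is usually not in the working directory. One way to check before reading is sketched below; read_if_present() is a hypothetical helper written for this illustration, not a function from base R or mosaic.

```r
# Hypothetical helper: read a CSV only if it exists at the given path
read_if_present <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path, " (working directory: ", getwd(), ")")
    return(NULL)                     # signal that nothing was read
  }
  read.csv(path, header = TRUE)
}

# Returns a data frame if the file is found, otherwise NULL with a message
delay_check <- read_if_present("flight.delay.csv")
```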
Scatterplot with Delays on the horizontal axis and Rate.per.10K.Flights on the vertical axis. Scatterplots are mainly used to visualize the relationship between two numerical variables.
xyplot(Rate.per.10K.Flights ~ Delays, data = delay)
The plot suggests a roughly straight-line relationship between the variables Delays and Rate.per.10K.Flights.
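A visual impression of a straight-line relationship can be backed up with a correlation coefficient and a fitted line. Since the flight delays file is not reproduced here, the sketch below uses the built-in cars data (speed and dist) purely as a stand-in; the same calls should apply to the delay data, assuming both of its columns were read in as numeric.

```r
# Sketch on the built-in cars data (speed, dist), NOT the flight delays data
r   <- cor(cars$speed, cars$dist)        # Pearson correlation coefficient
fit <- lm(dist ~ speed, data = cars)     # least-squares line of dist on speed

print(round(r, 3))
print(coef(fit))                         # intercept and slope of the fitted line
```

A correlation close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none.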