Erika Duan 2021-08-06
- Introduction to expectation
- Introduction to variance
- Introduction to covariance
- Introduction to the Pearson correlation coefficient
- Other resources
# Load required packages -------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,
               tidyverse,
               patchwork)
The expectation of a random variable X, or E(X), is its long-term average.
When we have a simple list of numbers, for example a sample of plant heights, the expectation is just the average and is calculated using the following equation:

\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i

However, in statistics, we are usually interested in modelling a random variable's probability distribution, which is described by a function rather than a simple list of numbers.
The expectation of a discrete probability distribution is described by the following equation:

E(X) = \sum_{i} x_i \cdot P(X = x_i)
Imagine that each person in the community independently has a probability of 0.15 of testing positive for COVID-19.
- What is the probability that 0 people test positive for COVID-19 out of a random sample of 10 people?
- What is the probability that 1 person tests positive for COVID-19 out of a random sample of 10 people?
- What is the probability that 2 or fewer people test positive for COVID-19 out of a random sample of 10 people?
# Calculate P(X = 0) when n = 10 -----------------------------------------------
dbinom(x = 0, size = 10, prob = 0.15)
#> [1] 0.1968744
# Calculate P(X = 1) when n = 10 -----------------------------------------------
dbinom(x = 1, size = 10, prob = 0.15)
#> [1] 0.3474254
# Calculate P(X <= 2) when n = 10 ----------------------------------------------
dbinom(x = 0:2, size = 10, prob = 0.15) %>%
reduce(sum)
#> [1] 0.8201965
pbinom(q = 2, size = 10, prob = 0.15, lower.tail = T)
#> [1] 0.8201965
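For comparison, the probability that at least 2 people test positive comes from the upper tail of the same distribution. This is a minimal sketch reusing the parameters above, where lower.tail = FALSE returns P(X > q).

# Calculate P(X >= 2) when n = 10 ----------------------------------------------
pbinom(q = 1, size = 10, prob = 0.15, lower.tail = FALSE)
#> [1] 0.4557002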
- What is the probability distribution for the number of people testing positive for COVID-19 out of a random sample of 10 people?
- What is the average number of people that will test positive for COVID-19 out of a random sample of 10 people?
# Calculate binomial distribution ----------------------------------------------
x1 <- c(0:10)
p_x1 <- dbinom(x = 0:10, size = 10, prob = 0.15)
# Calculate expectation of binomial distribution -------------------------------
expectation_b1 <- sum(x1 * p_x1)
expectation_b1
#> [1] 1.5
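As a quick sanity check, a binomial distribution also has the closed form E(X) = np, which matches the summed value above. This is a minimal sketch using the same n and p.

# Sanity check: for a binomial distribution, E(X) = n * p ----------------------
10 * 0.15
#> [1] 1.5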
# Plot probability mass function for binomial distribution ---------------------
binom_dist <- tibble(x = c(0:10),
                     prob = dbinom(x = 0:10, size = 10, prob = 0.15))

binom_dist %>%
  ggplot(aes(x = x, y = prob)) +
  geom_segment(aes(x = x, xend = x, y = 0, yend = prob)) +
  geom_point(size = 3, shape = 21, fill = "linen") +
  geom_vline(xintercept = expectation_b1, colour = "firebrick", linetype = "dotted") +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  labs(x = "X",
       y = "Probability",
       title = "Probability mass function for a binomial distribution") +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 1.85, y = 0.6, label = "E(X)", colour = "firebrick")
The expectation of a continuous probability distribution is described by the following equation:

E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

The function f(x) represents the height of the curve at point x, and \int_{a}^{b} f(x) \, dx represents the probability of X falling within the value range [a, b]. The expectation of X, where f(x) is its probability density function, is therefore the area under the curve of x \cdot f(x), or \int_{-\infty}^{\infty} x \cdot f(x) \, dx.
Imagine that the height of tomato plants in a field is normally distributed. It is known that the average tomato plant height is 140 cm, with a standard deviation of 16 cm.
- What is the probability distribution for tomato plant height if the height of 100 plants was randomly measured?
- What is the mean tomato plant height if the height of 100 plants was randomly measured? How close is this to the given average of 140 cm?
# Create a grid of values from a normal distribution ---------------------------
norm_dist <- tibble(x = seq(92, 188, length = 100),
                    prob_density = dnorm(x, mean = 140, sd = 16))
# Calculate expectation of normal distribution ---------------------------------
# Store x * f(x) where f(x) is a normal distribution with mean 140 and sd 16
funs_x_fx <- function(x) x * (1/ (sqrt(2 * pi * 16 ^ 2)) * exp(-((x - 140) ^ 2) / (2 * 16 ^ 2)))
expectation_n1 <- integrate(funs_x_fx, lower = 92, upper = 188)
expectation_n1
#> 139.622 with absolute error < 0.00013
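The hand-coded density above is just the normal probability density function, so the result can be cross-checked with dnorm() over the same integration limits. This is a minimal sketch of that check.

# Cross-check E(X) using dnorm() instead of the hand-coded density -------------
integrate(function(x) x * dnorm(x, mean = 140, sd = 16),
          lower = 92, upper = 188) # returns ~139.6, matching expectation_n1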
# Plot probability density function for normal distribution --------------------
norm_dist %>%
  ggplot(aes(x = x, y = prob_density)) +
  geom_line() +
  geom_vline(xintercept = expectation_n1[[1]], colour = "firebrick", linetype = "dotted") +
  scale_x_continuous(breaks = seq(90, 190, 10)) +
  labs(x = "X",
       y = "Probability density",
       title = "Probability density function for a normal distribution") +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 145, y = 0.026, label = "E(X)", colour = "firebrick")
The variance of a random variable X, or Var(X), describes how far individual values of x fall from the mean and is defined by the following equation:

Var(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2

Since the expectation of a discrete probability distribution is E(X) = \sum_{i} x_i \cdot P(X = x_i), the variance of a discrete probability distribution is:

Var(X) = \sum_{i} x_i^2 \cdot P(X = x_i) - \left( \sum_{i} x_i \cdot P(X = x_i) \right)^2
Let us return to the scenario where each person in the community has a probability of 0.15 of testing positive for COVID-19.
- What is the variance of the number of people who test positive for COVID-19 out of a random sample of 10 people?
# Calculate variance of binomial distribution ----------------------------------
x1 <- c(0:10)
p_x1 <- dbinom(x = 0:10, size = 10, prob = 0.15)
variance_b1 <- sum((x1 ^ 2) * p_x1) - sum(x1 * p_x1) ^ 2
variance_b1
#> [1] 1.275
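As with the expectation, the binomial variance has the closed form Var(X) = np(1 - p), which agrees with the value above. This is a minimal sketch using the same n and p.

# Sanity check: for a binomial distribution, Var(X) = n * p * (1 - p) ----------
10 * 0.15 * (1 - 0.15)
#> [1] 1.275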
- How does the variance of the number of people who test positive for COVID-19 out of a random sample of 10 people change when the probability of testing positive changes?
# Calculate binomial distributions for p = 0.15 and p = 0.4 --------------------
p_x_all <- c(0.15, 0.4) %>%
  map(~ dbinom(x = 0:10, size = 10, prob = .x)) %>%
  unlist()

binom_dist <- tibble(x = rep(c(0:10), 2),
                     binom_prob = c(paste0("p = ", rep(0.15, 11)),
                                    paste0("p = ", rep(0.4, 11))),
                     prob = p_x_all)
# Calculate expectation and variance when p = 0.4 ------------------------------
expectation_b2 <- sum(x1 * p_x_all[12:22])
expectation_b2
#> [1] 4
variance_b2 <- sum((x1 ^ 2) * p_x_all[12:22]) - sum(x1 * p_x_all[12:22]) ^ 2
variance_b2
#> [1] 2.4
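Increasing p from 0.15 to 0.4 therefore increases both the expectation and the variance. To see the trend more directly, the closed form Var(X) = np(1 - p) can be evaluated across a range of probabilities; the variance peaks at p = 0.5 and shrinks as p approaches 0 or 1. This is a minimal sketch with n fixed at 10 and a placeholder vector p_values.

# Compare binomial variances across a range of probabilities -------------------
p_values <- seq(0.1, 0.9, by = 0.1)
tibble(p = p_values,
       variance = 10 * p_values * (1 - p_values))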
Let us return to the scenario where the height of tomato plants in a field is normally distributed, with an average height of 140 cm and a standard deviation of 16 cm.
- What is the variance of tomato plant height if the height of 100 plants was randomly measured? How close is the corresponding standard deviation to the given value of 16 cm?
# Calculate variance of normal distribution ------------------------------------
# Store x * f(x) where f(x) is a normal distribution with mean 140 and sd 16
funs_x_fx <- function(x) x * (1/ (sqrt(2 * pi * 16 ^ 2)) * exp(-((x - 140) ^ 2) / (2 * 16 ^ 2)))
funs_x2_fx <- function(x) (x ^ 2) * (1/ (sqrt(2 * pi * 16 ^ 2)) * exp(-((x - 140) ^ 2) / (2 * 16 ^ 2)))
expectation_n1 <- integrate(funs_x_fx, lower = 92, upper = 188)
expectation_n1[[1]] ^ 2
#> [1] 19494.31
variance_n1 <- integrate(funs_x2_fx, lower = 92, upper = 188)[[1]] - (expectation_n1[[1]] ^ 2)
sqrt(variance_n1)
#> [1] 17.35727
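The result overshoots 16 cm slightly because the integration limits truncate the distribution at roughly three standard deviations without renormalising. Widening the limits recovers a value much closer to 16. This is a minimal sketch using dnorm() and the placeholder names e_x and e_x2.

# Cross-check Var(X) with wider integration limits -----------------------------
e_x <- integrate(function(x) x * dnorm(x, mean = 140, sd = 16),
                 lower = 60, upper = 220)[[1]]
e_x2 <- integrate(function(x) (x ^ 2) * dnorm(x, mean = 140, sd = 16),
                  lower = 60, upper = 220)[[1]]
sqrt(e_x2 - e_x ^ 2) # approximately 16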
The covariance indicates whether two variables X and Y have a positive linear relationship, a negative linear relationship, or no linear relationship with each other. It is often used as an intermediate step in other algorithms, e.g. PCA.
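For two random variables X and Y, the covariance is commonly defined as:

Cov(X, Y) = E[(X - E(X)) \cdot (Y - E(Y))] = E(XY) - E(X) \cdot E(Y)

In R, cov() computes the sample estimate of this quantity, \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).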
# Covariance depends on variable scale -----------------------------------------
# Assume perfect linear relationship between X and Y
set.seed(111)
X1 <- runif(20, 0, 10) %>%
sort()
Y1 <- X1
# Calculate covariance ---------------------------------------------------------
cov(X1, Y1)
#> [1] 6.352332
cov(4 * X1, 4 * Y1)
#> [1] 101.6373
The following rule also holds, which explains why multiplying both X and Y by 4 increases the covariance by a factor of 16:

Cov(aX, bY) = ab \cdot Cov(X, Y)
Note: A covariance of 0 does not imply that X and Y are independent variables. For example, Y could have a perfect non-linear relationship with respect to X and have a covariance of 0.
# Simulate quadratic relationship between X and Y ------------------------------
quad <- tibble(X = seq(-6, 6, length = 100),
               Y = -0.2 * (X ^ 2) + 6)
cov(quad$X, quad$Y)
#> [1] -9.611069e-16 (i.e. 0)
# Plot quadratic relationship and covariance ---------------------------------
quad %>%
  ggplot(aes(x = X, y = Y)) +
  geom_point(shape = 21, fill = "lavender") +
  geom_vline(xintercept = mean(quad$X),
             linetype = "dotted") +
  labs(title = "Perfect non-linear relationship between X and Y") +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"),
        plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 3.5, y = 5.5, label = "Cov(X, Y) = 0", colour = "firebrick")
The Pearson correlation coefficient measures the strength of the linear relationship between variables X and Y, independent of variable scale.
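It is commonly defined as the covariance scaled by the standard deviations of X and Y, which constrains its value to lie between -1 and 1:

r_{XY} = \frac{Cov(X, Y)}{\sigma_X \, \sigma_Y}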
# Calculate Pearson correlation ------------------------------------------------
# Assume perfect linear relationship between X and Y
set.seed(111)
X1 <- runif(20, 0, 10) %>%
sort()
Y1 <- X1
# Calculate Pearson correlation coefficient ------------------------------------
cor(X1, Y1)
#> [1] 1
cor(4 * X1, 4 * Y1)
#> [1] 1
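The same value can be recovered by scaling the covariance by both standard deviations. This is a minimal sketch reusing X1 and Y1 from above.

# Pearson correlation is the covariance divided by both standard deviations ----
cov(X1, Y1) / (sd(X1) * sd(Y1)) # returns 1, matching cor(X1, Y1)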
- Statistics textbook chapter on expectation and variance from the University of Auckland.
- StatQuest YouTube videos on how to find the expectation for a discrete versus continuous distribution.
- StatQuest YouTube videos on how to find the covariance versus Pearson’s correlation coefficient for two variables.
- A jbstatistics YouTube video on discrete probability distributions.
- A jbstatistics YouTube video on continuous probability distributions.