-
Notifications
You must be signed in to change notification settings - Fork 173
/
slr-inf.Rmd
938 lines (648 loc) · 36.4 KB
/
slr-inf.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
# Inference for Simple Linear Regression
```{r, include = FALSE}
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE, fig.align = "center")
```
> "There are three types of lies: lies, damn lies, and statistics."
>
> --- **Benjamin Disraeli**
After reading this chapter you will be able to:
- Understand the distributions of regression estimates.
- Create interval estimates for regression parameters, mean response, and predictions.
- Test for significance of regression.
Last chapter we defined the simple linear regression model,
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\]
where $\epsilon_i \sim N(0, \sigma^2)$. We then used observations $(x_i, y_i)$, for $i = 1, 2, \ldots n$, to find values of $\beta_0$ and $\beta_1$ which minimized
\[
f(\beta_0, \beta_1) = \sum_{i = 1}^{n}(y_i - (\beta_0 + \beta_1 x_i))^2.
\]
We called these values $\hat{\beta}_0$ and $\hat{\beta}_1$, which we found to be
\[
\begin{aligned}
\hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^{n}(x_i - \bar{x})^2}\\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x}.
\end{aligned}
\]
We also estimated $\sigma ^2$ using $s_e^2$. In other words, we found that $s_e$ is an estimate of $\sigma$, where;
\[
s_e = \text{RSE} = \sqrt{\frac{1}{n - 2}\sum_{i = 1}^n e_i^2}
\]
which we also called $\text{RSE}$, for "Residual Standard Error."
When applied to the `cars` data, we obtained the following results:
```{r}
stop_dist_model = lm(dist ~ speed, data = cars)
summary(stop_dist_model)
```
Last chapter, we only discussed the `Estimate`, `Residual standard error`, and `Multiple R-squared` values. In this chapter, we will discuss all of the information under `Coefficients` as well as `F-statistic`.
```{r}
plot(dist ~ speed, data = cars,
xlab = "Speed (in Miles Per Hour)",
ylab = "Stopping Distance (in Feet)",
main = "Stopping Distance vs Speed",
pch = 20,
cex = 2,
col = "grey")
abline(stop_dist_model, lwd = 5, col = "darkorange")
```
To get started, we'll note that there is another equivalent expression for $S_{xy}$ which we did not see last chapter,
\[
S_{xy}= \sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i = 1}^{n}(x_i - \bar{x}) y_i.
\]
This may be a surprising equivalence. (Maybe try to prove it.) However, it will be useful for illustrating concepts in this chapter.
Note that $\hat{\beta}_1$ is a **sample statistic** when calculated with observed data as written above, as is $\hat{\beta}_0$.
However, in this chapter it will often be convenient to use both $\hat{\beta}_1$ and $\hat{\beta}_0$ as **random variables**, that is, we have not yet observed the values for each $Y_i$. When this is the case, we will use a slightly different notation, substituting in capital $Y_i$ for lower case $y_i$.
\[
\begin{aligned}
\hat{\beta}_1 &= \frac{\sum_{i = 1}^{n}(x_i - \bar{x}) Y_i}{\sum_{i = 1}^{n}(x_i - \bar{x})^2} \\
\hat{\beta}_0 &= \bar{Y} - \hat{\beta}_1 \bar{x}
\end{aligned}
\]
Last chapter we argued that these estimates of unknown model parameters $\beta_0$ and $\beta_1$ were good because we obtained them by minimizing errors. We will now discuss the Gauss–Markov theorem, which takes this idea further, showing that these estimates are actually the "best" estimates, from a certain point of view.
## Gauss–Markov Theorem
The **Gauss–Markov theorem** tells us that when estimating the parameters of the simple linear regression model $\beta_0$ and $\beta_1$, the $\hat{\beta}_0$ and $\hat{\beta}_1$ which we derived are the **best linear unbiased estimates**, or *BLUE* for short. (The actual conditions for the Gauss–Markov theorem are more relaxed than the SLR model.)
We will now discuss *linear*, *unbiased*, and *best* as they relate to these estimates.
#### Linear {-}
Recall, in the SLR setup that the $x_i$ values are considered fixed and known quantities. Then a **linear** estimate is one which can be written as a linear combination of the $Y_i$. In the case of $\hat{\beta}_1$ we see
\[
\hat{\beta}_1 = \frac{\sum_{i = 1}^{n}(x_i - \bar{x}) Y_i}{\sum_{i = 1}^{n}(x_i - \bar{x})^2} = \sum_{i = 1}^n k_i Y_i = k_1 Y_1 + k_2 Y_2 + \cdots k_n Y_n
\]
where $k_i = \displaystyle\frac{(x_i - \bar{x})}{\sum_{i = 1}^{n}(x_i - \bar{x})^2}$.
In a similar fashion, we could show that $\hat{\beta}_0$ can be written as a linear combination of the $Y_i$. Thus both $\hat{\beta}_0$ and $\hat{\beta}_1$ are linear estimators.
#### Unbiased {-}
Now that we know our estimates are *linear*, how good are these estimates? One measure of the "goodness" of an estimate is its **bias**. Specifically, we prefer estimates that are **unbiased**, meaning their expected value is the parameter being estimated.
In the case of the regression estimates, we have,
\[
\begin{aligned}
\text{E}[\hat{\beta}_0] &= \beta_0 \\
\text{E}[\hat{\beta}_1] &= \beta_1.
\end{aligned}
\]
This tells us that, when the conditions of the SLR model are met, on average our estimates will be correct. However, as we saw last chapter when simulating from the SLR model, that does not mean that each individual estimate will be correct. Only that, if we repeated the process an infinite number of times, on average the estimate would be correct.
#### Best {-}
Now, if we restrict ourselves to both *linear* and *unbiased* estimates, how do we define the *best* estimate? The estimate with the **minimum variance**.
First note that it is very easy to create an estimate for $\beta_1$ that has very low variance, but is not unbiased. For example, define:
\[
\hat{\theta}_{BAD} = 5.
\]
Then, since $\hat{\theta}_{BAD}$ is a constant value,
\[
\text{Var}[\hat{\theta}_{BAD}] = 0.
\]
However since,
\[
\text{E}[\hat{\theta}_{BAD}] = 5
\]
we say that $\hat{\theta}_{BAD}$ is a biased estimator unless $\beta_1 = 5$, which we would not know ahead of time. For this reason, it is a terrible estimate (unless by chance $\beta_1 = 5$) even though it has the smallest possible variance. This is part of the reason we restrict ourselves to *unbiased* estimates. What good is an estimate, if it estimates the wrong quantity?
So now, the natural question is, what are the variances of $\hat{\beta}_0$ and $\hat{\beta}_1$? They are,
\[
\begin{aligned}
\text{Var}[\hat{\beta}_0] &= \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \\
\text{Var}[\hat{\beta}_1] &= \frac{\sigma^2}{S_{xx}}.
\end{aligned}
\]
These quantify the variability of the estimates due to random chance during sampling. Are these "the best"? Are these variances as small as we can possibly get? You'll just have to take our word for it that they are because showing that this is true is beyond the scope of this course.
## Sampling Distributions
Now that we have "redefined" the estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$ as random variables, we can discuss their **sampling distribution**, which is the distribution when a statistic is considered a random variable.
Since both $\hat{\beta}_0$ and $\hat{\beta}_1$ are a linear combination of the $Y_i$ and each $Y_i$ is normally distributed, then both $\hat{\beta}_0$ and $\hat{\beta}_1$ also follow a normal distribution.
Then, putting all of the above together, we arrive at the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$.
For $\hat{\beta}_1$ we say,
\[
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}
= \frac{\sum_{i = 1}^{n}(x_i - \bar{x}) Y_i}{\sum_{i = 1}^{n}(x_i - \bar{x})^2}
\sim N\left( \beta_1, \ \frac{\sigma^2}{\sum_{i = 1}^{n}(x_i - \bar{x})^2} \right).
\]
Or more succinctly,
\[
\hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right).
\]
And for $\hat{\beta}_0$,
\[
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}
\sim N\left( \beta_0, \ \frac{\sigma^2 \sum_{i = 1}^{n}x_i^2}{n \sum_{i = 1}^{n}(x_i - \bar{x})^2} \right).
\]
Or more succinctly,
\[
\hat{\beta}_0 \sim N\left( \beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \right)
\]
At this point we have neglected to prove a number of these results. Instead of working through the tedious derivations of these sampling distributions, we will instead justify these results to ourselves using simulation.
A note to current readers: These derivations and proofs may be added to an appendix at a later time. You can also find these results in nearly any standard linear regression textbook. At UIUC, these results will likely be presented in both STAT 424 and STAT 425. However, since you will not be asked to perform derivations of this type in this course, they are for now omitted.
### Simulating Sampling Distributions
To verify the above results, we will simulate samples of size $n = 100$ from the model
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\]
where $\epsilon_i \sim N(0, \sigma^2).$ In this case, the parameters are known to be:
- $\beta_0 = 3$
- $\beta_1 = 6$
- $\sigma^2 = 4$
Then, based on the above, we should find that
\[
\hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right)
\]
and
\[
\hat{\beta}_0 \sim N\left( \beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \right).
\]
First we need to decide ahead of time what our $x$ values will be for this simulation, since the $x$ values in SLR are also considered known quantities. The choice of $x$ values is arbitrary. Here we also set a seed for randomization, and calculate $S_{xx}$ which we will need going forward.
```{r}
set.seed(42)
sample_size = 100 # this is n
x = seq(-1, 1, length = sample_size)
Sxx = sum((x - mean(x)) ^ 2)
```
We also fix our parameter values.
```{r}
beta_0 = 3
beta_1 = 6
sigma = 2
```
With this information, we know the sampling distributions should be:
```{r}
(var_beta_1_hat = sigma ^ 2 / Sxx)
(var_beta_0_hat = sigma ^ 2 * (1 / sample_size + mean(x) ^ 2 / Sxx))
```
\[
\hat{\beta}_1 \sim N( `r beta_1`, `r var_beta_1_hat`)
\]
and
\[
\hat{\beta}_0 \sim N( `r beta_0`, `r var_beta_0_hat`).
\]
That is,
\[
\begin{aligned}
\text{E}[\hat{\beta}_1] &= `r beta_1` \\
\text{Var}[\hat{\beta}_1] &= `r var_beta_1_hat`
\end{aligned}
\]
and
\[
\begin{aligned}
\text{E}[\hat{\beta}_0] &= `r beta_0` \\
\text{Var}[\hat{\beta}_0] &= `r var_beta_0_hat`.
\end{aligned}
\]
We now simulate data from this model 10,000 times. Note this may not be the most `R` way of doing the simulation. We perform the simulation in this manner in an attempt at clarity. For example, we could have used the `sim_slr()` function from the previous chapter. We also simply store variables in the global environment instead of creating a data frame for each new simulated dataset.
```{r}
num_samples = 10000
beta_0_hats = rep(0, num_samples)
beta_1_hats = rep(0, num_samples)
for (i in 1:num_samples) {
eps = rnorm(sample_size, mean = 0, sd = sigma)
y = beta_0 + beta_1 * x + eps
sim_model = lm(y ~ x)
beta_0_hats[i] = coef(sim_model)[1]
beta_1_hats[i] = coef(sim_model)[2]
}
```
Each time we simulated the data, we obtained values of the estimated coefficients. The variables `beta_0_hats` and `beta_1_hats` now store 10,000 simulated values of $\hat{\beta}_0$ and $\hat{\beta}_1$ respectively.
We first verify the distribution of $\hat{\beta}_1$.
```{r}
mean(beta_1_hats) # empirical mean
beta_1 # true mean
var(beta_1_hats) # empirical variance
var_beta_1_hat # true variance
```
We see that the empirical and true means and variances are *very* similar. We also verify that the empirical distribution is normal. To do so, we plot a histogram of the `beta_1_hats`, and add the curve for the true distribution of $\hat{\beta}_1$. We use `prob = TRUE` to put the histogram on the same scale as the normal curve.
```{r}
# note need to use prob = TRUE
hist(beta_1_hats, prob = TRUE, breaks = 20,
xlab = expression(hat(beta)[1]), main = "", border = "dodgerblue")
curve(dnorm(x, mean = beta_1, sd = sqrt(var_beta_1_hat)),
col = "darkorange", add = TRUE, lwd = 3)
```
We then repeat the process for $\hat{\beta}_0$.
```{r}
mean(beta_0_hats) # empirical mean
beta_0 # true mean
var(beta_0_hats) # empirical variance
var_beta_0_hat # true variance
```
```{r}
hist(beta_0_hats, prob = TRUE, breaks = 25,
xlab = expression(hat(beta)[0]), main = "", border = "dodgerblue")
curve(dnorm(x, mean = beta_0, sd = sqrt(var_beta_0_hat)),
col = "darkorange", add = TRUE, lwd = 3)
```
In this simulation study, we have only simulated a finite number of samples. To truly verify the distributional results, we would need to observe an infinite number of samples. However, the following plot should make it clear that if we continued simulating, the empirical results would get closer and closer to what we should expect.
```{r}
par(mar = c(5, 5, 1, 1)) # adjusted plot margins, otherwise the "hat" does not display
plot(cumsum(beta_1_hats) / (1:length(beta_1_hats)), type = "l", ylim = c(5.95, 6.05),
xlab = "Number of Simulations",
ylab = expression("Empirical Mean of " ~ hat(beta)[1]),
col = "dodgerblue")
abline(h = 6, col = "darkorange", lwd = 2)
par(mar = c(5, 5, 1, 1)) # adjusted plot margins, otherwise the "hat" does not display
plot(cumsum(beta_0_hats) / (1:length(beta_0_hats)), type = "l", ylim = c(2.95, 3.05),
xlab = "Number of Simulations",
ylab = expression("Empirical Mean of " ~ hat(beta)[0]),
col = "dodgerblue")
abline(h = 3, col = "darkorange", lwd = 2)
```
## Standard Errors
So now we believe the two distributional results,
\[
\begin{aligned}
\hat{\beta}_0 &\sim N\left( \beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \right) \\
\hat{\beta}_1 &\sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right).
\end{aligned}
\]
Then by standardizing these results we find that
\[
\frac{\hat{\beta}_0 - \beta_0}{\text{SD}[\hat{\beta}_0]} \sim N(0, 1)
\]
and
\[
\frac{\hat{\beta}_1 - \beta_1}{\text{SD}[\hat{\beta}_1]} \sim N(0, 1)
\]
where
\[
\text{SD}[\hat{\beta}_0] = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}
\]
and
\[
\text{SD}[\hat{\beta}_1] = \frac{\sigma}{\sqrt{S_{xx}}}.
\]
Since we don't know $\sigma$ in practice, we will have to estimate it using $s_e$, which we plug into our existing expression for the standard deviations of our estimates.
These two new expressions are called **standard errors** which are the *estimated* standard deviations of the sampling distributions.
\[
\text{SE}[\hat{\beta}_0] = s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}
\]
\[
\text{SE}[\hat{\beta}_1] = \frac{s_e}{\sqrt{S_{xx}}}
\]
Now if we divide by the standard error, instead of the standard deviation, we obtain the following results which will allow us to make confidence intervals and perform hypothesis testing.
\[
\frac{\hat{\beta}_0 - \beta_0}{\text{SE}[\hat{\beta}_0]} \sim t_{n-2}
\]
\[
\frac{\hat{\beta}_1 - \beta_1}{\text{SE}[\hat{\beta}_1]} \sim t_{n-2}
\]
To see this, first note that,
\[
\frac{\text{RSS}}{\sigma^2} = \frac{(n-2)s_e^2}{\sigma^2} \sim \chi_{n-2}^2.
\]
Also recall that a random variable $T$ defined as,
\[
T = \frac{Z}{\sqrt{\frac{\chi_{d}^2}{d}}}
\]
follows a $t$ distribution with $d$ degrees of freedom, where $\chi_{d}^2$ is a $\chi^2$ random variable with $d$ degrees of freedom.
We write,
\[
T \sim t_d
\]
to say that the random variable $T$ follows a $t$ distribution with $d$ degrees of freedom.
Then we use the classic trick of "multiply by 1" and some rearranging to arrive at
\[
\begin{aligned}
\frac{\hat{\beta}_1 - \beta_1}{\text{SE}[\hat{\beta}_1]}
&= \frac{\hat{\beta}_1 - \beta_1}{s_e / \sqrt{S_{xx}}} \\
&= \frac{\hat{\beta}_1 - \beta_1}{s_e / \sqrt{S_{xx}}} \cdot \frac{\sigma / \sqrt{S_{xx}}}{\sigma / \sqrt{S_{xx}}} \\
&= \frac{\hat{\beta}_1 - \beta_1}{\sigma / \sqrt{S_{xx}}} \cdot \frac{\sigma / \sqrt{S_{xx}}}{s_e / \sqrt{S_{xx}}} \\
&= \frac{\hat{\beta}_1 - \beta_1}{\sigma / \sqrt{S_{xx}}} \bigg/ \sqrt{\frac{s_e^2}{\sigma^2}} \\
&= \frac{\hat{\beta}_1 - \beta_1}{\text{SD}[\hat{\beta}_1]} \bigg/ \sqrt{\frac{\frac{(n - 2)s_e^2}{\sigma^2}}{n - 2}}
\sim \frac{Z}{\sqrt{\frac{\chi_{n-2}^2}{n-2}}}
\sim t_{n-2}
\end{aligned}
\]
where $Z \sim N(0,1)$.
Recall that a $t$ distribution is similar to a standard normal, but with heavier tails. As the degrees of freedom increases, the $t$ distribution becomes more and more like a standard normal. Below we plot a standard normal distribution as well as two examples of a $t$ distribution with different degrees of freedom. Notice how the $t$ distribution with the larger degrees of freedom is more similar to the standard normal curve.
```{r}
# define grid of x values
x = seq(-4, 4, length = 100)
# plot curve for standard normal
plot(x, dnorm(x), type = "l", lty = 1, lwd = 2,
xlab = "x", ylab = "Density", main = "Normal vs t Distributions")
# add curves for t distributions
lines(x, dt(x, df = 1), lty = 3, lwd = 2, col = "darkorange")
lines(x, dt(x, df = 10), lty = 2, lwd = 2, col = "dodgerblue")
# add legend
legend("topright", title = "Distributions",
legend = c("t, df = 1", "t, df = 10", "Standard Normal"),
lwd = 2, lty = c(3, 2, 1), col = c("darkorange", "dodgerblue", "black"))
```
## Confidence Intervals for Slope and Intercept
Recall that confidence intervals for means often take the form:
\[
\text{EST} \pm \text{CRIT} \cdot \text{SE}
\]
or
\[
\text{EST} \pm \text{MARGIN}
\]
where $\text{EST}$ is an estimate for the parameter of interest, $\text{SE}$ is the standard error of the estimate, and $\text{MARGIN} = \text{CRIT} \cdot \text{SE}$.
Then, for $\beta_0$ and $\beta_1$ we can create confidence intervals using
\[
\hat{\beta}_0 \pm t_{\alpha/2, n - 2} \cdot \text{SE}[\hat{\beta}_0] \quad \quad \quad \hat{\beta}_0 \pm t_{\alpha/2, n - 2} \cdot s_e\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}}
\]
and
\[
\hat{\beta}_1 \pm t_{\alpha/2, n - 2} \cdot \text{SE}[\hat{\beta}_1] \quad \quad \quad \hat{\beta}_1 \pm t_{\alpha/2, n - 2} \cdot \frac{s_e}{\sqrt{S_{xx}}}
\]
where $t_{\alpha/2, n - 2}$ is the critical value such that $P(t_{n-2} > t_{\alpha/2, n - 2}) = \alpha/2$.
## Hypothesis Tests
> "We may speak of this hypothesis as the '[null hypothesis](https://xkcd.com/892/){target="_blank"},' and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation."
>
> --- **Ronald Aylmer Fisher**
Recall that a test statistic ($\text{TS}$) for testing means often takes the form:
\[
\text{TS} = \frac{\text{EST} - \text{HYP}}{\text{SE}}
\]
where $\text{EST}$ is an estimate for the parameter of interest, $\text{HYP}$ is a hypothesized value of the parameter, and $\text{SE}$ is the standard error of the estimate.
So, to test
\[
H_0: \beta_0 = \beta_{00} \quad \text{vs} \quad H_1: \beta_0 \neq \beta_{00}
\]
we use the test statistic
\[
t = \frac{\hat{\beta}_0 - \beta_{00}}{\text{SE}[\hat{\beta}_0]} = \frac{\hat{\beta}_0-\beta_{00}}{s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}
\]
which, under the null hypothesis, follows a $t$ distribution with $n - 2$ degrees of freedom. We use $\beta_{00}$ to denote the hypothesized value of $\beta_0$.
Similarly, to test
\[
H_0: \beta_1 = \beta_{10} \quad \text{vs} \quad H_1: \beta_1 \neq \beta_{10}
\]
we use the test statistic
\[
t = \frac{\hat{\beta}_1-\beta_{10}}{\text{SE}[\hat{\beta}_1]} = \frac{\hat{\beta}_1-\beta_{10}}{s_e / \sqrt{S_{xx}}}
\]
which again, under the null hypothesis, follows a $t$ distribution with $n - 2$ degrees of freedom. We now use $\beta_{10}$ to denote the hypothesized value of $\beta_1$.
## `cars` Example
We now return to the `cars` example from last chapter to illustrate these concepts. We first fit the model using `lm()` then use `summary()` to view the results in greater detail.
```{r}
stop_dist_model = lm(dist ~ speed, data = cars)
summary(stop_dist_model)
```
### Tests in `R`
We will now discuss the results displayed called `Coefficients`. First recall that we can extract this information directly.
```{r}
names(summary(stop_dist_model))
summary(stop_dist_model)$coefficients
```
The `names()` function tells us what information is available, and then we use the `$` operator and `coefficients` to extract the information we are interested in. Two values here should be immediately familiar.
\[
\hat{\beta}_0 = `r summary(stop_dist_model)$coefficients[1,1]`
\]
and
\[
\hat{\beta}_1 = `r summary(stop_dist_model)$coefficients[2,1]`
\]
which are our estimates for the model parameters $\beta_0$ and $\beta_1$.
Let's now focus on the second row of output, which is relevant to $\beta_1$.
```{r}
summary(stop_dist_model)$coefficients[2,]
```
Again, the first value, `Estimate` is
\[
\hat{\beta}_1 = `r summary(stop_dist_model)$coefficients[2,1]`.
\]
The second value, `Std. Error`, is the standard error of $\hat{\beta}_1$,
\[
\text{SE}[\hat{\beta}_1] = \frac{s_e}{\sqrt{S_{xx}}} = `r summary(stop_dist_model)$coefficients[2,2]`.
\]
The third value, `t value`, is the value of the test statistic for testing $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$,
\[
t = \frac{\hat{\beta}_1-0}{\text{SE}[\hat{\beta}_1]} = \frac{\hat{\beta}_1-0}{s_e / \sqrt{S_{xx}}} = `r summary(stop_dist_model)$coefficients[2,3]`.
\]
Lastly, `Pr(>|t|)`, gives us the p-value of that test.
\[
\text{p-value} = `r summary(stop_dist_model)$coefficients[2,4]`
\]
Note here, we are specifically testing whether or not $\beta_1 = 0$.
The first row of output reports the same values, but for $\beta_0$.
```{r}
summary(stop_dist_model)$coefficients[1,]
```
In summary, the following code stores the information of `summary(stop_dist_model)$coefficients` in a new variable `stop_dist_model_test_info`, then extracts each element into a new variable which describes the information it contains.
```{r}
stop_dist_model_test_info = summary(stop_dist_model)$coefficients
beta_0_hat = stop_dist_model_test_info[1, 1] # Estimate
beta_0_hat_se = stop_dist_model_test_info[1, 2] # Std. Error
beta_0_hat_t = stop_dist_model_test_info[1, 3] # t value
beta_0_hat_pval = stop_dist_model_test_info[1, 4] # Pr(>|t|)
beta_1_hat = stop_dist_model_test_info[2, 1] # Estimate
beta_1_hat_se = stop_dist_model_test_info[2, 2] # Std. Error
beta_1_hat_t = stop_dist_model_test_info[2, 3] # t value
beta_1_hat_pval = stop_dist_model_test_info[2, 4] # Pr(>|t|)
```
We can then verify some equivalent expressions: the $t$ test statistic for $\hat{\beta}_1$ and the two-sided p-value associated with that test statistic.
```{r}
(beta_1_hat - 0) / beta_1_hat_se
beta_1_hat_t
```
```{r}
2 * pt(abs(beta_1_hat_t), df = length(resid(stop_dist_model)) - 2, lower.tail = FALSE)
beta_1_hat_pval
```
### Significance of Regression, t-Test
We pause to discuss the **significance of regression** test. First, note that based on the above distributional results, we could test $\beta_0$ and $\beta_1$ against any particular value, and perform both one and two-sided tests.
However, one very specific test,
\[
H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0
\]
is used most often. Let's think about this test in terms of the simple linear regression model,
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i.
\]
If we assume the null hypothesis is true, then $\beta_1 = 0$ and we have the model,
\[
Y_i = \beta_0 + \epsilon_i.
\]
In this model, the response does **not** depend on the predictor. So then we could think of this test in the following way,
- Under $H_0$ there is not a significant linear relationship between $x$ and $y$.
- Under $H_1$ there is a significant **linear** relationship between $x$ and $y$.
For the `cars` example,
- Under $H_0$ there is not a significant linear relationship between speed and stopping distance.
- Under $H_1$ there is a significant **linear** relationship between speed and stopping distance.
Again, that test is seen in the output from `summary()`,
\[
\text{p-value} = `r summary(stop_dist_model)$coefficients[2,4]`.
\]
With this extremely low p-value, we would reject the null hypothesis at any reasonable $\alpha$ level, say for example $\alpha = 0.01$. So we say there is a significant **linear** relationship between speed and stopping distance. Notice that we emphasize **linear**.
```{r, echo = FALSE}
set.seed(42)
x = seq(-1, 1, 0.01)
y = 5 + 4 * x ^ 2 + rnorm(length(x), 0, 0.5)
plot(x, y, pch = 20, cex = 2, col = "grey")
abline(lm(y ~ x), lwd = 3, col = "darkorange")
```
In this plot of simulated data, we see a clear relationship between $x$ and $y$, however it is not a linear relationship. If we fit a line to this data, it is very flat. The resulting test for $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$ gives a large p-value, in this case $`r summary(lm(y ~ x))$coef[2,4]`$, so we would fail to reject and say that there is no significant linear relationship between $x$ and $y$. We will see later how to fit a curve to this data using a "linear" model, but for now, realize that testing $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$ can only detect straight line relationships.
### Confidence Intervals in `R`
Using `R` we can very easily obtain the confidence intervals for $\beta_0$ and $\beta_1$.
```{r}
confint(stop_dist_model, level = 0.99)
```
This automatically calculates 99% confidence intervals for both $\beta_0$ and $\beta_1$, the first row for $\beta_0$, the second row for $\beta_1$.
For the `cars` example when interpreting these intervals, we say, we are 99% confident that for an increase in speed of 1 mile per hour, the average increase in stopping distance is between `r confint(stop_dist_model, level = 0.99)[2,1]` and `r confint(stop_dist_model, level = 0.99)[2,2]` feet, which is the interval for $\beta_1$.
Note that this 99% confidence interval does **not** contain the hypothesized value of 0. Since it does not contain 0, it is equivalent to rejecting the test of $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$ at $\alpha = 0.01$, which we had seen previously.
You should be somewhat suspicious of the confidence interval for $\beta_0$, as it covers negative values, which correspond to negative stopping distances. Technically the interpretation would be that we are 99% confident that the average stopping distance of a car traveling 0 miles per hour is between `r confint(stop_dist_model, level = 0.99)[1,1]` and `r confint(stop_dist_model, level = 0.99)[1,2]` feet, but we don't really believe that, since we are actually certain that it would be non-negative.
Note, we can extract specific values from this output a number of ways. This code is not run, and instead, you should check how it relates to the output of the code above.
```{r, eval = FALSE}
confint(stop_dist_model, level = 0.99)[1,]
confint(stop_dist_model, level = 0.99)[1, 1]
confint(stop_dist_model, level = 0.99)[1, 2]
confint(stop_dist_model, parm = "(Intercept)", level = 0.99)
confint(stop_dist_model, level = 0.99)[2,]
confint(stop_dist_model, level = 0.99)[2, 1]
confint(stop_dist_model, level = 0.99)[2, 2]
confint(stop_dist_model, parm = "speed", level = 0.99)
```
We can also verify that calculations that `R` is performing for the $\beta_1$ interval.
```{r}
# store estimate
beta_1_hat = coef(stop_dist_model)[2]
# store standard error
beta_1_hat_se = summary(stop_dist_model)$coefficients[2, 2]
# calculate critical value for two-sided 99% CI
crit = qt(0.995, df = length(resid(stop_dist_model)) - 2)
# est - margin, est + margin
c(beta_1_hat - crit * beta_1_hat_se, beta_1_hat + crit * beta_1_hat_se)
```
## Confidence Interval for Mean Response
In addition to confidence intervals for $\beta_0$ and $\beta_1$, there are two other common interval estimates used with regression. The first is called a **confidence interval for the mean response**. Often, we would like an interval estimate for the mean, $E[Y \mid X = x]$ for a particular value of $x$.
In this situation we use $\hat{y}(x)$ as our estimate of $E[Y \mid X = x]$. We modify our notation slightly to make it clear that the predicted value is a function of the $x$ value.
\[
\hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x
\]
Recall that,
\[
\text{E}[Y \mid X = x] = \beta_0 + \beta_1 x.
\]
Thus, $\hat{y}(x)$ is a good estimate since it is unbiased:
\[
\text{E}[\hat{y}(x)] = \beta_0 + \beta_1 x.
\]
We could then derive,
\[
\text{Var}[\hat{y}(x)] = \sigma^2 \left(\frac{1}{n}+\frac{(x-\bar{x})^2}{S_{xx}}\right).
\]
Like the other estimates we have seen, $\hat{y}(x)$ also follows a normal distribution. Since $\hat{\beta}_0$ and $\hat{\beta}_1$ are linear combinations of normal random variables, $\hat{y}(x)$ is as well.
\[
\hat{y}(x) \sim N \left(\beta_0 + \beta_1 x, \sigma^2 \left(\frac{1}{n}+\frac{(x-\bar{x})^2}{S_{xx}}\right) \right)
\]
And lastly, since we need to estimate this variance, we arrive at the standard error of our estimate,
\[
\text{SE}[\hat{y}(x)] = s_e \sqrt{\frac{1}{n}+\frac{(x-\bar{x})^2}{S_{xx}}}.
\]
We can then use this to find the confidence interval for the mean response,
\[
\hat{y}(x) \pm t_{\alpha/2, n - 2} \cdot s_e\sqrt{\frac{1}{n}+\frac{(x-\bar{x})^2}{S_{xx}}}
\]
To find confidence intervals for the mean response using `R`, we use the `predict()` function. We give the function our fitted model as well as new data, stored as a data frame. (This is important, so that `R` knows the name of the predictor variable.) Here, we are finding the confidence interval for the mean stopping distance when a car is travelling 5 miles per hour and when a car is travelling 21 miles per hour.
```{r}
new_speeds = data.frame(speed = c(5, 21))
predict(stop_dist_model, newdata = new_speeds,
interval = c("confidence"), level = 0.99)
```
## Prediction Interval for New Observations
Sometimes we would like an interval estimate for a new observation, $Y$, for a particular value of $x$. This is very similar to an interval for the mean response, $\text{E}[Y \mid X = x]$, but different in one very important way.
Our best guess for a new observation is still $\hat{y}(x)$. The estimated mean is still the best prediction we can make. The difference is in the amount of variability. We know that observations will vary about the true regression line according to a $N(0, \sigma^2)$ distribution. Because of this we add an extra factor of $\sigma^2$ to our estimate's variability in order to account for the variability of observations about the regression line.
\[
\begin{aligned}
\text{Var}[\hat{y}(x) + \epsilon] &= \text{Var}[\hat{y}(x)] + \text{Var}[\epsilon] \\[2ex]
&= \sigma^2 \left(\frac{1}{n}+\frac{(x-\bar{x})^2}{S_{xx}}\right) + \sigma^2 \\[2ex]
&= \sigma^2 \left(1 + \frac{1}{n}+\frac{(x-\bar{x})^2}{S_{xx}}\right)
\end{aligned}
\]
\[
\hat{y}(x) + \epsilon \sim N \left(\beta_0 + \beta_1 x, \ \sigma^2 \left(1 + \frac{1}{n}+\frac{(x-\bar{x})^2}{S_{xx}}\right) \right)
\]
\[
\text{SE}[\hat{y}(x) + \epsilon] = s_e \sqrt{1 + \frac{1}{n}+\frac{(x-\bar{x})^2}{S_{xx}}}
\]
We can then find a **prediction interval** using,
\[
\hat{y}(x) \pm t_{\alpha/2, n - 2} \cdot s_e\sqrt{1 + \frac{1}{n}+\frac{(x-\bar{x})^2}{S_{xx}}}.
\]
To calculate this for a set of points in `R` notice there is only a minor change in syntax from finding a confidence interval for the mean response.
```{r}
predict(stop_dist_model, newdata = new_speeds,
interval = c("prediction"), level = 0.99)
```
Also notice that these two intervals are wider than the corresponding confidence intervals for the mean response.
## Confidence and Prediction Bands
Often we will like to plot both confidence intervals for the mean response and prediction intervals for all possible values of $x$. We calls these confidence and prediction bands.
```{r, fig.height = 8, fig.width = 10, message = TRUE}
speed_grid = seq(min(cars$speed), max(cars$speed), by = 0.01)
dist_ci_band = predict(stop_dist_model,
newdata = data.frame(speed = speed_grid),
interval = "confidence", level = 0.99)
dist_pi_band = predict(stop_dist_model,
newdata = data.frame(speed = speed_grid),
interval = "prediction", level = 0.99)
plot(dist ~ speed, data = cars,
xlab = "Speed (in Miles Per Hour)",
ylab = "Stopping Distance (in Feet)",
main = "Stopping Distance vs Speed",
pch = 20,
cex = 2,
col = "grey",
ylim = c(min(dist_pi_band), max(dist_pi_band)))
abline(stop_dist_model, lwd = 5, col = "darkorange")
lines(speed_grid, dist_ci_band[,"lwr"], col = "dodgerblue", lwd = 3, lty = 2)
lines(speed_grid, dist_ci_band[,"upr"], col = "dodgerblue", lwd = 3, lty = 2)
lines(speed_grid, dist_pi_band[,"lwr"], col = "dodgerblue", lwd = 3, lty = 3)
lines(speed_grid, dist_pi_band[,"upr"], col = "dodgerblue", lwd = 3, lty = 3)
points(mean(cars$speed), mean(cars$dist), pch = "+", cex = 3)
```
Some things to notice:
- We use the `ylim` argument to stretch the $y$-axis of the plot, since the bands extend further than the points.
- We add a point at the point $(\bar{x}, \bar{y})$.
- This is a point that the regression line will **always** pass through. (Think about why.)
- This is the point where both the confidence and prediction bands are the narrowest. Look at the standard errors of both to understand why.
- The prediction bands (dotted blue) are less curved than the confidence bands (dashed blue). This is a result of the extra factor of $\sigma^2$ added to the variance at any value of $x$.
## Significance of Regression, F-Test
In the case of simple linear regression, the $t$ test for the significance of the regression is equivalent to another test, the $F$ test for the significance of the regression. This equivalence will only be true for simple linear regression, and in the next chapter we will only use the $F$ test for the significance of the regression.
Recall from last chapter the decomposition of variance we saw before calculating $R^2$,
\[
\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2,
\]
or, in short,
\[
\text{SST} = \text{SSE} + \text{SSReg}.
\]
To develop the $F$ test, we will arrange this information in an **ANOVA** table,
| Source | Sum of Squares | Degrees of Freedom | Mean Square | $F$ |
|------------|-----------------|------------|-----------|-----------|
| Regression | $\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ | $1$ | $\text{SSReg} / 1$ | $\text{MSReg} / \text{MSE}$ |
| Error | $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | $n - 2$ | $\text{SSE} / (n - 2)$ | |
| Total | $\sum_{i=1}^{n}(y_i - \bar{y})^2$ | $n - 1$ | | |
ANOVA, or Analysis of Variance will be a concept we return to often in this course. For now, we will focus on the results of the table, which is the $F$ statistic,
\[
F = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 / 1}{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n - 2)} \sim F_{1, n - 2}
\]
which follows an $F$ distribution with degrees of freedom $1$ and $n - 2$ under the null hypothesis. An $F$ distribution is a continuous distribution which takes only positive values and has two parameters, which are the two degrees of freedom.
Recall, in the significance of the regression test, $Y$ does **not** depend on $x$ in the null hypothesis.
\[
H_0: \beta_1 = 0 \quad \quad Y_i = \beta_0 + \epsilon_i
\]
While in the alternative hypothesis $Y$ may depend on $x$.
\[
H_1: \beta_1 \neq 0 \quad \quad Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\]
We can use the $F$ statistic to perform this test.
\[
F = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 / 1}{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n - 2)}
\]
In particular, we will reject the null when the $F$ statistic is large, that is, when there is a low probability that the observations could have come from the null model by chance. We will let `R` calculate the p-value for us.
To perform the $F$ test in `R` you can look at the last row of the output from `summary()` called `F-statistic` which gives the value of the test statistic, the relevant degrees of freedom, as well as the p-value of the test.
```{r}
summary(stop_dist_model)
```
Additionally, you can use the `anova()` function to display the information in an ANOVA table.
```{r}
anova(stop_dist_model)
```
This also gives a p-value for the test. You should notice that the p-value from the $t$ test was the same. You might also notice that the value of the test statistic for the $t$ test, $`r beta_1_hat_t`$, can be squared to obtain the value of the $F$ statistic, $`r beta_1_hat_t ^ 2`$.
Note that there is another equivalent way to do this in `R`, which we will return to often to compare two models.
```{r}
anova(lm(dist ~ 1, data = cars), lm(dist ~ speed, data = cars))
```
The model statement `lm(dist ~ 1, data = cars)` applies the model $Y_i = \beta_0 + \epsilon_i$ to the cars data. Note that $\hat{y} = \bar{y}$ when $Y_i = \beta_0 + \epsilon_i$.
The model statement `lm(dist ~ speed, data = cars)` applies the model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.
We can then think of this usage of `anova()` as directly comparing the two models. (Notice we get the same p-value again.)
## `R` Markdown
The `R` Markdown file for this chapter can be found here:
- [`slr-inf.Rmd`](slr-inf.Rmd){target="_blank"}
The file was created using `R` version ``r paste0(version$major, "." ,version$minor)``.