# General Statistics {#GeneralStatistics}
Introduction {-#intro-GeneralStatistics}
------------
Any significant application of R includes statistics or models or
graphics. This chapter addresses the statistics. Some recipes simply
describe how to calculate a statistic, such as relative frequency. Most
recipes involve statistical tests or confidence intervals. The
statistical tests let you choose between two competing hypotheses; that
paradigm is described next. Confidence intervals reflect the likely
range of a population parameter and are calculated based on your data
sample.
### Null Hypotheses, Alternative Hypotheses, and p-Values {-}
Many of the statistical tests in this chapter use a time-tested paradigm
of statistical inference. In the paradigm, we have one or two data
samples. We also have two competing hypotheses, either of which could
reasonably be true.
One hypothesis, called the *null hypothesis*, is that *nothing happened*:
the mean was unchanged; the treatment had no effect; you got the
expected answer; the model did not improve; and so forth.
The other hypothesis, called the *alternative hypothesis*, is that
*something happened*: the mean rose; the treatment improved the
patients’ health; you got an unexpected answer; the model fit better;
and so forth.
We want to determine which hypothesis is more likely in light of the
data:
1. To begin, we assume that the null hypothesis is true.
2. We calculate a test statistic. It could be something simple, such as
the mean of the sample, or it could be quite complex. The critical
requirement is that we must know the statistic’s distribution. We
might know the distribution of the sample mean, for example, by
invoking the Central Limit Theorem.
3. From the statistic and its distribution we can calculate a *p*-value,
the probability of a test statistic value as extreme or more extreme
than the one we observed, while assuming that the null hypothesis
is true.
4. If the *p*-value is too small, we have strong evidence against the
null hypothesis. This is called *rejecting* the null hypothesis.
5. If the *p*-value is not small, then we have no such evidence. This is
   called *failing to reject* the null hypothesis. (A short sketch of
   these steps in R follows this list.)
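The sketch below is our own illustration, not part of any recipe: it performs a
one-sample test of a mean with simulated data and a hypothesized mean of 100.
The `t.test` function, covered later in this chapter, automates all of this:
``` {r, eval=FALSE}
set.seed(1)
x <- rnorm(30, mean = 101, sd = 2)   # a simulated sample

# Steps 1 and 2: assume the null hypothesis (mu = 100) and compute a
# statistic with a known distribution: under the null, this t statistic
# follows a t distribution with n - 1 degrees of freedom
t_stat <- (mean(x) - 100) / (sd(x) / sqrt(length(x)))

# Step 3: the two-sided p-value, the probability of a statistic at
# least this extreme if the null hypothesis were true
p_value <- 2 * pt(-abs(t_stat), df = length(x) - 1)

# Steps 4 and 5: compare against the significance level alpha = 0.05;
# if TRUE, we reject the null hypothesis
p_value < 0.05
```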
There is one necessary decision here: When is a *p*-value “too small”?
> **Note**
>
> In this book, we follow the common convention that we reject the null
> hypothesis when *p* < 0.05 and fail to reject it when *p* >
> 0.05. In statistical terminology, we chose a significance level of *α*
> = 0.05 to define the border between strong evidence and insufficient
> evidence against the null hypothesis.
But the real answer is, “it depends.” Your chosen significance level
depends on your problem domain. The conventional limit of *p* < 0.05
works for many problems. In our work, the data are especially noisy and so
we are often satisfied with *p* < 0.10. For someone working in
high-risk areas, *p* < 0.01 or *p* < 0.001 might be necessary.
In the recipes, we mention which tests include a *p*-value so that you
can compare the *p*-value against your chosen significance level of *α*.
We worded the recipes to help you interpret the comparison. Here is the
wording from Recipe \@ref(recipe-id125), [Testing Categorical Variables for Independence](#recipe-id125), a test
for the independence of two factors:
> Conventionally, a *p*-value of less than 0.05 indicates that the
> variables are likely not independent whereas a *p*-value exceeding
> 0.05 fails to provide any such evidence.
This is a compact way of saying:
- The null hypothesis is that the variables are independent.
- The alternative hypothesis is that the variables are
not independent.
- For *α* = 0.05, if *p* < 0.05 then we reject the null hypothesis,
giving strong evidence that the variables are not independent; if
*p* > 0.05, we fail to reject the null hypothesis.
- You are free to choose your own *α*, of course, in which case your
decision to reject or fail to reject might be different.
Remember, the recipe states the *informal interpretation* of the test
results, not the rigorous mathematical interpretation. We use colloquial
language in the hope that it will guide you toward a practical
understanding and application of the test. If the precise semantics of
hypothesis testing is critical for your work, we urge you to consult the
reference cited under [See Also](#see_also_9) or one of the other fine
textbooks on mathematical statistics.
### Confidence Intervals {-}
Hypothesis testing is a well-understood mathematical procedure, but it
can be frustrating. First, the semantics is tricky. The test does not
reach a definite, useful conclusion. You might get strong evidence
against the null hypothesis, but that’s all you’ll get. Second, it does
not give you a number, only evidence.
If you want numbers then use confidence intervals, which bound the
estimate of a population parameter at a given level of confidence.
Recipes in this chapter can calculate confidence intervals for means,
medians, and proportions of a population.
For example, Recipe \@ref(recipe-id123), ["Forming a Confidence Interval for a Mean"](#recipe-id123),
calculates a 95% confidence interval for the population mean based on
sample data. The interval is 97.16 < *μ* < 103.98; in other words, we
can be 95% confident that the population’s mean, *μ*, lies between
97.16 and 103.98.
### See Also {-#see_also_9}
Statistical terminology and conventions can vary. This book generally
follows the conventions of *Mathematical Statistics with Applications*,
6th ed., by Dennis Wackerly et al. (Duxbury Press). We recommend this book also
for learning more about the statistical tests described in this chapter.
Summarizing Your Data {#recipe-id119}
---------------------
### Problem {-#problem-id119}
You want a basic statistical summary of your data.
### Solution {-#solution-id119}
The `summary` function gives some useful statistics for vectors,
matrices, factors, and data frames:
``` {r, echo=FALSE}
vec <- rlnorm(1000)
```
``` {r}
summary(vec)
```
### Discussion {-#discussion-id119}
The Solution exhibits the summary of a vector. The `1st Qu.` and `3rd Qu.` are the first and third quartiles,
respectively. Having both the median and mean is useful because you can
quickly detect skew. The preceding Solution, for example, shows a mean that is
larger than the median; this indicates a possible skew to the right, as one would expect from a lognormal distribution.
The summary of a matrix works column by column. Here we see the summary
of a matrix, `mat`, with three columns named `Samp1`, `Samp2`, and `Samp3`:
``` {r, echo=FALSE}
set.seed(666)
mat <- matrix(c(1:100, rnorm(100), rlnorm(100)),
ncol = 3, byrow = FALSE,
dimnames = list(NULL, c("Samp1", "Samp2", "Samp3"))
)
```
```{r}
summary(mat)
```
The summary of a factor gives counts:
``` {r,echo=FALSE}
fac <- as.factor(sample(c("Yes", "No", "Maybe"), 100, replace = TRUE))
```
``` {r}
summary(fac)
```
The summary of a character vector is pretty useless, giving just the vector’s length and class:
```{r, echo=FALSE}
char <- sample(c("Yes", "No", "Maybe"), 100, replace = TRUE)
```
```{r}
summary(char)
```
The summary of a data frame incorporates all these features. It works
column by column, giving an appropriate summary according to the column
type. Numeric values receive a statistical summary and factors are
counted (character strings are not summarized):
``` {r, message=FALSE, warning=FALSE}
suburbs <- read_csv("./data/suburbs.txt")
summary(suburbs)
```
The “summary” of a list is pretty funky: just the data type of each list member.
Here is a `summary` of a list of vectors:
``` {r, echo=FALSE}
vec_list <- list(x = rnorm(100), y = rnorm(100), z = sample(letters, 100, replace = TRUE))
```
``` {r}
summary(vec_list)
```
To summarize the data inside a list of vectors, map `summary` to each list element:
``` {r}
library(purrr)
map(vec_list, summary)
```
Unfortunately, the `summary` function does not compute any measure of
variability, such as the standard deviation or the median absolute deviation.
This is a serious shortcoming, so we usually call `sd` or `mad` (median absolute deviation) right after calling `summary`.
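For example, with the vector `vec` from the Solution:
``` {r, eval=FALSE}
sd(vec)    # standard deviation
mad(vec)   # median absolute deviation
```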
### See Also {-#see_also-id119}
See Recipes \@ref(recipe-id101), ["Computing Basic Statistics"](#recipe-id101), and \@ref(recipe-id149), ["Applying a Function to Each List Element"](#recipe-id149).
Calculating Relative Frequencies {#recipe-id229}
--------------------------------
### Problem {-#problem-id229}
You want to count the relative frequency of certain observations in your
sample.
### Solution {-#solution-id229}
Identify the interesting observations by using a logical expression;
then use the `mean` function to calculate the fraction of observations
it identifies. For example, given a vector *x*, you can find the
relative frequency of values greater than 3 in this way:
``` {r, echo=FALSE}
x <- rlnorm(100)
```
``` {r}
mean(x > 3)
```
### Discussion {-#discussion-id229}
A logical expression, such as *x* > 3, produces a vector of logical
values (`TRUE` and `FALSE`), one for each element of *x*. The `mean`
function converts those values to 1s and 0s, respectively, and computes
the average. This gives the fraction of values that are `TRUE`—in other
words, the relative frequency of the interesting values. In the
Solution, for example, that’s the relative frequency of
values greater than 3.
The concept here is pretty simple. The tricky part is dreaming up a
suitable logical expression. Here are some examples:
`mean(lab == "NJ")`
: Fraction of `lab` values that are New Jersey
`mean(after > before)`
: Fraction of observations for which the effect increases
`mean(abs(x - mean(x)) > 2 * sd(x))`
: Fraction of observations that exceed two standard deviations from
the mean (see the sketch after this list)
`mean(diff(ts) > 0)`
: Fraction of observations in a time series that are larger than the
previous observation
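Here is a sketch of the two-standard-deviations example from the list above,
with simulated data of our own:
``` {r, eval=FALSE}
set.seed(7)
x <- rnorm(1000)

# Fraction of observations more than two standard deviations from the
# mean; for normally distributed data this is roughly 0.05
mean(abs(x - mean(x)) > 2 * sd(x))
```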
Tabulating Factors and Creating Contingency Tables {#recipe-id124}
--------------------------------------------------
### Problem {-#problem-id124}
You want to tabulate one factor or build a contingency table from
multiple factors.
### Solution {-#solution-id124}
The `table` function produces counts of one factor:
``` {r, echo=FALSE}
f1 <- factor(sample(letters[1:5], 100, replace = TRUE))
```
``` {r}
table(f1)
```
It can also produce contingency tables (cross-tabulations) from two or
more factors:
``` {r,echo=FALSE}
f2 <- factor(sample(letters[6:8], 100, replace = TRUE))
```
``` {r}
table(f1, f2)
```
`table` works for characters, too, not only factors:
``` {r}
t1 <- sample(letters[9:11], 100, replace = TRUE)
table(t1)
```
### Discussion {-#discussion-id124}
The `table` function counts the levels of one factor or characters, such as these
counts of `initial` and `outcome` (which are factors):
``` {r}
set.seed(42)
initial <- factor(sample(c("Yes", "No", "Maybe"), 100, replace = TRUE))
outcome <- factor(sample(c("Pass", "Fail"), 100, replace = TRUE))
table(initial)
table(outcome)
```
The greater power of `table` is in producing contingency tables, also
known as cross-tabulations. Each cell in a contingency table counts how
many times that row/column combination occurred:
``` {r}
table(initial, outcome)
```
This table shows that the combination of `initial = Yes` and
`outcome = Fail` occurred 13 times, the combination of `initial = Yes`
and `outcome = Pass` occurred 17 times, and so forth.
### See Also {-#see_also-id124}
The `xtabs` function can also produce a contingency table. It has a
formula interface, which some people prefer.
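For example, using the `initial` and `outcome` factors from this recipe, an
equivalent `xtabs` call would be:
``` {r, eval=FALSE}
xtabs(~ initial + outcome)
```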
Testing Categorical Variables for Independence {#recipe-id125}
----------------------------------------------
### Problem {-#problem-id125}
You have two categorical variables that are represented by factors. You
want to test them for independence using the chi-squared test.
### Solution {-#solution-id125}
Use the `table` function to produce a contingency table from the two
factors. Then use the `summary` function to perform a chi-squared test
of the contingency table. In this example we have two vectors of factor values, which we created in the prior recipe:
``` {r}
summary(table(initial, outcome))
```
The output includes a *p*-value. Conventionally, a *p*-value of less
than 0.05 indicates that the variables are likely not independent,
whereas a *p*-value exceeding 0.05 fails to provide any such evidence.
### Discussion {-#discussion-id125}
This example performs a chi-squared test on the contingency table from Recipe \@ref(recipe-id124),
["Tabulating Factors and Creating Contingency Tables"](#recipe-id124), and
yields a *p*-value of 0.2:
``` {r}
summary(table(initial, outcome))
```
The large *p*-value indicates that the two factors, `initial` and
`outcome`, are probably independent. Practically speaking, we
conclude there is no connection between the variables. This makes sense, as this example data was created by simply drawing random data using the `sample` function in the prior recipe.
### See Also {-#see_also-id125}
The `chisq.test` function can also perform this test.
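With the same factors, a sketch of the direct call; it reports a chi-squared
statistic and a *p*-value, much like the `summary` approach:
``` {r, eval=FALSE}
chisq.test(initial, outcome)          # builds the contingency table itself
chisq.test(table(initial, outcome))   # or pass a prebuilt table
```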
Calculating Quantiles (and Quartiles) of a Dataset {#recipe-id143}
--------------------------------------------------
### Problem {-#problem-id143}
Given a fraction *f*, you want to know the corresponding quantile of
your data. That is, you seek the observation *x* such that the fraction
of observations below *x* is *f*.
### Solution {-#solution-id143}
Use the `quantile` function. The second argument is the fraction, *f*:
``` {r, echo=FALSE}
vec <- rnorm(100)
```
``` {r}
quantile(vec, 0.95)
```
For quartiles, simply omit the second argument altogether:
``` {r}
quantile(vec)
```
### Discussion {-#discussion-id143}
Suppose `vec` contains 1,000 observations between 0 and 1. The
`quantile` function can tell you which observation delimits the lower 5%
of the data:
``` {r}
vec <- runif(1000)
quantile(vec, .05)
```
The `quantile` documentation refers to the second argument as a
“probability,” which is natural when we think of probability as meaning
relative frequency.
In true R style, the second argument can be a vector of probabilities;
in this case, `quantile` returns a vector of corresponding quantiles,
one for each probability:
``` {r}
quantile(vec, c(.05, .95))
```
That is a handy way to identify the middle 90% (in this case) of the
observations.
If you omit the probabilities altogether, then R assumes you want the
probabilities 0, 0.25, 0.50, 0.75, and 1.0—in other words, the
quartiles:
``` {r}
quantile(vec)
```
Amazingly, the `quantile` function implements nine (yes, nine) different
algorithms for computing quantiles. Study the help page before assuming
that the default algorithm is the best one for you.
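For instance, the `type` argument selects the algorithm, and different types
can return slightly different answers on the same data:
``` {r, eval=FALSE}
quantile(vec, 0.05)             # the default algorithm (type = 7)
quantile(vec, 0.05, type = 1)   # inverse of the empirical CDF
```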
Inverting a Quantile {#recipe-id142}
--------------------
### Problem {-#problem-id142}
Given an observation *x* from your data, you want to know its
corresponding quantile. That is, you want to know what fraction of the
data is less than *x*.
### Solution {-#solution-id142}
Assuming your data is in a vector `vec`, compare the data against the
observation and then use `mean` to compute the relative frequency of
values less than *x*; in this example, *x* = 1.6:
``` {r, echo=FALSE}
vec <- rnorm(1000)
```
``` {r}
mean(vec < 1.6)
```
### Discussion {-#discussion-id142}
The expression `vec < x` compares every element of `vec` against *x* and
returns a vector of logical values, where the *n*th logical value is
`TRUE` if `vec[n] < x`. The `mean` function converts those logical
values to `0` and `1`: `0` for `FALSE` and `1` for `TRUE`. The average of all
those `1`s and `0`s is the fraction of `vec` that is less than *x*, or the
inverse quantile of *x*.
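An alternative sketch uses the empirical cumulative distribution function.
Note that `ecdf` counts values less than *or equal to* *x*, which matches the
Solution whenever the data has no ties at *x*:
``` {r, eval=FALSE}
Fn <- ecdf(vec)   # the empirical CDF, returned as a function
Fn(1.6)           # fraction of vec at or below 1.6
```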
### See Also {-#see_also-id142}
This is an application of the general approach described in Recipe \@ref(recipe-id229),
["Calculating Relative Frequencies"](#recipe-id229).
Converting Data to z-Scores {#recipe-id133}
---------------------------
### Problem {-#problem-id133}
You have a dataset, and you want to calculate the corresponding
*z*-scores for all data elements. (This is sometimes called *normalizing*
the data.)
### Solution {-#solution-id133}
Use the `scale` function:
``` {r, echo=FALSE}
x <- rlnorm(10)
```
``` {r}
scale(x)
```
This works for vectors, matrices, and data frames. In the case of a
vector, `scale` returns the vector of normalized values. In the case of
matrices and data frames, `scale` normalizes each column independently
and returns columns of normalized values in a matrix.
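As a quick check of that claim, here is a sketch with a small simulated
matrix; after scaling, every column has mean 0 and standard deviation 1:
``` {r, eval=FALSE}
m <- matrix(rnorm(300, mean = 10, sd = 5), ncol = 3)
z <- scale(m)
round(colMeans(z), 12)   # essentially 0 for every column
apply(z, 2, sd)          # exactly 1 for every column
```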
### Discussion {-#discussion-id133}
You might also want to normalize a single value *y* relative to a
dataset *x*. You can do so by using vectorized operations as follows:
``` {r, echo=FALSE}
y <- .25
x <- rlnorm(100)
```
``` {r}
(y - mean(x)) / sd(x)
```
Testing the Mean of a Sample (t Test) {#recipe-id121}
---------------------------------------
### Problem {-#problem-id121}
You have a sample from a population. Given this sample, you want to know
if the mean of the population could reasonably be a particular value *m*.
### Solution {-#solution-id121}
Apply the `t.test` function to the sample *x* with the argument `mu = m`:
``` {r, echo=FALSE}
set.seed(42)
x <- rnorm(25)
m <- 0
```
``` {r, eval=FALSE}
t.test(x, mu = m)
```
The output includes a *p*-value. Conventionally, if *p* < 0.05 then
the population mean is unlikely to be *m*, whereas *p* > 0.05 provides
no such evidence.
If your sample size *n* is small, then the underlying population must be
normally distributed in order to derive meaningful results from the *t*
test. A good rule of thumb is that “small” means *n* < 30.
### Discussion {-#discussion-id121}
The *t* test is a workhorse of statistics, and this is one of its basic
uses: making inferences about a population mean from a sample. The
following example simulates sampling from a normal population with mean
*μ* = 100. It uses the *t* test to ask if the population mean could be
95, and `t.test` reports a *p*-value of 0.005:
``` {r}
x <- rnorm(75, mean = 100, sd = 15)
t.test(x, mu = 95)
```
The *p*-value is small and so it’s unlikely (based on the sample data)
that 95 could be the mean of the population.
Informally, we could interpret the low *p*-value as follows. If the
population mean were really 95, then the probability of observing our
test statistic (*t* = 2.8898 or something more extreme) would be only
0.005. That is very improbable, yet that is the value we observed.
Hence we conclude that the null hypothesis is wrong; therefore, the
sample data does not support the claim that the population mean is 95.
In sharp contrast, testing for a mean of 100 gives a *p*-value of
0.9:
``` {r}
t.test(x, mu = 100)
```
The large *p*-value indicates that the sample is consistent with
assuming a population mean *μ* of 100. In statistical terms, the data
does not provide evidence against the true mean being 100.
A common case is testing for a mean of zero. If you omit the `mu`
argument, it defaults to zero.
### See Also {-#see_also-id121}
The `t.test` function is a many-splendored thing. See Recipes \@ref(recipe-id123), ["Forming a Confidence Interval for a Mean"](#recipe-id123), and \@ref(recipe-id122),
["Comparing the Means of Two Samples"](#recipe-id122), for other uses.
Forming a Confidence Interval for a Mean {#recipe-id123}
----------------------------------------
### Problem {-#problem-id123}
You have a sample from a population. Given that sample, you want to
determine a confidence interval for the population’s mean.
### Solution {-#solution-id123}
Apply the `t.test` function to your sample *x*:
``` {r, eval=FALSE}
t.test(x)
```
The output includes a confidence interval at the 95% confidence level.
To see intervals at other levels, use the `conf.level` argument.
As in Recipe \@ref(recipe-id121), ["Testing the Mean of a Sample"](#recipe-id121), if your sample size *n* is small, then the
underlying population must be normally distributed for there to be a
meaningful confidence interval. Again, a good rule of thumb is that
“small” means *n* < 30.
### Discussion {-#discussion-id123}
Applying the `t.test` function to a vector yields a lot of output.
Buried in the output is a confidence interval:
``` {r, echo=FALSE}
set.seed(12)
x <- rnorm(50, mean = 100, sd = 15)
```
``` {r}
t.test(x)
```
In this example, the confidence interval is approximately 94.2 < *μ*
< 101.5, which is sometimes written simply as (94.2, 101.5).
We can raise the confidence level to 99% by setting `conf.level=0.99`:
``` {r}
t.test(x, conf.level = 0.99)
```
That change widens the confidence interval to 92.9 < *μ* <
102.8.
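If you want just the interval rather than the full report, you can extract it
from the object that `t.test` returns:
``` {r, eval=FALSE}
t.test(x, conf.level = 0.99)$conf.int
```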
Forming a Confidence Interval for a Median {#recipe-id135}
------------------------------------------
### Problem {-#problem-id135}
You have a data sample, and you want to know the confidence interval for
the median.
### Solution {-#solution-id135}
Use the `wilcox.test` function, setting `conf.int=TRUE`:
``` {r, echo=FALSE}
x <- rnorm(25)
```
``` {r, eval=FALSE}
wilcox.test(x, conf.int = TRUE)
```
The output will contain a confidence interval for the median.
### Discussion {-#discussion-id135}
The procedure for calculating the confidence interval of a mean is
well defined and widely known. The same is not true for the median,
unfortunately. There are several procedures for calculating the median’s
confidence interval. None of them is “the” procedure, but the Wilcoxon signed rank test is pretty standard.
The `wilcox.test` function implements that procedure. Buried in the
output is the 95% confidence interval, which is approximately (–0.102,
0.646) in this case:
``` {r}
wilcox.test(x, conf.int = TRUE)
```
You can change the confidence level by setting the `conf.level` argument, for
example `conf.level=0.99` for a 99% interval.
The output also includes something called the *pseudomedian*, which is
defined on the help page. Don’t assume it equals the median; they are
different:
``` {r}
median(x)
```
### See Also {-#see_also-id135}
The bootstrap procedure is also useful for estimating the median’s
confidence interval; see Recipes \@ref(recipe-id199), ["Generating a Random Sample"](#recipe-id199), and \@ref(recipe-id159),
["Bootstrapping a Statistic"](#recipe-id159).
Testing a Sample Proportion {#recipe-id225}
---------------------------
### Problem {-#problem-id225}
You have a sample of values from a population consisting of successes
and failures. You believe the true proportion of successes is *p*, and
you want to test that hypothesis using the sample data.
### Solution {-#solution-id225}
Use the `prop.test` function. Suppose the sample size is *n* and the
sample contains *x* successes:
``` {r, echo=FALSE}
n <- 25
x <- 10
p <- .5
```
``` {r,eval=FALSE}
prop.test(x, n, p)
```
The output includes a *p*-value. Conventionally, a *p*-value of less
than 0.05 indicates that the true proportion is unlikely to be *p*,
whereas a *p*-value exceeding 0.05 fails to provide such evidence.
### Discussion {-#discussion-id225}
Suppose you encounter some loudmouthed fan of the Chicago Cubs early in
the baseball season. The Cubs have played 20 games and won 11 of them,
or 55% of their games. Based on that evidence, the fan is “very
confident” that the Cubs will win more than half of their games this
year. Should he be that confident?
The `prop.test` function can evaluate the fan’s logic. Here, the number
of observations is *n* = 20, the number of successes is *x* = 11, and
*p* is the true probability of winning a game. We want to know whether
it is reasonable to conclude, based on the data, that *p* > 0.5.
Normally, `prop.test` would check for *p* ≠ 0.5, but we can check for
*p* > 0.5 instead by setting `alternative="greater"`:
``` {r}
prop.test(11, 20, 0.5, alternative = "greater")
```
The `prop.test` output shows a large *p*-value, 0.55, so we cannot
reject the null hypothesis; that is, we cannot reasonably conclude that
*p* is greater than 1/2. The Cubs fan is being overly confident based on
too little data. No surprise there.
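For contrast, here is a sketch of the default two-sided test, which asks
whether *p* differs from 0.5 in either direction:
``` {r, eval=FALSE}
prop.test(11, 20, 0.5)   # default: alternative = "two.sided"
```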
Forming a Confidence Interval for a Proportion {#recipe-id226}
----------------------------------------------
### Problem {-#problem-id226}
You have a sample of values from a population consisting of successes
and failures. Based on the sample data, you want to form a confidence
interval for the population’s proportion of successes.
### Solution {-#solution-id226}
Use the `prop.test` function. Suppose the sample size is *n* and the
sample contains *x* successes:
``` {r, echo=FALSE}
set.seed(19)
n <- 75
x <- rbinom(1, size = n, prob = .5)
```
``` {r, eval=FALSE}
prop.test(x, n)
```
The function output includes the confidence interval for *p*.
### Discussion {-#discussion-id226}
We subscribe to a stock market newsletter that is well written,
but includes a section purporting to identify stocks that are
likely to rise. It does this by looking for a certain pattern in the
stock price. It recently reported, for example, that a certain stock was
following the pattern. It also reported that the stock rose six times
after the last nine times that pattern occurred. The writers concluded
that the probability of the stock rising again was therefore 6/9, or
66.7%.
Using `prop.test`, we can obtain the confidence interval for the true
proportion of times the stock rises after the pattern. Here, the number
of observations is *n* = 9 and the number of successes is *x* = 6. The
output shows a confidence interval of (0.309, 0.910) at the 95%
confidence level:
``` {r}
prop.test(6, 9)
```
The writers are pretty foolish to say the probability of rising is
66.7%. They could be leading their readers into a very bad bet.
By default, `prop.test` calculates a confidence interval at the 95%
confidence level. Use the `conf.level` argument for other confidence
levels:
``` {r, echo=FALSE}
n <- 100
x <- rbinom(1, size = n, prob = .5)
p <- .5
```
``` {r, eval=FALSE}
prop.test(x, n, p, conf.level = 0.99) # 99% confidence level
```
### See Also {-#see_also-id226}
See Recipe \@ref(recipe-id225), ["Testing a Sample Proportion"](#recipe-id225).
Testing for Normality {#recipe-id126}
---------------------
### Problem {-#problem-id126}
You want a statistical test to determine whether your data sample is
from a normally distributed population.
### Solution {-#solution-id126}
Use the `shapiro.test` function:
``` {r, echo=FALSE}
set.seed(3)
x <- rnorm(200)
```
``` {r, eval=FALSE}
shapiro.test(x)
```
The output includes a *p*-value. Conventionally, *p* < 0.05 indicates
that the population is likely not normally distributed, whereas *p* >
0.05 provides no such evidence.
### Discussion {-#discussion-id126}
This example reports a *p*-value of .7765 for *x*:
``` {r}
shapiro.test(x)
```
The large *p*-value suggests the underlying population could be normally
distributed. The next example reports a very small *p*-value for *y*, so it
is unlikely that this sample came from a normal population:
``` {r, echo=FALSE}
y <- rlnorm(100)
```
``` {r}
shapiro.test(y)
```
We have highlighted the Shapiro–Wilk test because it is a standard R
function. You can also install the package `nortest`, which is dedicated
entirely to tests for normality. This package includes the following tests
(a sample call appears after the list):
- Anderson–Darling test (`ad.test`)
- Cramer–von Mises test (`cvm.test`)
- Lilliefors test (`lillie.test`)
- Pearson chi-squared test for the composite hypothesis of normality
(`pearson.test`)
- Shapiro–Francia test (`sf.test`)
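For example, assuming `nortest` is installed, the Anderson–Darling test
follows the same calling pattern as `shapiro.test`:
``` {r, eval=FALSE}
library(nortest)
ad.test(y)   # expect a small p-value here: y is a lognormal sample
```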
The problem with all these tests is their null hypothesis: they all
assume that the population is normally distributed until proven
otherwise. As a result, the population must be decidedly non-normal
before the test reports a small *p*-value and you can reject that null
hypothesis. That makes the tests quite conservative, tending to err on
the side of normality.
Instead of depending solely upon a statistical test, we suggest also
using histograms (Recipe \@ref(recipe-id182), ["Creating a Histogram"](#recipe-id182)) and quantile-quantile plots
(Recipe \@ref(recipe-id180), ["Creating a Normal Quantile–Quantile Plot"](#recipe-id180)) to evaluate the normality of any data. Are the
tails too fat? Is the peak too peaked? Your judgment is likely better
than a single statistical test.
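A minimal base-graphics sketch of that visual check, using the lognormal
sample `y` from above:
``` {r, eval=FALSE}
hist(y)     # look for a symmetric, bell-shaped histogram
qqnorm(y)   # points near a straight line suggest normality
qqline(y)   # reference line through the quartiles
```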
### See Also {-#see_also-id126}
See Recipe \@ref(recipe-id012), ["Installing Packages from CRAN"](#recipe-id012), for how to install the `nortest` package.
Testing for Runs {#recipe-id131}
----------------
### Problem {-#problem-id131}
Your data is a sequence of binary values: yes/no, 0/1, true/false, or
other two-valued data. You want to know: Is the sequence random?
### Solution {-#solution-id131}
The `tseries` package contains the `runs.test` function, which checks a
sequence for randomness. The sequence should be a factor with two
levels:
``` {r, echo=FALSE}
library(tseries)
s <- factor(sample(c("T", "F"), 20, replace = TRUE))
```
``` {r, eval=FALSE}
library(tseries)
runs.test(as.factor(s))
```
The `runs.test` function reports a *p*-value. Conventionally, a
*p*-value of less than 0.05 indicates that the sequence is likely not
random, whereas a *p*-value exceeding 0.05 provides no such evidence.
### Discussion {-#discussion-id131}
A run is a maximal subsequence of identical values, such as a block of
`1`s or a block of `0`s. A random sequence should be properly jumbled up,
without too *few* runs: long blocks of identical values are suspicious.
It shouldn’t contain too *many* runs, either. A sequence of perfectly
alternating values (`0`, `1`, `0`, `1`, `0`, `1`, ...) consists entirely of
runs of length 1, the maximum possible number of runs, but would you say
that it’s random?
The `runs.test` function checks the number of runs in your sequence. If
there are too many or too few, it reports a small *p*-value.
This first example generates a random sequence of `0`s and `1`s and then
tests the sequence for runs. Not surprisingly, `runs.test` reports a
large *p*-value, indicating the sequence is likely random:
``` {r}
s <- sample(c(0, 1), 100, replace = T)
runs.test(as.factor(s))
```
This next sequence, however, consists of three runs and so the reported
*p*-value is quite low:
``` {r}
s <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0)
runs.test(as.factor(s))
```
### See Also {-#see_also-id131}
See Recipes \@ref(recipe-id051), ["Creating a Factor"](#recipe-id051), and \@ref(recipe-id200), ["Generating Random Sequences"](#recipe-id200).
Comparing the Means of Two Samples {#recipe-id122}
----------------------------------
### Problem {-#problem-id122}
You have one sample each from two populations. You want to know if the
two populations could have the same mean.
### Solution {-#solution-id122}
Perform a *t* test by calling the `t.test` function:
``` {r, echo=FALSE}
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(100, mean = 0, sd = 2)
```
``` {r, eval=FALSE}
t.test(x, y)
```
By default, `t.test` assumes that your data are not paired. If the
observations are paired (i.e., if each *x~i~* is paired with one
*y~i~*), then specify `paired=TRUE`:
``` {r, eval=FALSE}
t.test(x, y, paired = TRUE)
```
In either case, `t.test` will compute a *p*-value. Conventionally, if
*p* < 0.05 then the means are likely different, whereas *p* > 0.05
provides no such evidence:
- If either sample size is small, then the populations must be
normally distributed. Here, “small” means fewer than 20 data points.
- If the two populations have the same variance, specify