-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathStat639V-FinalProjectCS.Rmd
2765 lines (2267 loc) · 134 KB
/
Stat639V-FinalProjectCS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: \vspace{3.5in} Survival Analysis of Breastfeeding Cessation
author: "Carson Stacy"
date: "`r format(Sys.time(), '%B %d, %Y')`"
subtitle: "STAT 639V Survival Analysis Final Project"
output:
pdf_document: default
toc: true
bibliography: references.bib
header-includes:
\newcommand{\beginsupplement}{\setcounter{table}{0}\renewcommand{\thetable}{S\arabic{table}}\setcounter{figure}{0} \renewcommand{\thefigure}{S\arabic{figure}}}
---
\newpage
\tableofcontents
\newpage
# Introduction
Breastfeeding has been a topic of academic and public interest since the invention of formula and the feeding bottle in the 19th century [@stevens2009history]. Since the invention of baby formula, breastfeeding rates have reduced dramatically. The breastfeeding rate was 90% in the 20th century, but has decreased to approximately 37% in the 21st century [@gaynor2003breastfeeding; @victora2016breastfeeding]. This trend has many scientists concerned [@mcfadden2016spotlight; @perez2020breastfeeding; @pomeranz2021breastmilk].
The importance of breastfeeding in low and middle-income nations is widely acknowledged. In low-income nations, unclean water in formula is a death sentence for an infant [@perez2012scaling]. Perhaps this partially explains why the prevalence of breastfeeding is higher in low and middle-income nations than in high-income nations. A large body of scientific research spanning public health studies to cell biology experiments show the importance of promoting infant breastfeeding everywhere [@hoddinott2008breast; @walters2019cost]. From kick-starting the infant's gut microbiome via human milk oligosaccharides, to the transfer of important immune molecules (e.g. IgA) to transfer of stem cells [@pannaraj2017association; @hassiotou2014dawn] and micro-RNAs from mother to infant suggested to regulate infant gene expression [@munch2013transcriptome; @van2020impact]. Beyond positive impacts on the child's current and future health, benefits have been shown for the breast feeding mother as well [@leon2002quantifying]. From the [World Health Organization](https://www.who.int/health-topics/breastfeeding#tab=tab_1) to the American Academy of Pediatrics [@section2012breastfeeding], most doctors and organizations avidly support exclusive breastfeeding during the first six month of an infant's life.
It is apparent that breastfeeding is important for health -- or is it? Despite the ubiquity of recommendations regarding breastfeeding, there exists less high quality data on the topic than might be expected. Ethical considerations make the gold standard double-blind experimental design a non-starter, so observational studies and their confounding baggage are the norm in breastfeeding literature. A hallmark study in the 1990's in Belarus called the PROBIT trial involved 17,000 mothers which were experimentally "treated" with promotion of breastfeeding while the control group was not [@kramer2001promotion]. The results of this trial were mixed. In the context of immediate health benefits of the child, breast feeding showed a significant reduction in: number of gastrointestinal infections, likelihood of eczema and other rashes. However, no significant differences were seen in any other considered outcomes (e.g., respiratory infections, ear infections, wheezing, mortality). Regarding long-term outcomes, the PROBIT trial found no effect on any long-term outcomes measured. Sibling studies, which compare outcomes of siblings pairs where one was breastfed while the other bottle fed, find no impact on any measured outcomes [@colen2014breast; @raissian2018best].
It has been argued that the differences seen in many observational studies comparing breast and bottle fed infants are the result of maternal selection. In other words, mothers are not deciding randomly whether to feed their infants with breast or bottle. In the US, mothers who breastfeed tend to be more highly educated and wealthier than mothers who bottle feed. A recent study suggests
> *"...most physical health benefits associated with breastfeeding are likely attributable to demographic characteristics such as race and socioeconomic status, and other difficult to measure unobservable characteristics." -* (Raissan and Su, 2018)
The controversy is not against breastfeeding, especially in low-income nations, rather it is promoting communication evidence-based of the magnitude of benefits of breastfeeding.
It is in the context of thinking about a mother's breastfeeding decisions through a socioeconomic lens that this project examines time to cessation of breastfeeding data of new mothers from the National Longitudinal Survey of Youth (NSLY, 1995). A finding that demographic factors have no effect on time to cessation of breastfeeding would be unexpected based on the claims of @raissian2018best. A finding of significant differences does not confirm their assertions, but rather provides valuable information about relevant demographic variables related to breast feeding cessation and context for considering some observational research finding drastic benefits of breast feeding. Additionally, this analysis shows the utility of sfurvival analysis methodology in time-to-event scenarios such as breast feeding cessation.
## The Data
\vspace{-2truemm}
This project utilizes data on breastfeeding decisions of young mothers compiled from the National Longitudinal Survey of Youth (NLSY, 1995) personal interviews conducted by the United States Bureau of Labor Statistics branch of the US Department of Labor. All NLSY files are public access, and can be downloaded from <http://www.bls.gov/nls/nlsy79.html>. The data set was compiled as part of the text *Survival Analysis Techniques for Censored and truncated data* by @klein2003survival, available in the `KMsurv` package as `bfeed`.
```{r, quietly=TRUE, echo=FALSE, warning=FALSE, message=FALSE}
library(tidyverse) # package for data analysis
library(ggsci) # for graph colors
library(KMsurv) # package with data set
library(survival) # package for survival analysis
library(survminer) # for ggsurvplot viz
library(rstatix) # package for piped stat tests
data("bfeed") # load data set into R
```
\vspace{-2truemm}
### Variables
\vspace{-3truemm}
The data is comprised of data from `r toString(nrow(bfeed))` new mothers, with `r toString(ncol(bfeed))` variables recorded for each individual. Descriptions for each variable recorded can be seen in Table 1 below. There are six categorical variables, of which only `race` of mother has more than two categories. There are 4 numerical variables, all discrete integers with a sufficient number of values to loosely approximate continuity.
```{r, echo=FALSE}
# table 1
var_names <- tibble(
c(1:ncol(bfeed)),
c(colnames(bfeed)),
c("duration of breastfeeding (weeks)",
"indicator of child weaning",
"race of mother",
"mother in poverty",
"mother smoked at birth of child",
"mother used alcohol at birth of child",
"age of mother at birth of child",
"year of birth",
"education level of mother (years of school)",
"prenatal care after 3rd month"
)
)
knitr::kable(var_names,
col.names = c("No.", "Variable ID", "Variable definition"),
caption = "List of variable IDs and their definitions"
)
```
```{r, echo=FALSE}
# data processing
bfeed_og <- bfeed %>%
mutate(SurvObj = Surv(duration, delta == 1) )
# raw data processing
bfeed <- bfeed %>%
mutate(race = factor(
recode(race, "1" = "white", "2" = "black", "3"="other"),
levels = c("white", "black", "other")
),
status = delta,
delta = factor(
recode(delta, "1" = "yes", "0" = "no"),
levels = c("no", "yes")
),
poverty = factor(
recode(poverty, "1" = "yes", "0" = "no"),
levels = c("no", "yes")
),
smoke = factor(
recode(smoke, "1" = "yes", "0" = "no"),
levels = c("no", "yes")
),
alcohol = factor(
recode(alcohol, "1" = "yes", "0" = "no"),
levels = c("no", "yes")
),
pc3mth = factor(
recode(pc3mth, "1" = "yes", "0" = "no"),
levels = c("no", "yes")
),
#convert years of education to an ordered categorical variable
education = factor(
recode(
cut(yschool,
breaks = c(0,11.5, 12.5, max(yschool)),
labels = F
),
"1" = "<HS", "2" = "HS",
"3" = ">HS"),
levels = c("<HS", "HS", ">HS")
),
ybirth = ybirth + 1900 , # making data more exact for visualization
SurvObj = Surv(duration, delta == "yes"), #add Survival Obj variable
race_o = case_when(
race == "other" ~ 1,
race != "other" ~ 0),
race_b = case_when(
race == "black" ~ 1,
race != "black" ~ 0)
)
```
The *event* in this data is self-reported cessation of breastfeeding of new mothers interviewed. In the context of time-to-event analysis, the indicator variable for whether or not breastfeeding had been ceased at the time of the interview was `delta`, and the time from birth of the child to cessation of breastfeeding is coded as the variable `duration`. If the mother has not yet stopped breastfeeding the child at the time of the interview, then the `duration` variable instead represents time from birth of the child to time of the interview. In the context of survival analysis, these data are considered to be right-censored. In other words, for these patients the study stopped before the event of stopping breastfeeding had not yet occurred. It is unclear the exact definition of *completing breast feeding* utilized in the survey methodology. Whether this means the end of utilizing breast milk as the sole food source for the child vs completely removing breast milk from the infant's diet. To summarize, variables three through ten in Table 1 are candidate predictors for variable one while accounting for the censoring of some participants indicated via variable two.
\vspace{-2truemm}
### Experimental Design
\vspace{-3truemm}
Data on breastfeeding used in this study has been extracted from a large set of surveys sampling several thousand individuals, many of whom have been surveyed over decades. The NLSY79 Child and Young Adult surveys include a wide variety of information on children born to female respondents of the NLSY79 surveys. Parents reported in interviews on many aspects of the raising of their child, among that corpus of information are the data shown here.
Participants were chosen randomly from the United States population for the study, so responses from all 50 states and outlying territories are included in the sample. Detailed information about the design of the survey is available at <https://www.bls.gov/nls/nlsy79.htm#intro-to-sample>. Relevant surveys were conducted from 1979 through 1986 and questions related to breastfeeding were asked to mothers who had given birth in the past 12 months. Information about duration of breastfeeding was provided by mothers via memory recall. Responses to other variables (e.g. smoking at the time of birth) were also provided by the mother. A sample of the data itself can be seen in Table 2. The `SurvObj` variable combines the `duration` and `delta` variables to give duration with participants who were still breastfeeding at last interview denoted with the `+` symbol.
```{r, echo=FALSE}
#table 2
bfeed %>%
select(-c(race_o, race_b, education, status)) %>%
head(n=5) %>%
knitr::kable(caption = "Time to cessation of breastfeeding data set")
```
### Censoring and Missing Values
\vspace{-3truemm}
```{r, echo=FALSE}
n_censored <- bfeed %>%
filter(delta=="no") %>%
count() %>%
pull()
```
In this data set, a total of `r toString(n_censored)` mothers were still breastfeeding their infant at the time of their final data collection interview. Most of these censoring events occurred in the final year(s) of the study, when time from birth to final interview was significantly less than the nearly 10 years from birth of child to final interview of the earliest participants in the study. The compiled dataset does not contain any information about possible patient drop-out or missed interviews.
```{r, fig.cap = "Histograms of numerical variable distributions.", echo=FALSE}
# Fig1a
fig1 <- bfeed %>%
select(-c(status, SurvObj, race_o, race_b)) %>% # artificial duplicate variables
rename("Age of Mother at Birth" = agemth, # change names
"Duration of Breastfeeding (weeks)" = duration,
"Breastfeeding Completed" = delta,
"Year Child was Born" = ybirth,
"Years of Schooling - Mother" = yschool,
"Race of Mother" = race,
"Poverty" = poverty,
"Alcohol" = alcohol,
"Prenatal Care after 3rd Month" = pc3mth,
"Years of Education - Mother" = yschool
) %>%
select_if(is.numeric) %>%
gather() %>%
ggplot(aes(value))+
geom_histogram(bins = 30)+
facet_wrap(~key, scales = "free") +
theme_minimal() +
scale_x_continuous(n.breaks = 10)+
labs(x=" ", y="Number of Participants") +
theme(plot.title = element_text(size = 12)) +
ggtitle("Distribution of numerical variables")
```
```{r echo=FALSE, fig.cap="Distributions of categorical and numerical variables comprising the data set.", fig.height=5, fig.width=6.5, warning=FALSE}
#fig1b
fig2 <- bfeed %>%
select(-education) %>%
rename("Age of Mother at Birth" = agemth, # change names
"Duration of Breastfeeding (weeks)" = duration,
"Breastfeeding Completed" = delta,
"Year Child was Born" = ybirth,
"Years of Schooling - Mother" = yschool,
"Race of Mother" = race,
"Poverty" = poverty,
"Alcohol" = alcohol,
"Prenatal Care after 3rd Month" = pc3mth,
"Years of Education - Mother" = yschool
) %>%
select_if(negate(is.numeric)) %>%
gather() %>%
ggplot(aes((value)))+
geom_bar()+
facet_wrap(~key, scales = "free")+
theme_minimal() +
# scale_fill_grey() +
theme(legend.position = "none") +
ggtitle("Distribution of categorical variables") +
theme(plot.title = element_text(size = 12)) +
labs(x = "Participant Response", y = "Number of Participants")
ggarrange(fig1, fig2, nrow =2)
```
# Methods and Data Analysis
\vspace{-2truemm}
The techniques of survival analysis allow for useful descriptions of time-to-event data. The primary function of survival analysis is the probability of survival beyond time `t`, called the survival function.
$$
S(t) = Pr(T > t) = 1 - F(t)
$$
where T is the random variable survival time, in this case T represents the duration of breast feeding in weeks. A characteristic of the survival function $S(t)$ is that it is the complement of the cumulative density function (CDF) $F(t)$ which itself is the integral of the probability density function $f(t)$ from 0 to 1.
Another essential function for analyzing time-to-event data is the hazard function, which is the instantaneous rate of event occurrence at time t given survival to time t,
$$
\displaystyle h(t) = \lim_{\Delta{t} \to 0} \frac{Pr(t \leq T < t + \Delta{t} \mid T \geq t)}{\Delta{t}}
$$
where $h(t)$ is the hazard function.
Survival analysis involves estimating these functions based on the data. There exist non-parametric, parametric, and semi-parametric models for estimating the survival function. In this project, the non-parametric method is the Kaplan-Meier (KM) estimator. The log-rank test is used in testing for statistical differences between KM curves. The parametric method is fitting to a distribution (e.g., Weibull or exponential) and using the relationships above to estimate the survival or hazard curves. Finally, perhaps the most common technique in survival analysis is the Cox proportional hazards model, a semi-parametric approach. Each of these approaches are utilized below. The predefined significance level $\alpha$ will be set at 0.05 for determining significance for this analysis unless otherwise specified. Prior to analyzing the data, a visual summary of the data set is available in Figure 1 above.
## Kaplan Meier Survival Estimates
The Kaplan-Meier curve shows an estimate of the time to an event, which here represents the time in weeks until a mother stops breastfeeding her child. The Kaplan-Meier estimator for the survival function is:
$$
\hat{S} =
\begin{cases}
1 & t < t_1 \\
\prod_{t \geq t_i} [1 - \frac{d_i}{Y_i}] & t \geq t_1 \\
\end{cases}
$$
where $1 \leq d_i \leq Y_i$ with $t_i$ representing the distinct time at which breastfeeding ceased, $Y_i$ representing the number of individuals still breastfeeding at time $t_i$, and $d_i$ is the number of individuals who stopped breastfeeding at time $t_i$.
The Kaplan-Meier curve drops only when an individual stops breastfeeding, not when they are censored. The survival function, as well as its estimators, are bound in value from between 0 and 1. Confidence interval estimates for the KM curve can be estimated via variance. A Kaplan-Meier curve of the entire survey sample can be seen in Figure 3 below.
```{r KMcurve_all, fig.cap="KM curve for all participants for duration of breastfeeding. Gray shaded region represents a pointwise 95% confidence interval for the survival curve. The symbol + corresponds to a censoring time. The number of participants still breastfeeding at a given number of weeks is shown below the survival curve plot.", echo=FALSE, warning=FALSE}
#fig2
# create survival object:
km.as.one <- survfit(SurvObj ~ 1, data = bfeed)
#KM plot combining all participants
gg.km.as.one <- ggsurvplot(
km.as.one, # survfit object with calculated statistics.
data = bfeed, # data used to fit survival curves.
risk.table = TRUE, # show risk table.
#pval = TRUE, # show p-value of log-rank test.
# conf.int = TRUE, # show confidence intervals for
# point estimates of survival curves.
xlim = c(0,200), # present narrower X axis, but not affect
# survival estimates.
palette = "black", # make combined curve black
xlab = "Time (weeks)", # customize X axis label.
break.time.by = 20, # break X axis in time intervals by 20.
ggtheme = theme_minimal(),# customize plot and risk table with a theme.
risk.table.y.text.col = T, # colour risk table text annotations.
risk.table.y.text = FALSE, # show bars instead of names in text annotations
# in legend of risk table
# palette = "uchicago", # change colors to be pretty
log.rank.weights = "S1", # Peto Peto test for log-rank test
pval.method.coord = c(120,90), # location of p-value text
pval.method.size = 3, # p-val text size
# title = "KM Curve - Duration of Breast Feeding",
legend = "none",
fun = "pct" #show survival function as percentage
)
gg.km.as.one$plot <- gg.km.as.one$plot +
geom_vline(xintercept=26, color="red", linetype = "dashed") + # add 6mo marker (WHO recommendation)
geom_label(label="6 months", x=40, y = 75)
gg.km.as.one
# get 6 month survival time estimate
pct6mo.all <- gg.km.as.one$data.survplot %>%
filter(time == 26) %>%
pull(surv)
#upperbound
u6mo.all <- gg.km.as.one$data.survplot %>%
filter(time == 26) %>%
pull(upper)
# lowerbound for est
l6mo.all <- gg.km.as.one$data.survplot %>%
filter(time == 26) %>%
pull(lower)
```
The red dashed line on these curves corresponds to the 6 months of breastfeeding milestone, which is the WHO recommendation for breastfeeding children [@grummer2017new]. Based on the KM curve in Figure 2, only `r toString(round(pct6mo.all,2)*100)`% of mothers interviewed reported breastfeeding their children at least 6 months CI = (`r toString(round(l6mo.all,2)*100)`% , `r toString(round(u6mo.all,2)*100)`%). The KM estimate for median duration of breastfeeding for all mothers was `r toString(surv_median(km.as.one)$median)` weeks, which is `r toString(26 - surv_median(km.as.one)$median)` weeks less than the WHO recommended 6 months minimum.
### KM curve - Cross Group Comparisons
```{r , echo=FALSE}
#prep fig3
km.by.smoke <- survfit(SurvObj ~ smoke, data = bfeed)
#KM curve according to smoking
gg.km.by.smoke <- ggsurvplot(
km.by.smoke, # survfit object with calculated statistics.
data = bfeed, # data used to fit survival curves.
risk.table = TRUE, # show risk table.
pval = TRUE, # show p-value of log-rank test.
pval.coord = c(120,80), # location of pval
pval.method = TRUE, # show type of pval shown
conf.int = TRUE, # show confidence intervals for
# point estimates of survival curves.
xlim = c(0,200), # present narrower X axis, but not affect
# survival estimates.
xlab = "Time in weeks", # customize X axis label.
break.time.by = 20, # break X axis in time intervals by 500.
ggtheme = theme_minimal(),# customize plot and risk table with a theme.
risk.table.y.text.col = T, # colour risk table text annotations.
risk.table.y.text = FALSE, # show bars instead of names in text annotations
# in legend of risk table
# palette = "uchicago", # change colors to be pretty
log.rank.weights = "1", # normal log-rank test
pval.method.coord = c(120,90), # location of p-value text
pval.method.size = 3, # p-val text size
# title = "KM Curve - Smoking",
fun = "pct" #show survival function as percentage
)
gg.km.by.smoke$plot <- gg.km.by.smoke$plot +
geom_vline(xintercept=26, color="red", linetype = "dashed") + # add 6mo marker (WHO recommendation)
geom_label(label="6 months", x=40, y = 75)
# get 6 month survival time estimate smoke
pct6mo.smoke <- gg.km.by.smoke$data.survplot %>%
filter(time == 26) %>%
pull(surv)
#upperbound
u6mo.smoke <- gg.km.by.smoke$data.survplot %>%
filter(time == 26) %>%
pull(upper)
# lowerbound for est
l6mo.smoke <- gg.km.by.smoke$data.survplot %>%
filter(time == 26) %>%
pull(lower)
```
In the survey, mothers were asked whether they were smoking at the time that they gave birth to their child. Kaplan-Meier curves for mothers who reported smoking compared to those who did not can be seen in Figure 3. It is possible to compare the point estimates for the proportion of mothers in the smoking vs nonsmoking group who were still breastfeeding at 6 months. For mothers in the smoking group, `r toString(round(pct6mo.smoke[[2]],2)*100)`% of mothers interviewed reported breastfeeding their children at least 6 months CI = (`r toString(round(l6mo.smoke[[2]],2)*100)`% , `r toString(round(u6mo.smoke[[2]],2)*100)`%). In the group of non-smoking mothers, `r toString(round(pct6mo.smoke[[1]],2)*100)`% reported breastfeeding their children at least 6 months CI = (`r toString(round(l6mo.smoke[[1]],2)*100)`% , `r toString(round(u6mo.smoke[[1]],2)*100)`%). The question then arises whether the difference between these two curves is significant. The overlapping point estimate confidence intervals at 6 months suggests the difference may be due to chance; however, survival analysis provides a more robust way to test for a difference between these two groups.
```{r smoke, echo=FALSE, fig.cap="KM curve for duration of breastfeeding according to whether mother smoked when child was born. Shaded regions represents a pointwise 95% confidence interval for the survival curves. The symbol + corresponds to a censoring time. The number of participants still breastfeeding by group at a given number of weeks is shown below the survival curve plot."}
#figure3
gg.km.by.smoke
```
\vspace{-3truemm}
#### Log-rank test
\
\vspace{-5truemm}
The log-rank test ($H_0$: no difference) can be used to compare if the difference between these KM curves is significant. We can see in Figure 3 above that there appears to be a difference in survival curves between mothers who smoked at birth of their child and mothers who did not. Given differences at early time points are considered more important from a public health perspective, the peto-peto modification would be an appropriate tool to use for testing for a difference at earlier time points in this data.
Some categorical variables are composed of more than two variables. The chi-squared generalization of the log-rank test, implemented in the R package `survival` with the function `survdiff()` can be utilized, testing for whether at least one of the curves is significantly different. In the supplemental materials section, KM curves for other variables along with their corresponding peto-peto test p-values for difference are available as Figure S1.
Categorical variables that appear to have a significant effect on survival based on the log-rank test of KM curves are the race of the mother and their smoking status (Table 3).
```{r, echo=FALSE}
#table 3
# calculate log-rank test p-values for each categorical variable
# race
lr_p_race <- pchisq(
survdiff(SurvObj ~ race, bfeed)$chisq,
length(survdiff(SurvObj ~ race, bfeed)$n)-1, lower.tail=F
)
# poverty
lr_p_poverty <- pchisq(
survdiff(SurvObj ~ poverty, bfeed)$chisq,
length(survdiff(SurvObj ~ poverty, bfeed)$n)-1, lower.tail=F
)
# smoke
lr_p_smoke <- pchisq(
survdiff(SurvObj ~ smoke, bfeed)$chisq,
length(survdiff(SurvObj ~ smoke, bfeed)$n)-1, lower.tail=F
)
# alcohol
lr_p_alcohol <- pchisq(
survdiff(SurvObj ~ alcohol, bfeed)$chisq,
length(survdiff(SurvObj ~ alcohol, bfeed)$n)-1, lower.tail=F
)
# prenatal care after 3mo
lr_p_pc3mth <- pchisq(
survdiff(SurvObj ~ pc3mth, bfeed)$chisq,
length(survdiff(SurvObj ~ pc3mth, bfeed)$n)-1, lower.tail=F
)
log_rank_pvals <- tibble(
"Variable ID" = c("race", "poverty", "smoke", "alcohol", "pc3mth"),
"p-value" = c(lr_p_race, lr_p_poverty, lr_p_smoke, lr_p_alcohol, lr_p_pc3mth)
)
log_rank_pvals %>%
add_significance("p-value") %>%
knitr::kable(caption = "p-values of log-rank test for difference in breastfeeding duration according to single variable",
label = "ns corresponds to a p-value greater than 0.05, * a p-value less than 0.05, and ** a p-value less than 0.01 for log-rank test for difference based on variable ID.")
```
The KM curve is not the ideal approach for resolving the effect of numerical predictor variables on the response variable breastfeeding duration. Other approaches described below are better suited to elucidating these types of variables.
### Kaplan-Meier Curve Assumptions
There are three major assumptions of the Kaplan-Meier estimator: first, that the censored participants have the same breastfeeding duration as the participants who are not censored; second, that the time of recruitment into the study does not effect the survival outcomes; and lastly, that events happened when they are said to have happened[@goel2010understanding]. Given the nature of the data, the first assumption appears to be reasonable. The second assumption will be shown by subsequent analysis below to be perhaps a poor assumption. The final assumption is likely not entirely true, but ideally the errors in recall of participants will be randomly distributed throughout the survey sample, reducing the effect of incorrect information.
Given that earlier ages of stopping breastfeeding are considered biologically more important, the peto-peto modification would better resolve differences early in the estimated survival curves. Regardless of the version of the log-rank test used, the assumptions for these tests are based on the KM assumptions, so the use of these tests is appropriate under weak assumptions.
## Parametric Survival Estimates
Parametric modelling is a powerful tool for analyzing a wide variety. Linear regression is perhaps the most widely known parametric regression model in the scientific research community. Parametric models also exist for analyzing time-to-event data such as time to cessation of breastfeeding discussed here. Moving beyond univariate analyses, parametric survival models allow for description of numerical and multivariate models. There are several models that exist for modeling time-to-event data. Three parametric models will be fit to the data: the exponential, Weibull, and lognormal models. The exponential and Weibull models are interesting in that they can utilize either the accelerated failure time (AFT) assumption or the proportional hazards (PH) assumption for describing survival data. To summarize briefly, proportional hazards assumes that the hazard ratio for any two individuals is constant over time while the AFT model assumes that the effects of covariates are fixed and multiplicative by the acceleration factor on the time scale of $t$. The commonality across these parametric models is that they assume the outcome follows some known distribution.
<!-- (e.g., assumptions, statistical models, parameter estimations, goodness-of-fit test, cross-group comparisons, predictions and validations, etc) -->
### Exponential
The exponential model is a less complicated model because its function is time-independent:
$$
h(t) = \lambda
$$
and
$$
S(t) = e^{-\lambda t}
$$
where $h(t)$ is the hazard function and $S(t)$ is the survival function. These are the AFT parameterizations of the exponential model. Note that the the hazard is a constant $\lambda$.
#### Assumptions of the exponential survival model \
The key assumption of the exponential survival model are that the the hazard rate is constant, derived from the memoryless property of the exponential distribution. Based on what is known about cessation of breastfeeding, this would not seem likely to be a valid assumption; however, figure 4 shows the exponential model provides a reasonable fit to the data. Trends between the different racial groups correspond to regression outputs below showing similar survival for these three groups.
```{r exponentialAIC, echo=FALSE}
# prep for expo
# exponential parametric model
# create vector of AIC values for expo
expAIC <- rep(0.0, 9)
## include all parameters (diy backselection)
expreg_all <- survreg(SurvObj ~ race + poverty + smoke + alcohol + agemth + ybirth + yschool + pc3mth, bfeed, dist="exponential")
expAIC[1] <- AIC(expreg_all)
# summary(expreg_all) # has highest p-val: pc3mth
expreg_less1 <- survreg(SurvObj ~ race + poverty + smoke + alcohol + agemth + ybirth + yschool, bfeed, dist="exponential")
expAIC[2] <- AIC(expreg_less1)
# summary(expreg_less1) # has highest p-val: agemth
expreg_less2 <- survreg(SurvObj ~ race +
poverty + smoke +
alcohol + yschool +
ybirth, bfeed, dist="exponential")
expAIC[3] <- AIC(expreg_less2)
# summary(expreg_less2) # has highest p-val: alcohol
expreg_less3 <- survreg(SurvObj ~ race +
poverty + smoke + yschool + ybirth,
bfeed, dist="exponential")
expAIC[4] <- AIC(expreg_less3)
# summary(expreg_less3) # has highest p-val: poverty
expreg_less4 <- survreg(SurvObj ~ race + smoke + yschool + ybirth,
bfeed, dist="exponential")
expAIC[5] <- AIC(expreg_less4)
# summary(expreg_less4) # has highest p-val: race
expreg_less5 <- survreg(SurvObj ~ smoke + yschool + ybirth,
bfeed, dist="exponential")
expAIC[6] <- AIC(expreg_less5)
# summary(expreg_less5) # has highest p-val: smoke
expreg_less6 <- survreg(SurvObj ~ yschool + ybirth,
bfeed, dist="exponential")
expAIC[7] <- AIC(expreg_less6)
# summary(expreg_less6) # has highest p-val: yschool
expreg_less7 <- survreg(SurvObj ~ ybirth,
bfeed, dist="exponential")
expAIC[8] <- AIC(expreg_less7)
# summary(expreg_less7)
expreg_less8 <- survreg(SurvObj ~ 1,
bfeed, dist="exponential")
expAIC[9] <- AIC(expreg_less8)
# summary(expreg_less8)
```
```{r exponential, echo=FALSE, fig.cap="Comparing KM curve for duration of breastfeeding according to mothers' race with exponential regression model."}
#figure 4
# predicted curves based on exponential best model
pred.exp1 = predict(expreg_less4, newdata=list(race="white", smoke = "no", yschool=12, ybirth=1982),type="quantile",p=seq(.01,.99,by=.01))
pred.exp2 = predict(expreg_less4, newdata=list(race="black", smoke = "no", yschool=12, ybirth=1982),type="quantile",p=seq(.01,.99,by=.01))
pred.exp3 = predict(expreg_less4, newdata=list(race="other", smoke = "no", yschool=12, ybirth=1982),type="quantile",p=seq(.01,.99,by=.01))
# put preds together in tidy df
df = data.frame(y=seq(99,1,by=-1), race_white=pred.exp1, race_black=pred.exp2, race_other = pred.exp3)
df_long = gather(df, key= "race", value="time", -y)
#km fit race
km.by.race <- survfit(SurvObj ~ race, data = bfeed)
#KM curve according to race
gg.km.by.race.exp <- ggsurvplot(
km.by.race, # survfit object with calculated statistics.
data = bfeed, # data used to fit survival curves.
# risk.table = TRUE, # show risk table.
pval = TRUE, # show p-value of log-rank test.
pval.coord = c(120,80), # location of pval
pval.method = TRUE, # show type of pval shown
conf.int = TRUE, # show confidence intervals for
# point estimates of survival curves.
xlim = c(0,200), # present narrower X axis, but not affect
# survival estimates.
xlab = "Time in weeks", # customize X axis label.
break.time.by = 20, # break X axis in time intervals by 500.
ggtheme = theme_minimal(),# customize plot and risk table with a theme.
risk.table.y.text.col = T, # colour risk table text annotations.
risk.table.y.text = FALSE, # show bars instead of names in text annotations
# in legend of risk table
# palette = "uchicago", # change colors to be pretty
log.rank.weights = "1", # Pval using log-rank test
pval.method.coord = c(120,90), # location of p-value text
pval.method.size = 3, # p-val text size
# title = "KM Curve - Race",
fun = "pct" #show survival function as percentage
)
# KM curve plot for race adding survreg predictions
gg.km.by.race.exp <- gg.km.by.race.exp$plot + geom_line(data=df_long, aes(x=time, y=y, group=race))
gg.km.by.race.exp
```
#### Goodness of fit \
Parametric models such as the exponential model allow for multivariate analysis. Given this, parameter selection is needed to find the best combination of parameters to predict or characterize survival. A backwards selection approach is utilized here to select parameters, in which all predictor variables were included in the model, the Akaike Information Criterion (AIC) for the model was recorded, and then the least significant parameter was dropped from the model until no parameters remain. The results of this process are in Table 4 below. This approach selected a model with four variables: race, smoking, years of education, and year child was born.
```{r, echo=FALSE}
#Table 4
summary_expAIC <- tibble(
"# Parameters" = c(8:0),
"AIC" = expAIC,
"Equation" = c(
"SurvObj ~ ybirth + yschool + ... + agemth + pc3mth",
"SurvObj ~ ybirth + yschool + ... + alcohol + agemth",
"SurvObj ~ ybirth + yschool + ... + poverty + alcohol",
"SurvObj ~ ybirth + yschool + smoke + race + poverty",
"SurvObj ~ ybirth + yschool + smoke + race", # lowest AIC
"SurvObj ~ ybirth + yschool + smoke",
"SurvObj ~ ybirth + yschool",
"SurvObj ~ ybirth",
"SurvObj ~ 1"
)
)
summary_expAIC %>%
knitr::kable(digits=5,caption="AIC of Backward selection of exponential survival model")
```
#### Parameter Estimates \
The parameter estimates for the best fit exponential model are shown in Table 5. We see p-values below the 0.05 threshold for all parameters except for where race of the mother is black has a p-value of 0.0928.
```{r, echo=FALSE}
#Table 5
summary_expreg <- summary(expreg_less4)
best.exp.model <- expreg_less4
summary_expreg$table %>%
knitr::kable(digits=4,caption = "Exponential Model: SurvObj ~ race + smoke + yschool + ybirth")#,caption="Exponential survival model with lowest AIC")
```
The AFT interpretation of this model suggests that black mothers average duration of breast feeding may only be $e^{-0.173}$ = `r toString(round(exp(-.173),2))` of that of white mothers (p-value = 0.093). For mothers whose race is classified as "other", the average duration of breast feeding is $e^{-0.310}$ = `r toString(round(exp(-.310),2))` of white mothers. Mothers who reported smoking at time of child birth had `r toString(round(exp(-.267),2))` shorter breast feeding duration compared to non-smokers. On the other hand, every year of additional education the mother had attained corresponded to a $e^{0.05}$ = `r toString(round(exp(.05),2))`. In other words, mothers showed a 5% increase in average duration of breast feeding for each additional year of schooling they completed, holding all other parameters constant. There also appeared to be a significant trend of child birth year effecting the duration of breastfeeding. In this model, each year later that the child was born resulted in reduction by a factor of `r toString(round(exp(-.0794),2))` in average breast feeding time.
As mentioned above, the exponential model has a unique characteristic that it can be described as an AFT model or as a PH model. By dividing the negative of the coefficients of the AFT model in Table 5 by its scale parameter (1 in the case of the exponential model), one can find the proportional hazards for each variable, shown in Table S?.
### Weibull
The Weibull model is a generalization of the exponential model that is widely used in survival analysis. The hazard and survival functions for this model is
$$
h(t) = \alpha \lambda t^{\alpha -1}
$$
and
$$
S(t) = e^{-\lambda t^\alpha}
$$
where $h(t)$ is the hazard function and $S(t)$ is the survival function. The shape parameter $\alpha$ can be thought of as a baseline log-hazard, while $\lambda$ is the rate parameter as in the exponential distribution. These are the AFT parameterizations of the Weibull model. Note that the the hazard is monotonic for this model. When $\alpha=1$, the Weibull model is the exponential model.
#### Assumptions of the Weibull survival model
One assumption of the Weibull survival model is that the the hazard rate is monotonic. Additionally, when working as an AFT model, the differences between groups should be able to be represented by only an acceleration in aging by a constant. Part of this is that KM curves should not cross, an assumption that is reasonably held in this data set. Figure 5 shows the Weibull model provides a fit to the data comparable to that of the exponential model above. Trends between the different racial groups correspond to regression outputs below showing similar survival for these three groups.
```{r WeibullAIC, echo=FALSE}
#prep Weibull
# Weibull parametric model
# create vector of AIC values for Weibull
WeibullAIC <- rep(0.0, 9)
## include all parameters (diy backselection)
Weibullreg_all <- survreg(SurvObj ~ race + poverty + smoke + alcohol + agemth + ybirth +
yschool + pc3mth, bfeed, dist="weibull")
WeibullAIC[1] <- AIC(Weibullreg_all)
# summary(Weibullreg_all) # has highest p-val: pc3mth
Weibullreg_less1 <- survreg(SurvObj ~ race + poverty + smoke + alcohol + agemth + ybirth +
yschool, bfeed, dist="weibull")
WeibullAIC[2] <- AIC(Weibullreg_less1)
# summary(Weibullreg_less1) # has highest p-val: agemth
Weibullreg_less2 <- survreg(SurvObj ~ race + poverty + smoke + alcohol + yschool + ybirth,
bfeed, dist="weibull")
WeibullAIC[3] <- AIC(Weibullreg_less2)
# summary(Weibullreg_less2) # has highest p-val: alcohol
Weibullreg_less3 <- survreg(SurvObj ~ race + poverty + smoke + yschool + ybirth,
bfeed, dist="weibull")
WeibullAIC[4] <- AIC(Weibullreg_less3)
# summary(Weibullreg_less3) # has highest p-val: poverty
Weibullreg_less4 <- survreg(SurvObj ~ race + smoke + yschool + ybirth,
bfeed, dist="weibull")
WeibullAIC[5] <- AIC(Weibullreg_less4)
# summary(Weibullreg_less4) # has highest p-val: race
Weibullreg_less5 <- survreg(SurvObj ~ smoke + yschool + ybirth,
bfeed, dist="weibull")
WeibullAIC[6] <- AIC(Weibullreg_less5)
# summary(Weibullreg_less5) # has highest p-val: smoke
Weibullreg_less6 <- survreg(SurvObj ~ yschool + ybirth,
bfeed, dist="weibull")
WeibullAIC[7] <- AIC(Weibullreg_less6)
# summary(Weibullreg_less6) # has highest p-val: yschool
Weibullreg_less7 <- survreg(SurvObj ~ ybirth,
bfeed, dist="weibull")
WeibullAIC[8] <- AIC(Weibullreg_less7)
# summary(Weibullreg_less7)
Weibullreg_less8 <- survreg(SurvObj ~ 1,
bfeed, dist="weibull")
WeibullAIC[9] <- AIC(Weibullreg_less8)
# summary(Weibullreg_less8)
```
```{r Weibull, echo=FALSE, fig.cap="Comparing KM curve for duration of breastfeeding according to mothers' race with the Weibull regression model."}
# figure 5
# predicted curves based on exponential best model
pred.W1 = predict(Weibullreg_less4, newdata=list(race="white", smoke = "no", yschool=12, ybirth=1982),type="quantile",p=seq(.01,.99,by=.01))
pred.W2 = predict(Weibullreg_less4, newdata=list(race="black", smoke = "no", yschool=12, ybirth=1982),type="quantile",p=seq(.01,.99,by=.01))
pred.W3 = predict(Weibullreg_less4, newdata=list(race="other", smoke = "no", yschool=12, ybirth=1982),type="quantile",p=seq(.01,.99,by=.01))
# put preds together in tidy df
df = data.frame(y=seq(99,1,by=-1), race_white=pred.W1, race_black=pred.W2, race_other = pred.W3)
df_long = gather(df, key= "race", value="time", -y)
#km fit race
km.by.race <- survfit(SurvObj ~ race, data = bfeed)
#KM curve according to race
gg.km.by.race.Weibull <- ggsurvplot(
km.by.race, # survfit object with calculated statistics.
data = bfeed, # data used to fit survival curves.
# risk.table = TRUE, # show risk table.
pval = TRUE, # show p-value of log-rank test.
pval.coord = c(120,80), # location of pval
pval.method = TRUE, # show type of pval shown
conf.int = TRUE, # show confidence intervals for
# point estimates of survival curves.
xlim = c(0,200), # present narrower X axis, but not affect
# survival estimates.
xlab = "Time in weeks", # customize X axis label.
break.time.by = 20, # break X axis in time intervals by 500.
ggtheme = theme_minimal(),# customize plot and risk table with a theme.
risk.table.y.text.col = T, # colour risk table text annotations.
risk.table.y.text = FALSE, # show bars instead of names in text annotations
# in legend of risk table
# palette = "uchicago", # change colors to be pretty
log.rank.weights = "1", # Pval using log-rank test
pval.method.coord = c(120,90), # location of p-value text
pval.method.size = 3, # p-val text size
# title = "KM Curve - Race",
fun = "pct" #show survival function as percentage
)
# KM curve plot for race adding survreg predictions
gg.km.by.race.Weibull$plot <- gg.km.by.race.Weibull$plot + geom_line(data=df_long, aes(x=time, y=y, group=race))
gg.km.by.race.Weibull
```
#### Goodness of fit
The same backwards selection approach is utilized here as the exponential model to select parameters and compare AIC scores. The results of this process are in Table 6 below. The Weibull model also selected the model with the same four variables: race, smoking, years of education, and year child was born.
```{r, echo=FALSE}
# Table 6
summary_WeibullAIC <- tibble(
"# Parameters" = c(8:0),
"AIC" = WeibullAIC,
"Equation" = c(
"SurvObj ~ ybirth + yschool + ... + agemth + pc3mth",
"SurvObj ~ ybirth + yschool + ... + alcohol + agemth",
"SurvObj ~ ybirth + yschool + ... + poverty + alcohol",
"SurvObj ~ ybirth + yschool + smoke + race + poverty",
"SurvObj ~ ybirth + yschool + smoke + race", # lowest AIC
"SurvObj ~ ybirth + yschool + smoke",
"SurvObj ~ ybirth + yschool",
"SurvObj ~ ybirth",
"SurvObj ~ 1"
)
)
summary_WeibullAIC %>%
knitr::kable(digits=5,caption="AIC of Backward selection of Weibull survival model")
```
#### Parameter Estimates
The parameter estimates for the best fit Weibull model are shown in Table 7. We see p-values below the 0.05 threshold for all parameters except for where race of the mother is black has a p-value of 0.0936.
```{r, echo=FALSE}
# Table7
summary_Weibullreg <- summary(Weibullreg_less4)
best.Weibull.model <- Weibullreg_less4
summary_Weibullreg$table %>%
knitr::kable(digits=4,caption = "Weibull Model: SurvObj ~ race + smoke + yschool + ybirth")#,caption="Exponential survival model with lowest AIC")
```
The parameter estimates for the Weibull model are extremely similar to the parameters of the exponential distribution, which one would expect given the log(scale) parameter for the fit is very close to zero corresponding to $\alpha$ value very close to one. The Weibull model can also be described as an AFT model or as a PH model. Given these values are very similar to those of the exponential model, these values have not been included in this analysis.
The finding that the generalization of the exponential distribution, the Weibull, closely matches the results of the exponential itself supports that the constant hazard assumption of the exponential is close to the optimal monotonic hazard function to fit the data.
### Lognormal
The lognormal model is useful for modeling data with a hump-shaped hazard curve. This shape of hazard curve is not the most common; however, the lognormal model is a very useful model for such scenarios. The hazard and survival functions for $X \sim lognormal(\mu,\sigma)$ the lognormal model are
$$
S(t) = Pr(T > t) = Pr(ln(X) > ln(X)) = 1 - \Phi(\frac{ln(x) - \mu}{\sigma})
$$
and
$$
h(t) = \frac{(\frac{1}{x\sigma})\phi(\frac{ln(x)}{\sigma})}{\Phi(\frac{-ln(x)}{\sigma})}
$$
with $x > 0, \sigma > 0$. Here $h(t)$ the hazard function and $S(t)$ the survival function. Here $\phi$ represents the probability density function of the normal distribution while $\Phi$ is the cumulative distribution function of the normal distribution. The distribution of $ln(X) \sim N(\mu,\sigma)$ with mean $\mu$ and standard deviation $\sigma$.
#### Assumptions of the lognormal survival model \
The lognormal survival model assumes that the time-to-event variable is lognormally distributed. The duration variable of the breastfeeding data set is fairly well represented by a normal distribution following log-transformation (Supplementary Figure S2). As an AFT model, differences between groups should be able to be modeled as accelerated time. Figure 5 shows the lognormal model provides a fit to the data qualitatively similar to the other parametric models above. Trends between the different racial groups correspond to regression outputs below showing similar survival for these three groups.
```{r, echo=FALSE}
#prep for lognorm
# lognormal parametric model
# create vector of AIC values for lognormal
lognormalAIC <- rep(0.0, 9)
## include all parameters (diy backselection)
lognormalreg_all <- survreg(SurvObj ~ race + poverty + smoke + alcohol + agemth + ybirth +
yschool + pc3mth, bfeed, dist="lognormal")
lognormalAIC[1] <- AIC(lognormalreg_all)
# summary(lognormalreg_all) # has highest p-val: pc3mth
lognormalreg_less1 <- survreg(SurvObj ~ race + poverty + smoke + alcohol + agemth + ybirth +
yschool, bfeed, dist="lognormal")
lognormalAIC[2] <- AIC(lognormalreg_less1)
# summary(lognormalreg_less1) # has highest p-val: agemth
lognormalreg_less2 <- survreg(SurvObj ~ race + poverty + smoke + alcohol + yschool + ybirth,
bfeed, dist="lognormal")
lognormalAIC[3] <- AIC(lognormalreg_less2)
# summary(lognormalreg_less2) # has highest p-val: alcohol
lognormalreg_less3 <- survreg(SurvObj ~ race + poverty + smoke + yschool + ybirth,
bfeed, dist="lognormal")
lognormalAIC[4] <- AIC(lognormalreg_less3)
# summary(lognormalreg_less3) # has highest p-val: poverty
lognormalreg_less4 <- survreg(SurvObj ~ race + smoke + yschool + ybirth,
bfeed, dist="lognormal")
lognormalAIC[5] <- AIC(lognormalreg_less4)
# summary(lognormalreg_less4) # has highest p-val: race
lognormalreg_less5 <- survreg(SurvObj ~ smoke + yschool + ybirth,
bfeed, dist="lognormal")
lognormalAIC[6] <- AIC(lognormalreg_less5)
# summary(lognormalreg_less5) # has highest p-val: smoke
lognormalreg_less6 <- survreg(SurvObj ~ yschool + ybirth,
bfeed, dist="lognormal")
lognormalAIC[7] <- AIC(lognormalreg_less6)
# summary(lognormalreg_less6) # has highest p-val: yschool
lognormalreg_less7 <- survreg(SurvObj ~ ybirth,
bfeed, dist="lognormal")
lognormalAIC[8] <- AIC(lognormalreg_less7)
# summary(lognormalreg_less7)
lognormalreg_less8 <- survreg(SurvObj ~ 1,
bfeed, dist="lognormal")
lognormalAIC[9] <- AIC(lognormalreg_less8)
# summary(lognormalreg_less8)
```
```{r lognormal, echo=FALSE, fig.cap="Comparing KM curve for duration of breastfeeding according to mothers' race with the Weibull regression model."}
#fig6
# predicted curves based on exponential best model
pred.lognormal1 = predict(lognormalreg_less4, newdata=list(race="white", smoke = "no", yschool=12, ybirth=1982),type="quantile",p=seq(.01,.99,by=.01))
pred.lognormal2 = predict(lognormalreg_less4, newdata=list(race="black", smoke = "no", yschool=12, ybirth=1982),type="quantile",p=seq(.01,.99,by=.01))
pred.lognormal3 = predict(lognormalreg_less4, newdata=list(race="other", smoke = "no", yschool=12, ybirth=1982),type="quantile",p=seq(.01,.99,by=.01))
# put preds together in tidy df
df = data.frame(y=seq(99,1,by=-1), race_white=pred.lognormal1, race_black=pred.lognormal2, race_other = pred.lognormal3)
df_long = gather(df, key= "race", value="time", -y)
#km fit race
km.by.race <- survfit(SurvObj ~ race, data = bfeed)
#KM curve according to race
gg.km.by.race.lognormal <- ggsurvplot(
km.by.race, # survfit object with calculated statistics.
data = bfeed, # data used to fit survival curves.
# risk.table = TRUE, # show risk table.
pval = TRUE, # show p-value of log-rank test.
pval.coord = c(120,80), # location of pval
pval.method = TRUE, # show type of pval shown
conf.int = TRUE, # show confidence intervals for
# point estimates of survival curves.
xlim = c(0,200), # present narrower X axis, but not affect
# survival estimates.
xlab = "Time in weeks", # customize X axis label.
break.time.by = 20, # break X axis in time intervals by 500.
ggtheme = theme_minimal(),# customize plot and risk table with a theme.
risk.table.y.text.col = T, # colour risk table text annotations.
risk.table.y.text = FALSE, # show bars instead of names in text annotations
# in legend of risk table
# palette = "uchicago", # change colors to be pretty
log.rank.weights = "1", # Pval using log-rank test
pval.method.coord = c(120,90), # location of p-value text
pval.method.size = 3, # p-val text size
# title = "KM Curve - Race",
fun = "pct" #show survival function as percentage
)
# KM curve plot for race adding survreg predictions
gg.km.by.race.lognormal$plot <- gg.km.by.race.lognormal$plot + geom_line(data=df_long, aes(x=time, y=y, group=race))
gg.km.by.race.lognormal
```
#### Goodness of fit \
The same backwards selection approach is utilized here as the above models to select parameters and compare AIC scores. The results of this process are in Table 8 below. The lognormal model with the lowest AIC score had the same four variables: race, smoking, years of education, and year child was born.
```{r, echo=FALSE}
#table8
summary_lognormalAIC <- tibble(
"# Parameters" = c(8:0),
"AIC" = lognormalAIC,
"Equation" = c(
"SurvObj ~ ybirth + yschool + ... + agemth + pc3mth",
"SurvObj ~ ybirth + yschool + ... + alcohol + agemth",
"SurvObj ~ ybirth + yschool + ... + poverty + alcohol",
"SurvObj ~ ybirth + yschool + smoke + race + poverty",
"SurvObj ~ ybirth + yschool + smoke + race", # lowest AIC
"SurvObj ~ ybirth + yschool + smoke",
"SurvObj ~ ybirth + yschool",
"SurvObj ~ ybirth",
"SurvObj ~ 1"
)
)
summary_lognormalAIC %>%
knitr::kable(digits=5,caption="AIC of Backward selection of lognormal survival model")
```
#### Parameter Estimates \
The parameter estimates for the best fit lognormal model are shown in Table 9. We see p-values below the 0.05 threshold for all parameters except for where race of the mother is black has a p-value of 0.2062.
```{r, echo=FALSE}
#table9
summary_lognormalreg <- summary(lognormalreg_less4)
best.lognormal.model <- lognormalreg_less4
summary_lognormalreg$table %>%
knitr::kable(digits=4,caption = "Lognormal Model: SurvObj ~ race + smoke + yschool + ybirth")#,caption="Exponential survival model with lowest AIC")
```
The AFT interpretation of this lognormal model suggests that black mothers average duration of breast feeding may only be $e^{-0.1485}$ = `r toString(round(exp(-0.1485),2))` of that of white mothers, less of a difference than predicted by the previous models. For mothers whose race is classified as "other", the average duration of breast feeding is $e^{-0.2717}$ = `r toString(round(exp(-0.2717),2))` of white mothers, slightly less than the difference predicted from the above models. Mothers who reported smoking at time of child birth had `r toString(round(exp(-0.2252),2))` shorter breast feeding duration compared to non-smokers according to this model. On the other hand, every year of additional education the mother had attained corresponded to a $e^{0.0884}$ = `r toString(round(exp(0.0884),2))`. In other words, mothers showed around a **9%** increase in average duration of breast feeding for each additional year of schooling they completed. There also appeared to be a significant trend of child birth year effecting the duration of breastfeeding. In this model, each year later that the child was born resulted in reduction by a factor of `r toString(round(exp(-.0735),2))` in average breast feeding time. Unlike the exponential and Weibull models above, the exponential model can only work as a AFT model.
#### Comparison of Parametric Models \
The results of the different parametric models converged on similar results for the best group of predictors and coefficients for those predictors. While any of the models would be reasonable approximations for modeling, it is possible to compare AIC values for all of the models to see which model is optimal. In Table 10 it can be seen that the model with the lowest AIC value among those tested is the lognormal model utilizing the year of birth, years of mother education, smoking, and race to predict duration of breastfeeding.
```{r, echo=FALSE}
#table10
lognormalAIC <- round(lognormalAIC,0)
lognormalAIC[5] <- paste("\\color{red}{",round(lognormalAIC[5], 2), "}")
AIC_all <- tibble(
"# Parameters" = c(8:0),
"Exponential" = round(expAIC,0),
"Weibull" = round(WeibullAIC,0),
"Lognormal" = (lognormalAIC),
"Equation" = c(
"SurvObj ~ ybirth + yschool + ... + agemth + pc3mth",
"SurvObj ~ ybirth + yschool + ... + alcohol + agemth",
"SurvObj ~ ybirth + yschool + ... + poverty + alcohol",
"SurvObj ~ ybirth + yschool + smoke + race + poverty",
"SurvObj ~ ybirth + yschool + smoke + race", # lowest AIC
"SurvObj ~ ybirth + yschool + smoke",
"SurvObj ~ ybirth + yschool",
"SurvObj ~ ybirth",
"SurvObj ~ 1"
)
)
AIC_all %>%
knitr::kable(digits=4,
caption = "AIC values across parametric models")
```
While this model is ideal, we can see in Table S3 that the parameter estimates across the 3 models are highly similar. In theory, these models have the benefit of allowing for predicting survival beyond the times shown in this study. In the context of breastfeeding; however, the time of greatest interest is the first 6 months of life so this aspect of parametric models is not as useful in this scenario. Furthermore, the baseline hazard functions are by definition parametric in these models, so if the data do not closely match any of these distributions, a non-parametric baseline hazard functions would better elucidate patterns in the data.
## Cox Proportional Hazards Model
In contrast to the parametric models above, the Cox PH model is semi-parametric. The baseline hazard of the model is nonparametric. While the parametric models above fit the data reasonably well, it is worth comparing their findings to that of perhaps the gold standard of time-to-event analysis: the Cox Proportional Hazards model, specifically this model will be the Cox PH model of time-independent variables. The general hazard function for p predictors is
$$
h(t, X, \beta) = h_0(t)\times e^{\beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p}
$$
where $h(t)$ is the hazard function, $h_0(t)$ is the non-parametric baseline hazard function, $beta_i$ is the coefficients for the $i^{th}$ predictor $X_i$. In the context of this dataset, $X_1$ could be the smoking variable, where $x_1 = 0$ is nonsmoking and $x_1=1$ is smoking, the effect of this variable on the hazard function would then be determined by $beta_1$. When working with proportional hazards, the hazard ratio (HR) is a useful tool. The hazard ratio is defined as
$$
HR(t, x_1,x_2) = \frac{h(t,X=1,\beta)}{h(t,X=0,\beta)} = \frac{h_0(t)\Gamma(X=1,\beta)}{h_0(t)\Gamma(X=0,\beta)} = \frac{\Gamma(X=1,\beta)}{\Gamma(X=0,\beta)} = e^{\beta}
$$