---
output:
  html_document: default
  pdf_document: default
---
## The MONICA (Multinational MONItoring of trends and determinants in CArdiovascular disease) Analysis
By Soumia Zohra El Mestari
========================================================
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
suppressMessages(library(ISLR))
suppressMessages(library(ggthemes))
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(DAAG))
suppressMessages(library(gridExtra))
suppressMessages(library(GGally))
suppressMessages(library(data.table))
```
```{r echo=FALSE, Load_the_Data,warning=FALSE,message=FALSE}
# Load the data.
# The dataset ships with the DAAG package, so if DAAG is installed
# the following instruction loads it.
data("monica")
# If DAAG is not installed, uncomment the line below to read the CSV copy
# of the dataset attached to this file (categorical columns are read as factors).
#monica <- read.csv("monicaDataSet.csv", stringsAsFactors = TRUE)
```
The MONICA dataset (Multinational MONItoring of trends and determinants in CArdiovascular disease) comes from a WHO (World Health Organization) project. The main purpose of the project was to monitor trends in cardiovascular diseases and to relate them to changes in risk factors in the population over a ten-year period. It was set up to explain the diverse trends in cardiovascular disease mortality observed from the 1970s onwards.
The dataset contains 6357 rows and the following 12 variables: <br>
**outcome** : mortality outcome, a factor with levels live, dead <br>
**age** : age at onset<br>
**sex** : m = male, f = female<br>
**hosp** : y = hospitalized, n = not hospitalized<br>
**yronset** : year of onset<br>
**premi** : previous myocardial infarction event, a factor with levels y, n, nk (not known) <br>
**smstat** : smoking status, a factor with levels c (current), x (ex-smoker), n (non-smoker), nk (not known)<br>
**diabetes** : a factor with levels y, n, nk (not known) <br>
**highbp** : high blood pressure, a factor with levels y, n, nk (not known)<br>
**hichol** : high cholesterol, a factor with levels y, n, nk (not known)<br>
**angina** : a factor with levels y, n, nk (not known). Angina is described by the American Heart Association as “…chest pain or discomfort caused when your heart muscle doesn’t get enough oxygen-rich blood”<br>
**stroke** : whether the patient had a stroke, a factor with levels y, n, nk (not known) <br>
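
Several of the plots in this report relabel factor levels by position, so it can be worth confirming the order of the levels first. The following is a minimal sketch rather than part of the original analysis: `factor_levels()` and the toy data frame are hypothetical names with made-up values, and with DAAG installed one would call `factor_levels(monica)` instead.

```r
# Hypothetical helper: list the levels of every factor column in a data frame.
factor_levels <- function(df) {
  factors <- Filter(is.factor, df)
  vapply(factors, function(f) paste(levels(f), collapse = ", "), character(1))
}

# Toy data frame (made-up values) that mimics the coding used in the dataset.
toy <- data.frame(
  age   = c(45, 61, 67),
  hosp  = factor(c("y", "n", "y"), levels = c("y", "n")),
  premi = factor(c("n", "nk", "y"), levels = c("y", "n", "nk"))
)

toy_levels <- factor_levels(toy)
print(toy_levels)
```

If the printed order differs from the order of the labels passed to `scale_x_discrete()`, the labels would end up on the wrong bars.
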
# Univariate Plots Section
**An overview of the variables**
```{r echo=FALSE, Univariate_Plots1,warning=FALSE, include=FALSE,message=FALSE }
summary(monica)
```
**1st Exploration** : Let's explore the ages of people in this dataset
```{r echo=FALSE, Univariate_Plots2,warning=FALSE,message=FALSE}
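ggplot(aes(x = age), data = monica) +
  # With a sqrt-transformed axis the bins are computed on the transformed
  # scale, so the bin width is in sqrt(age) units (roughly 3 to 4 years here).
  geom_histogram(binwidth = 0.25, col = "black", fill = "sienna1") +
  ggtitle("The distribution of individuals' ages in this dataset") +
  # A separate xlim() would be replaced by scale_x_sqrt(), so it is omitted.
  scale_x_sqrt()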
```
> People in this dataset are all adults aged 35 to 69, with the majority of them over 60 (statistically, the distribution is negatively skewed).<br>
**2nd Exploration** : Let's explore the counts of the mortality outcome <br>
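
The claim of negative skew can be checked numerically. Below is a minimal sketch using a moment-based skewness estimate; `sample_skewness()` is a hypothetical helper and the ages are made-up values, not taken from the dataset. On the real data one would call `sample_skewness(monica$age)`.

```r
# Moment-based sample skewness: negative values indicate a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up ages with a few younger cases and many older ones (long left tail).
toy_ages <- c(36, 48, 59, 62, 64, 65, 66, 67, 68, 69)
toy_skew <- sample_skewness(toy_ages)
print(toy_skew)
```

A value below zero supports the "negatively skewed" reading; a value near zero would suggest a roughly symmetric distribution.
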
```{r echo=FALSE, Univariate_Plots3,warning=FALSE,message=FALSE}
ggplot(aes(x = outcome),data = monica) +
geom_bar(fill = c("yellowgreen","grey")) +
ggtitle("The mortality status ")
```
> There are fewer deaths than survivals.<br>
**3rd exploration** : sex distribution in our dataset
```{r echo=FALSE, Univariate_Plots4,warning=FALSE,message=FALSE }
ggplot(aes(x = sex),data = monica) +
geom_bar(fill = c("blue","pink")) +
scale_x_discrete(labels = c(m = "Males", f = "Females")) +
ggtitle("The sex status ")
```
> Note that there are more males than females in this dataset, so sex may influence the outcome (we will explore this possibility later).<br>
**4th exploration** : hospitalized vs not hospitalized cases .
```{r echo=FALSE, Univariate_Plots5,warning=FALSE,message=FALSE }
ggplot(aes(x = hosp),data = monica) +
geom_bar(fill = c("orange","yellow")) +
scale_x_discrete(labels = c(y = "Hospitalised", n = "Not Hospitalised")) +
ggtitle("Hospitalized Vs not hospitalized individuals ")
```
> Most people were hospitalized in this dataset <br>
**5th exploration** : smoking status <br>
```{r echo=FALSE, Univariate_Plots6,warning=FALSE,message=FALSE }
ggplot(aes(x = smstat),data = monica) +
geom_bar(fill = c("wheat4","tomato2","olivedrab3","grey45")) +
scale_x_discrete(labels = c("Current\nSmokers", "ex-smokers","non-smokers"
,"not known")
) +
ggtitle("smoking status ")
```
> Most people here have smoking experience, either previous or current.
**6th exploration: medical status** : In the following I will explore the rates of different diseases and events: previous myocardial infarction event, diabetes, high blood pressure, high cholesterol, angina, and stroke.
```{r echo=FALSE, Univariate_Plots7,warning=FALSE,message=FALSE }
premi_plot <- ggplot(aes(x = premi) , data = monica) +
geom_bar(fill = c("blue","tan","dodgerblue1")) +
scale_x_discrete(labels = c("Had premi", "hadn't Premi","Not known")) +
ggtitle("Previous myocardial infarction event rate")
diabetes_plot <- ggplot(aes(x = diabetes) , data = monica) +
geom_bar(fill = c("maroon","maroon2","maroon4")) +
scale_x_discrete(labels = c("Have diabetes", "Haven't diabetes"
,"Not known")
) +
ggtitle("Diabetes rates")
highbp_plot <- ggplot(aes(x = highbp) , data = monica) +
geom_bar(fill = c("sienna1","sienna4","sienna2")) +
scale_x_discrete(labels = c("Have High \n blood pressure",
"Haven't High \n blood pressure","Not known")
) +
ggtitle("High blood pressure status")
hichol_plot <- ggplot(aes(x = hichol) , data = monica) +
geom_bar(fill = c("orange4","orange3","orange2")) +
scale_x_discrete(labels = c("Have High \n cholesterol",
"Haven't High \n cholesterol","Not known")
) +
ggtitle("High Cholesterol status ")
angina_plot <- ggplot(aes(x = angina) , data = monica) +
geom_bar(fill = c("chocolate4","chocolate2","chocolate1")) +
scale_x_discrete(labels = c("Angina", "Not angina","Not known")) +
ggtitle("angina status ")
stroke_plot <- ggplot(aes(x = stroke) , data = monica) +
geom_bar(fill = c("red4","red2","red1")) +
scale_x_discrete(labels = c("Stroke", "No stroke","Not known")) +
ggtitle("stroke status ")
grid.arrange(premi_plot,diabetes_plot,highbp_plot,hichol_plot,angina_plot,
stroke_plot ,ncol = 2)
```
> Generally speaking, high blood pressure is common in our population, while diabetes and stroke rates are low.
# Univariate Analysis
### What is the structure of your dataset?
The dataset contains 6357 records (cases) with 12 variables (outcome, age, sex, hosp, yronset, premi, smstat, diabetes, highbp, hichol, angina, stroke).<br>
* The data were collected from 1989 to 1993 <br>
* All the individuals were adults, and most of them were over 60 years old.<br>
* The number of males is roughly 4 times the number of females in this dataset. <br>
* Most of our cases had smoking experience, either previous or current <br>
* Nearly 70% of the individuals were hospitalized <br>
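
Shares such as the male-to-female ratio or the hospitalization rate can be read off a frequency table. This is a minimal sketch on made-up data; `level_shares()` is a hypothetical helper, and on the real data the calls would be `level_shares(monica, "sex")` and `level_shares(monica, "hosp")`.

```r
# Hypothetical helper: share of each level of one categorical column.
level_shares <- function(df, column) {
  prop.table(table(df[[column]]))
}

# Toy data frame with made-up values (not the MONICA figures).
toy <- data.frame(
  sex  = factor(c("m", "m", "m", "f"), levels = c("m", "f")),
  hosp = factor(c("y", "y", "y", "n"), levels = c("y", "n"))
)

sex_shares <- level_shares(toy, "sex")
print(sex_shares)

# Ratio of males to females in the toy data.
male_female_ratio <- as.numeric(sex_shares["m"]) / as.numeric(sex_shares["f"])
print(male_female_ratio)
```
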
### What is/are the main feature(s) of interest in your dataset?
The main features of interest in this dataset are the mortality outcome (outcome) and age. I want to determine which features best predict survival for cardiovascular patients.<br>
### What other features in the dataset do you think will help support your investigation?
All the other features (stroke, angina, ...) may influence survival. <br>
### Did you create any new variables from existing variables in the dataset?
No, I haven't created any new variables. <br>
### Of the features you investigated, were there any unusual distributions?
I applied a square-root scale to the age axis to help identify and emphasize the most frequent ages of these individuals.
# Bivariate Plots Section
**1st Life status over the years** :
In this section we will explore the number of deaths and survivals per year.<br>
```{r echo=FALSE, Bivariate_Plots1,warning=FALSE,message=FALSE }
# Count the cases per outcome and year of onset
df.per_year <- monica %>%
  group_by(outcome, yronset) %>%
  summarise(n = n())
ggplot(aes(x = yronset, y = n), data = df.per_year) +
  geom_line(aes(color = outcome)) +
  xlab('Years') +
  ylab('count') +
  labs(color = 'Life status') +
  ggtitle("The evolution of the life status of cardiovascular patients
from 1989 to 1993")
```
> We can see that, in general, the number of deaths tended to decrease, which is a good indicator. <br>
**2nd Outcome by age**
```{r echo=FALSE, Bivariate_Plots2,warning=FALSE,message=FALSE }
ggplot(aes(x = age),data = monica) +
geom_histogram(aes(fill = outcome)) +
xlim(c(35,70)) +
xlab('Age') +
ylab('people count') +
labs(fill = 'Mortality outcome') +
theme_gdocs() +
ggtitle("Outcomes by age")
```
> There is no clear overall trend here, but among individuals under 55 more people generally survived, while after 55 the death rate increases.
**3rd Smoking Vs outcome**
```{r echo=FALSE, Bivariate_Plots3,warning=FALSE,message=FALSE}
ggplot(monica,aes(smstat,group = outcome)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..))) +
scale_y_continuous(labels = scales::percent) +
ylab("Relative Frequencies") +
xlab("Smoking status ") +
facet_grid(outcome ~ .) +
scale_x_discrete(labels = c("Current\nSmokers", "ex-smokers",
"non-smokers","not known")
) +
labs(title = "Patient Smoking Status",
subtitle = expression(paste(italic("by patient outcome")))) +
theme_economist()
```
> There is a large proportion of missing values (not known) in the deceased group. With this much missing data the deceased group is misrepresented, so this variable won't play a significant role in discovering trends. <br>
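
The size of the "not known" problem can be quantified per outcome group. The sketch below is an illustration on made-up data, not part of the original analysis; `nk_share_by_outcome()` is a hypothetical helper that assumes the data frame has an `outcome` column and codes missing values as `"nk"`, as described in the variable list.

```r
# Hypothetical helper: share of "nk" (not known) values within each outcome group.
nk_share_by_outcome <- function(df, column) {
  tapply(df[[column]] == "nk", df$outcome, mean)
}

# Toy data frame with made-up values.
toy <- data.frame(
  outcome = factor(c("live", "live", "live", "live", "dead", "dead", "dead", "dead"),
                   levels = c("live", "dead")),
  smstat  = factor(c("c", "x", "n", "nk", "nk", "nk", "nk", "c"),
                   levels = c("c", "x", "n", "nk"))
)

nk_shares <- nk_share_by_outcome(toy, "smstat")
print(nk_shares)
```

Running the same helper over `smstat`, `highbp`, `hichol`, `angina`, `diabetes`, `stroke` and `premi` on the real data would show which variables are most affected in the deceased group.
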
**4th High blood pressure Vs outcome**
```{r echo=FALSE, Bivariate_Plots4,warning=FALSE,message=FALSE}
ggplot(monica,aes(highbp,group = outcome)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..))) +
scale_y_continuous(labels = scales::percent) +
ylab("Relative Frequencies") +
xlab("High blood Pressure ") +
facet_grid(outcome ~ .) +
scale_x_discrete(labels = c("Have High \n blood pressure",
"Haven't High \n blood pressure","Not known")
) +
labs(title = "Patient's high blood pressure Status",
subtitle = expression(paste(italic("by patient outcome")))) +
theme_economist()
```
> As with smoking status, a huge portion of the data is missing in one of the groups (the deceased group). This makes it difficult to say whether high blood pressure plays a significant role or not, so this variable needs to be investigated in more depth.
**5th Angina Vs outcome**
```{r echo=FALSE, Bivariate_Plots5,warning=FALSE,message=FALSE}
ggplot(monica,aes(angina,group = outcome)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..))) +
scale_y_continuous(labels = scales::percent) +
ylab("Relative Frequencies") +
xlab("Angina ") +
facet_grid(outcome ~ .) +
scale_x_discrete(labels = c("Angina", "Not angina","Not known")) +
labs(title = "Patient's Angina status ",
subtitle = expression(paste(italic("by patient outcome")))) +
theme_economist()
```
> In the living group, most patients did not have angina, so this variable may be useful. Note that, as with the previous variables, the deceased group suffers from a large amount of missing data.
**6th Cholesterol Vs outcome**
```{r echo=FALSE, Bivariate_Plots6,warning=FALSE,message=FALSE}
ggplot(monica,aes(hichol,group = outcome)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..))) +
scale_y_continuous(labels = scales::percent) +
ylab("Relative Frequencies") +
xlab("Cholesterol ") +
facet_grid(outcome ~ .) +
scale_x_discrete(labels = c("Have High \n cholesterol",
"Haven't High \n cholesterol","Not known")
) +
labs(title = "Patient's Cholesterol status ",
subtitle = expression(paste(italic("by patient outcome")))) +
theme_economist()
```
> Patients without high cholesterol are heavily represented in both groups, and a large portion of the data is missing in the deceased group, so it is hard to tell whether this variable has an influence or not.
**7th Diabetes Vs outcome**
```{r echo=FALSE, Bivariate_Plots7,warning=FALSE,message=FALSE}
ggplot(monica,aes(diabetes,group = outcome)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..))) +
scale_y_continuous(labels = scales::percent) +
ylab("Relative Frequencies") +
xlab("Diabetes ") +
facet_grid(outcome ~ .) +
scale_x_discrete(labels = c("Have diabetes", "Haven't diabetes",
"Not known")
) +
labs(title = "Patient's Diabetes status ",
subtitle = expression(paste(italic("by patient outcome")))) +
theme_economist()
```
> In both groups those without diabetes are heavily represented, and a good portion of the data is missing in the deceased group. This variable may not be significant in discovering trends.
**8th Stroke Vs outcome**
```{r echo=FALSE, Bivariate_Plots8,warning=FALSE,message=FALSE}
ggplot(monica,aes(stroke,group = outcome)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..))) +
scale_y_continuous(labels = scales::percent) +
ylab("Relative Frequencies") +
xlab("Stroke ") +
facet_grid(outcome ~ .) +
scale_x_discrete(labels = c("Have stroke",
"Haven't stroke","Not known")
) +
labs(title = "Patient's stroke status ",
subtitle = expression(paste(italic("by patient outcome")))) +
theme_economist()
```
> The proportion of survivors who had a stroke is roughly similar to that of those who died, with a huge number of missing data points in the deceased group.<br>
**9th previous myocardial infarction event Vs outcome**
```{r echo=FALSE, Bivariate_Plots9,warning=FALSE,message=FALSE}
ggplot(monica,aes(premi,group = outcome)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..))) +
scale_y_continuous(labels = scales::percent) +
ylab("Relative Frequencies") +
xlab("previous myocardial infarction event ") +
facet_grid(outcome ~ .) +
scale_x_discrete(labels = c("Had premi", "hadn't Premi","Not known")) +
labs(title = "Patient's previous myocardial infarction event status ",
subtitle = expression(paste(italic("by patient outcome")))) +
theme_economist()
```
> The largest portion of survivors had no previous myocardial infarction event. For the deceased group it is hard to tell, because a good portion of the data is missing.
**10th Hospitalization Vs outcome**
```{r echo=FALSE, Bivariate_Plots10,warning=FALSE,message=FALSE}
ggplot(monica,aes(hosp,group = outcome)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..))) +
scale_y_continuous(labels = scales::percent) +
ylab("Relative Frequencies") +
xlab("Hospitalization") +
facet_grid(outcome ~ .) +
scale_x_discrete(labels = c(y = "Hospitalised", n = "Not Hospitalised")) +
labs(title = "Patient's Hospitalization status ",
subtitle = expression(paste(italic("by patient outcome")))) +
theme_economist()
```
> All the survivors were hospitalized, whereas the deceased group contains both hospitalized and non-hospitalized patients, so this variable is clearly associated with the outcome.
**11th Pairwise relationships between variables**
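
One way to back up the word "significant" is a chi-squared test of independence on the hospitalization-by-outcome table. The sketch below uses made-up counts chosen only to mimic the shape described above (no non-hospitalized survivors); it is not a result from the dataset. On the real data the table would be `table(monica$hosp, monica$outcome)`.

```r
# Toy data frame with made-up counts: 60 hospitalized survivors,
# 20 hospitalized deaths and 20 non-hospitalized deaths.
toy <- data.frame(
  outcome = factor(rep(c("live", "dead", "dead"), times = c(60, 20, 20)),
                   levels = c("live", "dead")),
  hosp    = factor(rep(c("y", "y", "n"), times = c(60, 20, 20)),
                   levels = c("y", "n"))
)

# Cross-tabulate hospitalization against outcome.
hosp_by_outcome <- table(hosp = toy$hosp, outcome = toy$outcome)
print(hosp_by_outcome)

# Share of hospitalized / non-hospitalized patients within each outcome group.
print(prop.table(hosp_by_outcome, margin = 2))

# Chi-squared test of independence between hospitalization and outcome.
toy_test <- chisq.test(hosp_by_outcome)
print(toy_test$p.value)
```

A small p-value indicates that hospitalization and outcome are not independent; it does not by itself say that hospitalization causes survival.
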
```{r echo=FALSE, Bivariate_Plots11,warning=FALSE,message=FALSE}
ggpairs(monica ,columns = c(5:9),
lower = list(continuous = wrap("points", color = "red", alpha = 0.5),
combo = wrap("box", color = "orange", alpha = 0.3),
discrete = wrap("facetbar", color = "maroon", alpha = 0.3) ),
diag = list(continuous = wrap("densityDiag", color = "blue", alpha = 0.5)))
```
```{r echo=FALSE, Bivariate_Plots12,warning=FALSE,message=FALSE}
ggpairs(monica ,columns=c(2:8),
lower = list(continuous = wrap("points", color = "red", alpha = 0.5),
combo = wrap("box", color = "orange", alpha = 0.3),
discrete = wrap("facetbar", color = "maroon", alpha = 0.3)),
diag = list(continuous = wrap("densityDiag", color = "blue", alpha = 0.5)))
```
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the investigation
* The number of survivals generally increases over the years while the number of deaths decreases, which is a good sign.<br>
* There is a relationship between hospitalization and the outcome. <br>
* Angina may have a relationship with the outcome but, as with the remaining variables, the missing data makes it hard to tell. <br>
### Did you observe any interesting relationships between the other features?
From the scatter plot matrix we can see:
- Possible positive relationships between high blood pressure and each of diabetes and previous myocardial infarction event <br>
- More females have high blood pressure.<br>
- More males have diabetes. <br>
- Males tend to have more smoking experience (either current or previous) <br>
### What was the strongest relationship you found?
The hospitalization of cardiovascular patients plays a principal role in their survival, so the outcome is strongly associated with hospitalization and with the age of the patient.
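
The association between age and the outcome could be quantified with a logistic regression. The sketch below is a hedged illustration on simulated data, not a model fitted in the original analysis: the coefficients used for the simulation are arbitrary assumptions. Hospitalization is left out on purpose, because if every non-hospitalized case is a death (as the plots suggest) the model would suffer from separation.

```r
# Simulated data (assumption: death probability rises with age on the logit scale).
set.seed(1)
n <- 2000
toy <- data.frame(age = sample(35:69, n, replace = TRUE))
toy$dead <- rbinom(n, size = 1, prob = plogis(-6 + 0.1 * toy$age))

# Logistic regression of death on age.
fit <- glm(dead ~ age, data = toy, family = binomial)

# Odds ratio per additional year of age.
age_odds_ratio <- exp(coef(fit)[["age"]])
print(age_odds_ratio)
```

On the real data the response would be built as `as.integer(monica$outcome == "dead")`; an odds ratio above 1 would mean that the odds of death increase with age.
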
# Multivariate Plots Section
**Deaths Vs hospitalization status** :
```{r echo=FALSE, Multivariate_Plots1,warning=FALSE,message=FALSE}
# Keep only the deaths
hops_vs_death <- subset(monica, outcome != 'live')
# Count the deaths per hospitalization status and year of onset
hops_vs_death <- hops_vs_death %>%
  group_by(hosp, yronset) %>%
  summarise(n = n())
ggplot(aes(x = yronset, y = n), data = hops_vs_death) +
  geom_point(aes(color = hosp)) +
  xlab('Years') +
  ylab('deaths count') +
  labs(color = 'Hospitalization status') +
  theme_stata() +
  scale_colour_stata(scheme = "s1color") +
  ggtitle("The evolution of deaths by hospitalization status from 1989
to 1993")
```
> It can be seen that throughout the years the number of deaths decreased rapidly when the patient was hospitalized.
**Deaths Vs (Angina and hospitalization )**:
```{r echo=FALSE, Multivariate_Plots2,warning=FALSE,message=FALSE}
# Keep only the deaths
Hosp_angina_outcome <- subset(monica, outcome != 'live')
# Count the deaths per hospitalization status, year of onset and angina status
Hosp_angina_outcome <- Hosp_angina_outcome %>%
  group_by(hosp, yronset, angina) %>%
  summarise(n = n())
ggplot(aes(x = yronset, y = n), data = Hosp_angina_outcome) +
  # angina:hosp builds one factor level per angina/hospitalization combination
  geom_point(aes(color = angina:hosp)) +
  xlab('Years') +
  ylab('deaths count') +
  labs(color = 'Angina : hospitalization status ') +
  scale_color_brewer(type = 'div', palette = 'Set1', direction = 1) +
  ggtitle("The evolution of deaths by angina and hospitalization status from 1989 to 1993")
```
> So far, not having angina and being hospitalized resulted in fewer deaths per year, while having angina and not being hospitalized resulted in more deaths.
**Let's explore the deaths for patients having high blood pressure and/or diabetes**:
```{r echo=FALSE, Multivariate_Plots3,warning=FALSE,message=FALSE}
# Keep only the deaths
Hbp_diabetes_outcome <- subset(monica, outcome != 'live')
# Count the deaths per high blood pressure status, year of onset and diabetes status
Hbp_diabetes_outcome <- Hbp_diabetes_outcome %>%
  group_by(highbp, yronset, diabetes) %>%
  summarise(n = n())
ggplot(aes(x = yronset, y = n), data = Hbp_diabetes_outcome) +
  # highbp:diabetes builds one factor level per combination of the two statuses
  geom_point(aes(color = highbp:diabetes)) +
  xlab('Years') +
  ylab('deaths count') +
  labs(color = 'High blood pressure: diabetes status ') +
  scale_color_brewer(type = 'div', palette = 9, direction = 1) +
  labs(title = "Deaths from cardiovascular diseases \n",
       subtitle = expression(
         paste(italic(" By High blood pressure and diabetes status "))))
```
> It can be seen that high blood pressure alone is associated with more deaths than diabetes and high blood pressure together.
**Let's explore the deaths for patients having high blood pressure and/or previous myocardial infarction event**:
```{r echo=FALSE, Multivariate_Plots4,warning=FALSE,message=FALSE}
# Keep only the deaths
Hbp_premi_outcome <- subset(monica, outcome != 'live')
# Count the deaths per high blood pressure status, year of onset and premi status
Hbp_premi_outcome <- Hbp_premi_outcome %>%
  group_by(highbp, yronset, premi) %>%
  summarise(n = n())
ggplot(aes(x = yronset, y = n), data = Hbp_premi_outcome) +
  # highbp:premi builds one factor level per combination of the two statuses
  geom_point(aes(color = highbp:premi)) +
  xlab('Years') +
  ylab('deaths count') +
  labs(color = 'High blood pressure: Premi status ') +
  scale_color_brewer(type = 'div', palette = 'Set2', direction = 1) +
  labs(title = "Deaths from cardiovascular diseases ",
       subtitle = expression(paste(italic(
         "By High blood pressure and previous myocardial infarction event status"
       ))))
```
> It can be seen that high blood pressure alone is associated with more deaths than premi and high blood pressure together, which interestingly recorded fewer deaths than having neither of them! So there must be a stronger influence on deaths (hospitalization status).
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the investigation
Hospitalization plays a key role in decreasing the number of deaths; having high blood pressure is also associated with more deaths.
### Were there any interesting or surprising interactions between features?
The most surprising finding is that patients with both high blood pressure and a previous myocardial infarction event recorded fewer deaths than patients with neither.
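
The comparisons in this section are based on death counts, and a group can record fewer deaths simply because it contains fewer patients. A death rate per group separates the two effects. The sketch below is a hypothetical illustration on made-up counts; `death_rate_by()` is not part of the original analysis and assumes an `outcome` column with the level `"dead"`.

```r
# Hypothetical helper: deaths, cases and death rate for each combination
# of the grouping columns.
death_rate_by <- function(df, group_cols) {
  df$dead <- as.integer(df$outcome == "dead")
  df$case <- 1L
  out <- aggregate(df[c("dead", "case")], by = df[group_cols], FUN = sum)
  out$rate <- out$dead / out$case
  out
}

# Made-up counts: 100 patients with high blood pressure only (30 deaths)
# and 20 patients with high blood pressure and diabetes (10 deaths).
toy <- data.frame(
  outcome  = factor(rep(c("dead", "live", "dead", "live"), times = c(30, 70, 10, 10)),
                    levels = c("live", "dead")),
  highbp   = factor(rep("y", 120), levels = c("y", "n", "nk")),
  diabetes = factor(rep(c("n", "y"), times = c(100, 20)), levels = c("y", "n", "nk"))
)

rates <- death_rate_by(toy, c("highbp", "diabetes"))
print(rates)
```

In this toy example the larger group has more deaths but a lower death rate, which is one possible explanation for the surprising counts; whether it applies to the MONICA data would need to be checked with `death_rate_by(monica, c("highbp", "premi"))`.
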
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, Plot_One,warning=FALSE,message=FALSE}
ggplot(monica,aes(hosp,group = outcome)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..))) +
scale_y_continuous(labels = scales::percent) +
ylab("Relative Frequencies") +
xlab("Hospitalization") +
facet_grid(outcome ~ .) +
scale_x_discrete(labels = c(y = "Hospitalised", n = "Not Hospitalised")) +
labs(title = "Patient's Hospitalization status ",
subtitle = expression(paste(italic("by patient outcome")))) +
theme_gdocs()
```
### Description One:
The plot clearly shows a strong relationship between hospitalization and the survival of cardiovascular patients: all of those who survived were hospitalized, and most of those who died were not hospitalized.
### Plot Two:
```{r echo=FALSE, Plot_Two,warning=FALSE,message=FALSE}
ggplot(aes(x = age),data = monica) +
geom_histogram(aes(fill = outcome)) +
xlim(c(35,70)) +
xlab('Age') +
ylab('people count') +
labs(fill = 'Mortality outcome') +
theme_gdocs() +
ggtitle("Outcomes by age")
```
### Description Two:
The age of the patient also played a role in the outcome: the older the patient, the higher the risk of death.
### Plot Three:
```{r echo=FALSE, Plot_Three,warning=FALSE,message=FALSE}
# Keep only the deaths
Hbp_diabetes_outcome <- subset(monica, outcome != 'live')
# Count the deaths per high blood pressure status, year of onset and diabetes status
Hbp_diabetes_outcome <- Hbp_diabetes_outcome %>%
  group_by(highbp, yronset, diabetes) %>%
  summarise(n = n())
ggplot(aes(x = yronset, y = n), data = Hbp_diabetes_outcome) +
  # highbp:diabetes builds one factor level per combination of the two statuses
  geom_point(aes(color = highbp:diabetes)) +
  xlab('Years') +
  ylab('deaths count') +
  labs(color = 'High blood pressure: diabetes status ') +
  scale_color_brewer(type = 'div', palette = 9, direction = 1) +
  labs(title = "Deaths from cardiovascular diseases \n",
       subtitle = expression(
         paste(italic(" By High blood pressure and diabetes status "))))
```
### Description Three:
Generally, people with diabetes have a higher chance of having high blood pressure, which leads to cardiovascular issues.
The plot shows an interesting and surprising fact: high blood pressure alone is associated with far more deaths than diabetes and high blood pressure together. This discovery leads us to think that there must be some kind of relationship between diabetes and high blood pressure on the one hand and the deaths caused by cardiovascular diseases on the other. Or perhaps having both raises the chance of being diagnosed with at least one of them and, as a result, of receiving treatment that helps reduce the risk and seriousness of cardiovascular events. However, these are just assumptions, and since we don't have enough data to support them, they remain only hypotheses within the scope of this study.
# Reflection:
**Issues with this Analysis** :
The main issue was the huge amount of missing data for many variables, which made it difficult to identify which variables really influence the outcome and which variables are really related to each other. The nature of the variables (factors) was a further limitation: it would have been more interesting to have high blood pressure measurements instead of a categorical variable only.
**Surprises**
Some of the well-known causes of death in that period, like diabetes, didn't have much effect in this data.
**Conclusions** :
This dataset was challenging and surprising at the same time. In the beginning I thought that some variables like premi, angina and high cholesterol would play a significant role in causing deaths for cardiovascular patients. However, looking at the sources of the data and the nationalities of the patients (see: https://thl.fi/publications/monica/coredb/table1.htm), these countries had weak economies at the time, so the consumption of fatty meals, which is the first cause of high cholesterol, was not frequent. This may explain the weak relationship between cardiovascular patients' deaths and their cholesterol levels. <br>
A more interesting investigation could have been done if we had numerical data instead of categorical variables; models could then have been created for each variable's trends. <br>
The key factors in this investigation were the age of the patients, their high blood pressure status and their hospitalization status, all of which were related to the survival of these patients.