---
title: "Bellabeat Analysis - Google Capstone Project"
author: "Sarath Chandrika K"
output:
  html_document:
    df_print: paged
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Problem Statement
This analysis uses Fitbit smart device data to derive business insights that could help Bellabeat, a smart device company, increase sales and unlock new growth opportunities.
To achieve this, usage trends of smart fitness devices are collected and analysed, and the resulting insights are used to refine the business strategy.
The stakeholders in this analysis are:
Primary Stakeholders:
* Urska Srsen - Bellabeat’s Co-founder and Chief Creative Officer
* Sando Mur - Mathematician and Bellabeat’s Co-founder
Secondary Stakeholders:
* Bellabeat Marketing Analytics Team
# Data Source
The data comes from the [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit) set on Kaggle.
It contains personal fitness tracker data from 30 users in 18 CSV files, covering daily activity, daily calories, daily steps, heart rate, hourly calories, sleep and weight. All of this information is available for analysing the problem statement described above. The data is recorded at daily, hourly, minute and second level and is keyed by the Id assigned to each person. These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
# Data Credibility
The ROCCC criteria for a good data set are used to check the credibility of the data.
1. Reliable - The data source is not reliable: it covers only 30 participants, which is a major limitation for data analysis. The sample is likely biased and does not represent the whole population.
2. Original - The data set was collected through a survey on Amazon Mechanical Turk, so it reaches us as second- or third-party information and is not original.
3. Comprehensive - The dataset is not comprehensive. Information that matters for the problem statement, such as gender and age, is missing. This could lead to less accurate conclusions during the analysis.
4. Current - The dataset is not current. It is from 2016 and may not be a sound basis for a business strategy today.
5. Cited - The dataset is not cited. Only the name of the survey is mentioned, so it is difficult to confirm whether the source is credible.
# Data Storage
All the collected data is stored in spreadsheets (CSV files). With around 18 files, it is easier to preprocess the data in R than in spreadsheets alone. The data cleaning steps use the packages listed below. After pre-processing, the cleaned data is used for analysis and visualization.
## Loading csv files
```{r}
dailyactivity <-
read.csv("Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
dailycalories <-
read.csv("Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
dailyintensities <-
read.csv("Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
dailysteps <-
read.csv("Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
heartrate_second <-
read.csv("Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
hourlycalories <-
read.csv("Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
hourlyintensities <-
read.csv("Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
hourlySteps <-
read.csv("Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
minutecalories_narrow <-
read.csv("Fitabase Data 4.12.16-5.12.16/minuteCaloriesNarrow_merged.csv")
minutecalories_wide <-
read.csv("Fitabase Data 4.12.16-5.12.16/minuteCaloriesWide_merged.csv")
minuteintensities_narrow <-
read.csv("Fitabase Data 4.12.16-5.12.16/minuteIntensitiesNarrow_merged.csv")
minuteintensities_wide <-
read.csv("Fitabase Data 4.12.16-5.12.16/minuteIntensitiesWide_merged.csv")
minutemets_narrow <-
read.csv("Fitabase Data 4.12.16-5.12.16/minuteMETsNarrow_merged.csv")
minutesleep <-
read.csv("Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")
minutessteps_narrow <-
read.csv("Fitabase Data 4.12.16-5.12.16/minuteStepsNarrow_merged.csv")
minutesteps_wide <-
read.csv("Fitabase Data 4.12.16-5.12.16/minuteStepsWide_merged.csv")
sleepday <-
read.csv("Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weightlog <-
read.csv("Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
```
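The individual `read.csv()` calls above could also be written as a loop over the folder. The following is a minimal sketch of that idea, not the method used in this analysis. It builds a throwaway folder with two made-up CSV files so that it runs on its own; the names `demo_dir`, `csv_paths` and `tables` are hypothetical.

```r
# Create a temporary folder with two tiny, made-up CSV files (illustration only)
demo_dir <- file.path(tempdir(), "fitabase_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalSteps = c(1000, 2000)),
          file.path(demo_dir, "dailyActivity_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = 1, WeightKg = 70),
          file.path(demo_dir, "weightLogInfo_merged.csv"), row.names = FALSE)

# List every CSV in the folder and read each one into a named list
csv_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(csv_paths, read.csv)
names(tables) <- tools::file_path_sans_ext(basename(csv_paths))

# Each data frame is then available under its file name
names(tables)
nrow(tables[["dailyActivity_merged"]])
```

A named list keeps all tables together, so a check such as `sapply(tables, nrow)` can be run over every file at once.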
## R packages
```{r}
library(lubridate)
library(tidyr)
library(stringr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(corrplot)
library(ggcorrplot)
```
# Data Summary
After reviewing all 18 files, only 4 are used for the analysis. These 4 files hold, or can be aggregated to, whole-day data. The remaining files hold hour- and minute-level data, which is not expected to play an important role here. The four files and their data frame names are:
1. dailyActivity_merged.csv - dailyactivity
2. heartrate_seconds_merged.csv - heartrate_second
3. sleepDay_merged.csv - sleepday
4. weightLogInfo_merged.csv - weightlog
# Data Preparation
## Splitting Date and Time into separate columns in dataframes
```{r}
heartrate_second<-
heartrate_second %>%
separate(Time, c("Date", "Time"), " ")
sleepday <-
sleepday %>%
separate(SleepDay, c("Date", "Time"), " ")
weightlog <-
weightlog %>%
separate(Date, c("Date", "Time"), " ")
```
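Splitting on the space keeps the date and the time as plain text. Where real date-time values are needed, the string can be parsed instead. The following is a minimal base R sketch, assuming the timestamps are written as month/day/year with a 12-hour clock (for example `4/12/2016 7:21:00 AM`) and an English locale for the AM/PM marker. The sample values and the names `raw_time` and `parsed` are made up.

```r
# Made-up timestamps in the assumed month/day/year, 12-hour format
raw_time <- c("4/12/2016 7:21:00 AM", "5/2/2016 11:59:59 PM")

# Parse to POSIXct; %I is the 12-hour clock hour and %p the AM/PM marker
parsed <- as.POSIXct(raw_time, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Derive a proper Date and the hour of day (0-23) from the parsed values
as.Date(parsed)
as.integer(format(parsed, "%H"))
```

If the timestamps do have this form, a split on the space produces three pieces (date, clock time, AM/PM), so parsing avoids losing the AM/PM marker.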
## Calculating average heartbeat in a day for each person
```{r}
heartbeat_daily <-
tibble(heartrate_second %>%
group_by(Date, Id) %>%
summarise(MeanHeartBeat=(mean(Value))))
```
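To make the grouping step concrete, here is a small base R sketch of the same per-person, per-day average on made-up readings. The data frame `toy_hr` and its values are invented for illustration.

```r
# Made-up heart rate readings: two people on the same day
toy_hr <- data.frame(
  Id    = c(1, 1, 2, 2),
  Date  = c("4/12/2016", "4/12/2016", "4/12/2016", "4/12/2016"),
  Value = c(60, 80, 90, 100)
)

# Average Value within each Date/Id combination
toy_daily <- aggregate(Value ~ Date + Id, data = toy_hr, FUN = mean)

# Rename the averaged column to match the name used in the analysis
names(toy_daily)[names(toy_daily) == "Value"] <- "MeanHeartBeat"
toy_daily
```

Each row of the result is one person on one day, which is the shape needed later for merging with the daily activity data.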
## Dividing and grouping heart data time into morning, afternoon, evening, night
```{r}
heartrate_time <-
read_csv("Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
# Timestamps are month/day/year (the export spans 4.12.16-5.12.16), so parse with mdy_hms
heartrate_time$time <- mdy_hms(heartrate_time$Time)
heartrate_time <- na.omit(heartrate_time)
# Hour boundaries 6, 12, 16, 19, 23; hours before 6 fall outside the bins and become NA
breaks <- hour(hm("6:00", "12:00", "16:00", "19:00", "23:59"))
labels <- c("Morning", "Afternoon", "Evening", "Night")
heartrate_time$Time_of_day <- cut(x=hour(heartrate_time$time), breaks = breaks, labels = labels, include.lowest=TRUE)
heartrate_time <- heartrate_time %>% drop_na()
```
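It helps to see exactly which hours land in which group. The sketch below applies the same break points (6, 12, 16, 19, 23) to a few sample hours. The vector `sample_hours` is made up, and `hour_breaks` and `hour_labels` are illustrative names.

```r
# Same boundaries and labels as in the chunk above
hour_breaks <- c(6, 12, 16, 19, 23)
hour_labels <- c("Morning", "Afternoon", "Evening", "Night")

# A few made-up hours of the day, including ones outside the bins
sample_hours <- c(0, 5, 6, 12, 13, 16, 17, 19, 20, 23)

# Intervals are closed on the right; include.lowest also admits hour 6
bins <- cut(sample_hours, breaks = hour_breaks, labels = hour_labels,
            include.lowest = TRUE)
data.frame(hour = sample_hours, bin = bins)
```

Hours 0 to 5 receive `NA` and are then removed by `drop_na()`, so readings taken between midnight and 6 am do not belong to any group.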
### Grouping
```{r}
heartbeat_grouping <-
tibble(heartrate_time %>%
group_by(Time_of_day) %>%
summarise(MeanValue=(mean(Value))))
heartbeat_grouping <- heartbeat_grouping %>% drop_na()
```
## Finding duplicates in each data frame
```{r}
nrow(dailyactivity[duplicated(dailyactivity),])
nrow(heartbeat_daily[duplicated(heartbeat_daily),])
nrow(sleepday[duplicated(sleepday),])
nrow(weightlog[duplicated(weightlog),])
```
## Removing duplicates from sleepday dataframe
```{r}
# The de-duplicated rows are stored in `sleepdata`; later chunks continue to read `sleepday`
sleepdata <- dplyr::distinct(sleepday)
```
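The two steps above, counting duplicate rows and then dropping them, can be tried on a small example. This is a base R sketch on made-up data; `toy_sleep` and `toy_sleep_clean` are invented names.

```r
# Made-up sleep log in which the second row repeats the first
toy_sleep <- data.frame(
  Id = c(1, 1, 2),
  Date = c("4/12/2016", "4/12/2016", "4/12/2016"),
  TotalMinutesAsleep = c(400, 400, 380)
)

# duplicated() flags every repeat of an earlier row
sum(duplicated(toy_sleep))

# unique() keeps the first copy of each row
toy_sleep_clean <- unique(toy_sleep)
nrow(toy_sleep_clean)
```

Comparing `nrow()` before and after is a quick way to confirm that the number of removed rows equals the number of duplicates found.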
## Finding null values in dataframes
```{r}
which(is.na(dailyactivity))
which(is.na(heartbeat_daily))
which(is.na(sleepday))
which(is.na(weightlog))
```
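`which(is.na(...))` returns positions in the flattened data frame, which can be hard to read. A count per column is often easier to interpret. The following is a small sketch on made-up data; `toy_weight` and `na_counts` are invented names.

```r
# Made-up weight log with a mostly empty Fat column
toy_weight <- data.frame(
  Id = c(1, 2, 3),
  WeightKg = c(70, 65, 80),
  Fat = c(NA, 22, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_weight))
na_counts

# Share of missing values in each column
colMeans(is.na(toy_weight))
```

A column whose share of missing values is close to 1 is a candidate for removal.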
Most of the values in the Fat column of the weight log are null, so that column is removed.
```{r}
weightlog <- select(weightlog, -Fat)
```
## Creating a data frame with common users from all the data frames
```{r}
combined_df <- merge(dailyactivity, sleepday, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
combined_df <- merge(combined_df, weightlog, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
combined_df <- merge(combined_df, heartbeat_daily, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
```
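By default `merge()` keeps only the rows whose keys appear in both inputs, which is why the combined data frame can end up with few users. A minimal sketch on made-up data follows; `toy_activity`, `toy_sleep`, `inner` and `left` are invented names.

```r
# Made-up activity and sleep tables
toy_activity <- data.frame(
  Id = c(1, 2, 3),
  ActivityDate = c("4/12/2016", "4/12/2016", "4/12/2016"),
  TotalSteps = c(5000, 7000, 9000)
)
toy_sleep <- data.frame(
  Id = c(1, 3),
  Date = c("4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(420, 390)
)

# Inner join: only Id 1 has the same date in both tables
inner <- merge(toy_activity, toy_sleep,
               by.x = c("Id", "ActivityDate"), by.y = c("Id", "Date"))

# all.x = TRUE keeps every activity row and fills the gaps with NA
left <- merge(toy_activity, toy_sleep,
              by.x = c("Id", "ActivityDate"), by.y = c("Id", "Date"),
              all.x = TRUE)
nrow(inner)
nrow(left)
```

Whether to keep unmatched rows depends on the question: an inner join suits comparisons that need every feature, while `all.x = TRUE` preserves the larger activity sample.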
# Data Analysis
The analysis is based on the prepared data frames. The observations and conclusions below are drawn from the summary of each data frame.
## Dimensions of data frames
```{r}
dim(dailyactivity)
dim(heartbeat_daily)
dim(sleepday)
dim(weightlog)
```
## Unique number of persons in each data frame
```{r}
length(unique(weightlog$Id))
length(unique(heartbeat_daily$Id))
length(unique(sleepday$Id))
length(unique(dailyactivity$Id))
length(unique(combined_df$Id))
```
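The users present in every data frame can also be found directly from the Id columns. The sketch below uses made-up Id vectors in place of the real columns; `ids_activity`, `ids_sleep`, `ids_weight` and `common_ids` are hypothetical names.

```r
# Made-up Id vectors standing in for the Id column of each data frame
ids_activity <- c(1, 2, 3, 4)
ids_sleep    <- c(1, 2, 4)
ids_weight   <- c(2, 4, 5)

# Intersect the vectors one after another to keep Ids found in every one
common_ids <- Reduce(intersect, list(ids_activity, ids_sleep, ids_weight))
common_ids
length(common_ids)
```

This compares Ids only, whereas the merge above also requires matching dates, so the two counts can differ.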
At most 33 participants contributed to any one data frame, and only 3 persons contributed data for all the features.
## Data frames - Summary
```{r}
dailyactivity %>%
select(TotalSteps,
TotalDistance,
TrackerDistance,
SedentaryMinutes,
VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes,
Calories) %>%
summary()
```
### Observations for Daily Activity data
* The average number of total steps per person is 7638 per day.
* The summary statistics for tracker distance and total distance are the same.
* On average, a person burns 2304 calories per day.
* Number of people who provided data: 33
```{r}
sleepday %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
```
### Observations for Sleep data
* The mean, median and mode of Total Sleep Records are all around 1.
* Average sleeping time is 419.5 min (about 7 hr).
* Average total time in bed is 458.6 min.
* Number of people who provided data: 24
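The sleep figures above can be turned into a few derived quantities with simple arithmetic. The sketch below uses only the two averages reported in the observations; the variable names are illustrative.

```r
mean_asleep_min <- 419.5   # average minutes asleep, from the observations above
mean_in_bed_min <- 458.6   # average minutes in bed, from the observations above

hours_asleep   <- mean_asleep_min / 60
awake_in_bed   <- mean_in_bed_min - mean_asleep_min
sleep_fraction <- mean_asleep_min / mean_in_bed_min

round(hours_asleep, 2)          # hours asleep: 6.99
round(awake_in_bed, 1)          # minutes in bed but not asleep: 39.1
round(100 * sleep_fraction, 1)  # percent of time in bed spent asleep: 91.5
```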
```{r}
summary(heartbeat_daily$MeanHeartBeat)
```
### Observations for Mean Heart beat data
* Average value of heart rate is around 77
* Median value of heart rate is around 73
* Number of people who provided data: 14
```{r}
weightlog %>%
select(WeightKg,
BMI) %>%
summary()
```
### Observations for Weight data
* Mean weight is 72 kg.
* Mean BMI is 25.19.
* Mean Fat is 23.50, taken from the few non-null values in that column before it was removed.
* Number of people who provided data: 8
```{r}
heartrate_time %>%
select(Value,
Time_of_day) %>%
summary()
```
### Observations for Heart rate data based on time of the day
* By count, most of the records fall in the morning.
# Data Visualization
```{r}
fig <- ggplot(data=dailyactivity, aes(x=TotalSteps, y=Calories)) +
geom_point() +
geom_smooth(method=lm) +
labs(title="Total Steps VS Calories")
plot(fig)
```
The figure above shows that TotalSteps and Calories are positively correlated: as total steps increase, the number of calories burnt also increases. In a few cases, calories burnt are high even though total steps are not.
```{r}
ggplot(data=dailyactivity, aes(x=TrackerDistance, y=TotalDistance)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = "Tracker Distance VS Total Distance")
```
Total distance and tracker distance are almost the same, which indicates that the distance recorded by the Fitbit tracker is consistent with the total distance logged.
```{r, figures-side1, fig.show="hold", out.width="33.33%"}
ggplot(data=dailyactivity, aes(x=VeryActiveMinutes , y=Calories)) +
geom_point() +
geom_smooth(orientation = "x") +
labs(title = "Very Active Minutes VS Calories")
ggplot(data=dailyactivity, aes(x=FairlyActiveMinutes, y=Calories)) +
geom_point() +
geom_smooth(orientation = "x") +
labs(title = "Fairly Active Minutes VS Calories")
ggplot(data=dailyactivity, aes(x=LightlyActiveMinutes, y=Calories)) +
geom_point() +
geom_smooth(orientation = "x") +
labs(title = "Lightly Active Minutes vs Calories")
```
* Of the three graphs above (Very Active, Fairly Active and Lightly Active Minutes vs Calories), the smoothed curve for fairly active minutes suggests a negative correlation with calories.
* The points are most widely distributed in the Lightly Active Minutes graph.
* Most of the points for very active minutes lie close to 0.
### Correlation Matrix
```{r}
selected_columns <- select(dailyactivity, TotalSteps, TotalDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, Calories)
corr = cor(selected_columns)
corrplot(corr, method = 'number', tl.cex = 0.6)
```
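To read the Calories relationships without the plot, the relevant column of the correlation matrix can be extracted and sorted. The following is a minimal sketch on made-up data; the numbers are invented and chosen to be exactly linear, and `toy`, `corr_toy` and `with_calories` are hypothetical names.

```r
# Made-up daily values: steps rise with calories, sedentary minutes fall
toy <- data.frame(
  TotalSteps       = c(1000, 2000, 3000, 4000),
  SedentaryMinutes = c(900, 800, 700, 600),
  Calories         = c(1500, 1700, 1900, 2100)
)

# Pearson correlation between every pair of columns
corr_toy <- cor(toy)

# Correlation of each column with Calories, strongest first
with_calories <- sort(corr_toy[, "Calories"], decreasing = TRUE)
with_calories[names(with_calories) != "Calories"]
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.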
Total Steps, Total Distance, Very Active Minutes have high correlation values with Calories.
```{r, out.width="60%", fig.align='center'}
ggplot(data = heartbeat_grouping, aes(x=Time_of_day, y=MeanValue)) +
geom_bar(stat = "identity") +
labs(title="Time of Day vs Mean Heart Value")
```
Even though most of the heart rate values were recorded in the morning, the average heart rate is low at night, roughly 7 pm to midnight.
```{r}
ggplot(data=sleepday, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
geom_point() +
geom_smooth(method=lm) +
labs(title="Total Minutes Asleep vs Total Time in Bed")
```
```{r, figures-side, fig.show="hold", out.width="33.33%"}
ggplot(data=combined_df, aes(x=VeryActiveMinutes, y=TotalMinutesAsleep)) +
geom_point() +
geom_smooth(orientation = "x") +
labs(title = "Very Active Minutes vs Total Minutes Asleep")
ggplot(data=combined_df, aes(x=FairlyActiveMinutes, y=TotalMinutesAsleep)) +
geom_point() +
geom_smooth(orientation = "x") +
labs(title = "Fairly Active Minutes vs Total Minutes Asleep")
ggplot(data=combined_df, aes(x=LightlyActiveMinutes, y=TotalMinutesAsleep)) +
geom_point() +
geom_smooth(orientation = "x") +
labs(title = "Lightly Active Minutes vs Total Minutes Asleep")
```
# Conclusions from Data Analysis and Data Visualization
* The mean and median heart rate (about 77 and 73) fall within the commonly cited normal resting range of 60-100 beats per minute.
* There is no big difference between total time in bed and total sleeping time, which is a good sign for people's health, and the average sleeping time is around 7 hr. It can be concluded that most people get sufficient sleep.
* The ideal BMI range for normal weight status is 18.5 - 24.9. The mean BMI of 25.19 lies just above it, so on average the participants sit at the boundary between normal weight and overweight.
* The tracker's distance measurement is consistent, since the tracker distance and the total distance are nearly the same.
* The number of people who contributed to the dataset is low, and data was not collected from every participant for every parameter/feature.
* Only 3 people contributed data for all features, which is too few to draw reliable conclusions from.
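The BMI comparison above can be made explicit with a small classifier. This is a sketch assuming the commonly used adult cut-offs (below 18.5 underweight, 18.5 to under 25 normal, 25 to under 30 overweight, 30 and above obese). The function name `bmi_category` is hypothetical.

```r
bmi_category <- function(bmi) {
  # right = FALSE makes each interval closed on the left, e.g. [18.5, 25)
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Normal", "Overweight", "Obese"),
      right = FALSE)
}

# Mean BMI reported in the weight observations
as.character(bmi_category(25.19))

# A few made-up values around the boundaries
as.character(bmi_category(c(18.4, 18.5, 24.9, 30)))
```

Applied to each weight record instead of the mean, the same function would show how many entries fall in each category.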
# Conclusions for Bellabeat Marketing Analytics Team
* Based on the daily activity correlations, the company can send push notifications at frequent intervals reminding users to be active, since more movement (total steps, total distance) goes with more calories burnt.
* A feature can be developed to set a minimum movement target and monitor progress against it regularly.
* The maximum heart rate is around 200, which is quite high, so the company can develop alerts for abnormal heart rate changes that occur outside very active minutes.
* To help users keep up the average total sleep time of 7 hr, the company can develop a bedtime reminder feature based on each individual's sleeping time.
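The heart rate alert suggested above could follow a simple rule: flag readings above a limit unless the user is currently very active. The sketch below is only an illustration. The function `flag_high_heart_rate`, the 150 bpm limit and the sample readings are all hypothetical and are not derived from the data.

```r
flag_high_heart_rate <- function(heart_rate, very_active, limit = 150) {
  # Alert only when the rate is above the limit and the user is not exercising
  heart_rate > limit & !very_active
}

# Made-up readings with a flag for very active periods
readings <- data.frame(
  heart_rate  = c(72, 180, 165, 95),
  very_active = c(FALSE, TRUE, FALSE, FALSE)
)
readings$alert <- flag_high_heart_rate(readings$heart_rate, readings$very_active)
readings
```

Only the third reading is flagged: the second is just as high but coincides with very active minutes. A real limit would need to be chosen per user, for example by age.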
In terms of additional data, the age and gender of the individuals are not recorded, although they play an important role. The analysis would be stronger if data were collected for all the features from everyone, since only 3 people contribute to the complete dataset.