---
title: "Bellabeat Analysis - Google Capstone Project"
author: "Sarath Chandrika K"
output:
  html_document:
    df_print: paged
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Problem Statement

This analysis uses Fitbit smart-device data to derive business insights that could help the Bellabeat smart watch company increase sales and unlock new growth opportunities.

To achieve this, usage data from smart-device users is collected and analysed, and insights are drawn from the trends. The business strategy can then be improved based on these insights.

The stakeholders in this analysis are listed below.

Primary Stakeholders:

* Urska Srsen - Bellabeat’s co-founder and Chief Creative Officer
* Sando Mur - Mathematician and Bellabeat’s co-founder

Secondary Stakeholders:

* Bellabeat Marketing Analytics Team

# Data Source

The data comes from Kaggle: [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit). It contains personal fitness tracker data from 30 users in 18 csv files, covering daily activity, daily calories, daily steps, heart rate, hourly calories, sleep and weight. All of this information is used to analyse and solve the problem statement described above. The data is provided at daily, hourly and minute level, keyed by the id assigned to each person. These datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

# Data Credibility

The ROCCC criteria, which describe a good data set, can be used to check data credibility.

1. Reliable - The data source is not reliable: it covers only 30 participants, which is a major limitation for the analysis. The sample is likely biased and does not represent the whole population.
2. Original - The data set was collected through a survey on Amazon Mechanical Turk, so it is second- or third-party information and not original.
3. Comprehensive - The data set is not comprehensive. Much of the information needed to solve the problem statement is missing; for example, gender and age are not recorded. This could lead to less accurate conclusions in the analysis.
4. Current - The data set is not current. It dates from 2016, so it may be of limited use for shaping a business strategy now.
5. Cited - The data set is not cited. Only the name of the survey is mentioned, so it is difficult to confirm whether the source is credible.

# Data Storage

All the collected data is stored in spreadsheets (csv files). With around 18 files, it is more practical to preprocess the data in R than in spreadsheets alone. The data cleaning steps are carried out with the packages listed below. After pre-processing, the modified data is used for analysis and for creating visualizations.

## Loading csv files

```{r}
dailyactivity <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
dailycalories <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
dailyintensities <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
dailysteps <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
heartrate_second <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
hourlycalories <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
hourlyintensities <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
hourlySteps <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
minutecalories_narrow <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/minuteCaloriesNarrow_merged.csv")
minutecalories_wide <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/minuteCaloriesWide_merged.csv")
minuteintensities_narrow <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/minuteIntensitiesNarrow_merged.csv")
minuteintensities_wide <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/minuteIntensitiesWide_merged.csv")
minutemets_narrow <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/minuteMETsNarrow_merged.csv")
minutesleep <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")
minutessteps_narrow <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/minuteStepsNarrow_merged.csv")
minutesteps_wide <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/minuteStepsWide_merged.csv")
sleepday <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weightlog <- 
  read.csv("Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
```
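
The individual `read.csv()` calls above could also be written as a loop over the folder. The following is a minimal sketch on a temporary folder with two made-up files, so it runs without the Kaggle download. The names `toy_dir`, `toy_files` and `toy_tables` are hypothetical, and the naming rule (lower-case file name without `_merged.csv`) is an assumption, not the convention used in the rest of this report.

```{r}
# Build a throwaway folder with two tiny CSV files (made-up values).
toy_dir <- file.path(tempdir(), "toy_fitabase")
dir.create(toy_dir, showWarnings = FALSE)
write.csv(data.frame(Id = 1:2, TotalSteps = c(1000, 2000)),
          file.path(toy_dir, "dailyActivity_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = 1:2, TotalMinutesAsleep = c(400, 420)),
          file.path(toy_dir, "sleepDay_merged.csv"), row.names = FALSE)

# List every CSV in the folder and read each one into a named list.
toy_files <- list.files(toy_dir, pattern = "\\.csv$", full.names = TRUE)
toy_tables <- lapply(toy_files, read.csv)
names(toy_tables) <- tolower(sub("_merged\\.csv$", "", basename(toy_files)))

names(toy_tables)
```

Keeping the tables in one list makes it easy to apply the same check, such as `dim()` or a duplicate count, to every file with `lapply()`.
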
## R packages

```{r}
library(lubridate)
library(tidyr)
library(stringr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(corrplot)
library(ggcorrplot)
```
# Data Summary

After inspecting all the data files, only 4 of the 18 files are used for the analysis. Three of them hold day-level data, and the second-level heart rate file is aggregated to daily means in the Data Preparation section. The remaining files hold hour- and minute-level data, which is not expected to play an important role. The four files and their data frame names are:

1. dailyActivity_merged.csv - dailyactivity
2. heartrate_seconds_merged.csv - heartrate_second
3. sleepDay_merged.csv - sleepday
4. weightLogInfo_merged.csv - weightlog

# Data Preparation

## Splitting Date and Time into separate columns in dataframes
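```{r}
# Timestamps look like "4/12/2016 12:00:00 AM". Split at the first space only,
# so the date goes to Date and the rest (including AM/PM) stays in Time.
heartrate_second <-
  heartrate_second %>%
  separate(Time, c("Date", "Time"), sep = " ", extra = "merge")
sleepday <-
  sleepday %>%
  separate(SleepDay, c("Date", "Time"), sep = " ", extra = "merge")
weightlog <-
  weightlog %>%
  separate(Date, c("Date", "Time"), sep = " ", extra = "merge")
```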
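
Fitabase timestamps such as `4/12/2016 12:00:00 AM` contain two spaces, so a split on `" "` yields three pieces. The sketch below, on made-up rows, shows how `tidyr::separate()` behaves with `extra = "merge"`, which keeps the AM/PM marker attached to the time. `toy_stamps` and `toy_split` are hypothetical names used only for illustration.

```{r}
library(tidyr)

# Two made-up rows in the same timestamp layout as the sleep file.
toy_stamps <- data.frame(
  Id = c(1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 7:21:05 PM")
)

# With extra = "merge", everything after the first space stays in Time.
toy_split <- separate(toy_stamps, SleepDay, into = c("Date", "Time"),
                      sep = " ", extra = "merge")
toy_split
```

Without `extra = "merge"`, `separate()` keeps only the first two pieces and warns that the additional piece was discarded.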

## Calculating average heartbeat in a day for each person
```{r}
# One row per person and day, holding the mean of that day's readings.
heartbeat_daily <-
  heartrate_second %>%
  group_by(Date, Id) %>%
  summarise(MeanHeartBeat = mean(Value), .groups = "drop")
```
## Dividing and grouping heart data time into morning, afternoon, evening, night 

```{r}
heartrate_time <-
  read_csv("Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
# Timestamps are month/day/year (the export covers 4.12.16 to 5.12.16),
# so they are parsed with mdy_hms() rather than dmy_hms().
heartrate_time$time <- mdy_hms(heartrate_time$Time)
# Half-open bins: Morning 6-11, Afternoon 12-15, Evening 16-18, Night 19-23.
breaks <- c(6, 12, 16, 19, 24)
labels <- c("Morning", "Afternoon", "Evening", "Night")
heartrate_time$Time_of_day <- cut(x = hour(heartrate_time$time), breaks = breaks,
                                  labels = labels, right = FALSE)
# Readings between midnight and 6 am fall outside the bins and are dropped here.
heartrate_time <- heartrate_time %>% drop_na()
```
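
The binning of hours into parts of the day can be checked on a handful of values. This is a small sketch in base R, using half-open bins (`right = FALSE`) as an assumption about the intended boundaries. With `cut()`'s default of `right = TRUE`, hour 12 would fall into Morning instead of Afternoon. `toy_hours` and `toy_part` are hypothetical names.

```{r}
toy_hours <- c(0, 6, 11, 12, 15, 16, 18, 19, 23)

# right = FALSE makes each bin include its lower bound:
# [6,12), [12,16), [16,19), [19,24).
toy_part <- cut(toy_hours,
                breaks = c(6, 12, 16, 19, 24),
                labels = c("Morning", "Afternoon", "Evening", "Night"),
                right = FALSE)

data.frame(hour = toy_hours, part = toy_part)
```

Hours before 6 lie outside every bin and come back as `NA`, which is why a later `drop_na()` removes the readings taken between midnight and 6 am.
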
### Grouping
```{r}
heartbeat_grouping <- 
  tibble(heartrate_time %>%
           group_by(Time_of_day) %>%
           summarise(MeanValue=(mean(Value))))
heartbeat_grouping <- heartbeat_grouping %>% drop_na()

```

## Finding duplicates in each data frame
```{r}
nrow(dailyactivity[duplicated(dailyactivity),])
nrow(heartbeat_daily[duplicated(heartbeat_daily),])
nrow(sleepday[duplicated(sleepday),])
nrow(weightlog[duplicated(weightlog),])

```

## Removing duplicates from sleepday dataframe
```{r}
# Assign back to sleepday so that the later chunks use the de-duplicated rows.
sleepday <- dplyr::distinct(sleepday)
```
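
The duplicate check and removal can be illustrated on a tiny made-up table. This sketch uses base R (`duplicated()` and `unique()`); `dplyr::distinct()` gives the same rows for a plain data frame. `toy_sleep` is a hypothetical name. Note that the result has to be assigned back to the name that later chunks use, otherwise the duplicates remain in the analysis.

```{r}
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  Date = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(400, 400, 380, 410)
)

# duplicated() marks the second and later copies of a fully identical row.
toy_dup_count <- sum(duplicated(toy_sleep))
toy_dup_count

# Overwrite the same object so that later steps see the de-duplicated rows.
toy_sleep <- unique(toy_sleep)
nrow(toy_sleep)
```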

## Finding null values in dataframes
```{r}
which(is.na(dailyactivity))
which(is.na(heartbeat_daily))
which(is.na(sleepday))
which(is.na(weightlog))
```

Most of the values in the Fat column of the weight log are missing (NA), so that column is removed.

```{r}
weightlog <- select(weightlog, -Fat)
```
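
The missing-value check and the decision to drop a column can be made more explicit by counting `NA` values per column. The sketch below uses a made-up table; `toy_weight` and the 50% cut-off are assumptions for illustration, not values taken from this analysis.

```{r}
toy_weight <- data.frame(
  Id = 1:4,
  WeightKg = c(52.6, 72.4, NA, 90.7),
  Fat = c(22, NA, NA, NA)
)

# Number of missing values in each column.
toy_na_count <- colSums(is.na(toy_weight))
toy_na_count

# Share of missing values, to decide whether a column is worth keeping.
toy_na_share <- colMeans(is.na(toy_weight))
toy_mostly_missing <- names(toy_na_share)[toy_na_share > 0.5]
toy_mostly_missing
```

Unlike `which(is.na(df))`, which returns positions in the flattened table, the per-column counts show directly which variable is affected.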

## Creating a data frame with common users from all the data frames

```{r}
combined_df <- merge(dailyactivity, sleepday, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
combined_df <- merge(combined_df, weightlog, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
combined_df <- merge(combined_df, heartbeat_daily, by.x=c("Id", "ActivityDate"), by.y=c("Id", "Date"))
```
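
`merge()` performs an inner join by default, so each merge above keeps only the person-days present in both tables. The following sketch on made-up rows shows the difference to a left join (`all.x = TRUE`); all `toy_` names are hypothetical.

```{r}
toy_activity <- data.frame(
  Id = c(1, 1, 2, 3),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016"),
  TotalSteps = c(5000, 7000, 3000, 9000)
)
toy_sleep_days <- data.frame(
  Id = c(1, 2, 2),
  Date = c("4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(400, 380, 410)
)

# Inner join (default): unmatched rows on either side are dropped.
toy_inner <- merge(toy_activity, toy_sleep_days,
                   by.x = c("Id", "ActivityDate"), by.y = c("Id", "Date"))
nrow(toy_inner)

# Left join: every activity row is kept, missing sleep values become NA.
toy_left <- merge(toy_activity, toy_sleep_days,
                  by.x = c("Id", "ActivityDate"), by.y = c("Id", "Date"),
                  all.x = TRUE)
nrow(toy_left)
```

Because every additional inner join can only remove rows, chaining three of them explains why the combined data frame ends up with very few participants.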

# Data Analysis

The analysis is based on the prepared data frames. The observations and conclusions below follow from the summaries of these data frames.

## Dimensions of data frames
```{r}
dim(dailyactivity)
dim(heartbeat_daily)
dim(sleepday)
dim(weightlog)
```
## Unique number of persons in each data frame

```{r}
length(unique(weightlog$Id))
length(unique(heartbeat_daily$Id))
length(unique(sleepday$Id))
length(unique(dailyactivity$Id))
length(unique(combined_df$Id))

```
At most 33 participants contributed to any one data frame, and only 3 participants contributed data for all the features.
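
The overlap between tables can also be computed directly from the id vectors. This is a minimal sketch with made-up ids; `toy_ids` and `toy_common` are hypothetical names.

```{r}
toy_ids <- list(
  activity  = c(101, 102, 103, 104, 105),
  sleep     = c(101, 102, 104),
  weight    = c(102, 104, 105),
  heartrate = c(102, 103, 104)
)

# Distinct participants per table.
sapply(toy_ids, function(ids) length(unique(ids)))

# Participants present in every table.
toy_common <- Reduce(intersect, toy_ids)
toy_common
```

This counts participants who appear in every table at all, whereas a merge on `Id` and date additionally requires records on the same day, so it can return fewer participants.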

## Data frames - Summary 

```{r}
dailyactivity %>%  
  select(TotalSteps,
         TotalDistance,
         TrackerDistance,
         SedentaryMinutes,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         Calories) %>%
  summary()
```
### Observations for Daily Activity data

* The average number of total steps is 7638 per person per day.
* The summary statistics of tracker distance and total distance are the same.
* On average, a person burns 2304 calories per day.
* Number of people who provided data: 33


```{r}
sleepday %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
```
### Observations for Sleep data

* The mean, median and mode of Total Sleep Records are all around 1.
* The average sleeping time is 419.5 min (about 7 hours).
* The average total time in bed is 458.6 min.
* Number of people who provided data: 24
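
The two sleep averages can be combined into a rough measure of how much of the time in bed is spent asleep. The arithmetic below uses only the two means reported above. It is a ratio of means, not a per-night average, so it should be read as an approximation. The variable names are hypothetical.

```{r}
# Means reported in the observations above.
mean_minutes_asleep <- 419.5
mean_minutes_in_bed <- 458.6

# Minutes in bed but not asleep, and the share of bed time spent asleep.
awake_in_bed <- mean_minutes_in_bed - mean_minutes_asleep
sleep_share  <- mean_minutes_asleep / mean_minutes_in_bed

round(awake_in_bed, 1)
round(100 * sleep_share, 1)
```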

```{r}
summary(heartbeat_daily$MeanHeartBeat)
```
### Observations for Mean Heart beat data 

* The mean heart rate is around 77 beats per minute.
* The median heart rate is around 73 beats per minute.
* Number of people who provided data: 14

```{r}
weightlog %>%
  select(WeightKg,
         BMI) %>%
  summary()
```
### Observations for Weight data

* The mean weight is 72 kg.
* The mean BMI is 25.19.
* The mean Fat is 23.50, based on the few non-missing values available before the column was removed.
* Number of people who provided data: 8

```{r}
heartrate_time %>%
  select(Value,
         Time_of_day) %>%
  summary()
```
### Observations for Heart rate data based on time of the day

* By count, most of the records fall in the morning.


# Data Visualization

```{r}
fig <- ggplot(data=dailyactivity, aes(x=TotalSteps, y=Calories)) + 
  geom_point() + 
  geom_smooth(method=lm) +
  labs(title="Total Steps VS Calories")
plot(fig)
```
The figure above shows that TotalSteps and Calories are positively correlated: as total steps increase, calories burnt also increase. In a few cases, calories burnt are high even though total steps are not.

```{r}
ggplot(data=dailyactivity, aes(x=TrackerDistance, y=TotalDistance)) +
  geom_point() + 
  geom_smooth(method = lm) +
  labs(title = "Tracker Distance VS Total Distance")
```
Total Distance and Tracker Distance are almost the same. This suggests that the distance measured by the tracker is consistent with the total recorded distance.

```{r, figures-side1, fig.show="hold", out.width="33.33%"}
ggplot(data=dailyactivity, aes(x=VeryActiveMinutes , y=Calories)) +
  geom_point() + 
  geom_smooth(orientation = "x") +
  labs(title = "Very Active Minutes VS Calories")

ggplot(data=dailyactivity, aes(x=FairlyActiveMinutes, y=Calories)) +
  geom_point() + 
  geom_smooth(orientation = "x") +
  labs(title = "Fairly Active Minutes VS Calories")

ggplot(data=dailyactivity, aes(x=LightlyActiveMinutes, y=Calories)) +
  geom_point() + 
  geom_smooth(orientation = "x") +
  labs(title = "Lightly Active Minutes vs Calories")
```

* Of the three graphs above (Lightly, Fairly and Very Active Minutes vs Calories), Fairly Active Minutes appears to be negatively correlated with Calories.
* The calorie values are more widely distributed across Lightly Active Minutes.
* Most of the Very Active Minutes values are concentrated around 0.

### Correlation Matrix 

```{r}
selected_columns <- select(dailyactivity,
                           TotalSteps, TotalDistance,
                           VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance,
                           VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes,
                           Calories)
corr <- cor(selected_columns)
corrplot(corr, method = 'number', tl.cex = 0.6)
```

Total Steps, Total Distance and Very Active Minutes have the highest correlation values with Calories.
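
Each number in a correlation matrix is a Pearson coefficient: the covariance of two variables divided by the product of their standard deviations. The sketch below reproduces one such value by hand on made-up numbers (not taken from the dataset) and compares it with `cor()`; the `toy_` names are hypothetical.

```{r}
# Made-up values for illustration only; not taken from the dataset.
toy_steps    <- c(2000, 4000, 6000, 8000, 10000)
toy_calories <- c(1800, 2000, 2150, 2400, 2600)

# Pearson correlation from its definition.
toy_r_manual <- cov(toy_steps, toy_calories) /
  (sd(toy_steps) * sd(toy_calories))

# cor() uses the same definition by default (method = "pearson").
toy_r <- cor(toy_steps, toy_calories)

c(manual = toy_r_manual, builtin = toy_r)
```

A value close to 1 means the points lie near a rising straight line. It says nothing about whether one variable causes the other.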

```{r, out.width="60%", fig.align='center'}
ggplot(data = heartbeat_grouping, aes(x=Time_of_day, y=MeanValue)) + 
  geom_bar(stat = "identity") + 
  labs(title="Time of Day vs Mean Heart Value")
```
Even though most of the heart rate values were recorded in the morning, the average heart rate is low at night, around 7 pm to 12 am.

```{r}
ggplot(data=sleepday, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + 
  geom_point() + 
  geom_smooth(method=lm) +
  labs(title="Total Minutes Asleep vs Total Time in Bed") 
```

```{r, figures-side, fig.show="hold", out.width="33.33%"}
ggplot(data=combined_df, aes(x=VeryActiveMinutes, y=TotalMinutesAsleep)) +
  geom_point() + 
  geom_smooth(orientation = "x") +
  labs(title = "Very Active Minutes vs Total Minutes Asleep")

ggplot(data=combined_df, aes(x=FairlyActiveMinutes, y=TotalMinutesAsleep)) +
  geom_point() + 
  geom_smooth(orientation = "x") +
  labs(title = "Fairly Active Minutes vs Total Minutes Asleep")

ggplot(data=combined_df, aes(x=LightlyActiveMinutes, y=TotalMinutesAsleep)) +
  geom_point() + 
  geom_smooth(orientation = "x") +
  labs(title = "Lightly Active Minutes vs Total Minutes Asleep")
```

# Conclusions from Data Analysis and Data Visualization

* The mean (about 77) and median (about 73) heart rate fall within the commonly cited normal resting range of 60 to 100 beats per minute.
* No large difference is found between total time in bed and total time asleep, which is a good sign for people's health, and the average sleeping time is around 7 hours. It can be concluded that most people get sufficient sleep.
* Compared with the normal-weight BMI range of 18.5 - 24.9, the mean BMI of 25.19 lies slightly above the upper limit, so on average participants are at the border between normal weight and overweight. This is based on only 8 participants.
* The device's distance tracking appears consistent, since the tracker distance and the total distance are almost the same.
* The number of people who contributed to the dataset is low, and data for all 30 members has not been collected for every parameter/feature.
* Only 3 people contributed data for all features, which is too few to draw reliable conclusions from.
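
Placing a BMI value in a weight category is a threshold lookup. The sketch below uses the 18.5 and 25 cut-offs that bound the normal range quoted above, plus 30 as the usual obesity threshold; the function names and the example height of 1.70 m are hypothetical.

```{r}
# BMI = weight in kilograms divided by height in metres squared.
toy_bmi <- function(weight_kg, height_m) weight_kg / height_m^2

# Half-open bins: a BMI of exactly 25 counts as "Overweight".
toy_bmi_class <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Normal", "Overweight", "Obese"),
      right = FALSE)
}

toy_bmi(72, 1.70)
as.character(toy_bmi_class(c(17.9, 22.0, 25.19, 31.4)))
```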

# Conclusions for Bellabeat Marketing Analytics Team

* Based on the daily activity correlations, the company can send push notifications at regular intervals reminding users to be active, since more movement (total steps, total distance) is associated with more calories burnt.
* A feature can be developed to set a minimum movement target and monitor it regularly.
* The maximum heart rate is around 200, which is quite high, so the company can develop alerts for abnormal heart rate changes that occur outside very active minutes.
* To help users maintain a total sleep time of about 7 hours, the company can develop a bedtime reminder feature based on each individual's sleeping time.
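
The heart rate alert suggested above could follow a rule like the one sketched here. Everything in it is hypothetical: the function name, the made-up readings and especially the 180 bpm threshold, which is a placeholder and not a medical recommendation.

```{r}
# Hypothetical rule: flag a reading when it exceeds a threshold
# while the user is not in a very active period.
toy_flag_heart_rate <- function(bpm, very_active, threshold = 180) {
  bpm > threshold & !very_active
}

toy_readings <- data.frame(
  bpm         = c(72, 195, 188, 110),
  very_active = c(FALSE, TRUE, FALSE, FALSE)
)
toy_readings$alert <- toy_flag_heart_rate(toy_readings$bpm,
                                          toy_readings$very_active)
toy_readings
```

Only the third reading is flagged: the second one is above the threshold too, but it occurs during very active minutes.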

In terms of additional data, the age and gender of the individuals are not recorded, although they play an important role. The analysis would be stronger if data were collected for all the features from everyone, since only 3 people contribute to the complete dataset.