Commit

Update Bayesian learning
ShengjiaZhao committed Mar 5, 2021
1 parent 8e17f10 commit 2d73ec1
Showing 1 changed file with 42 additions and 40 deletions.
82 changes: 42 additions & 40 deletions learning/bayesian/index.md
@@ -7,15 +7,15 @@ The learning approaches we have discussed so far are based on the principle of m

## Example 1

Let's suppose we are interested in modeling the outcome of a biased coin, $$X = \{heads, tails\}$$. We toss the coin 10 times, observing 6 heads. If $$\theta$$ denotes the probability of observing heads, the maximum likelihood estimate (MLE) is given by,
Let's suppose we are interested in modeling the outcome of a biased coin, $$X \in \{heads, tails\}$$. We toss the coin 10 times, observing 6 heads. If $$\theta$$ denotes the probability of observing heads, the maximum likelihood estimate (MLE) is given by,

$$ \theta_{MLE} = \frac{num\_heads}{num\_heads + num\_tails} = 0.6 $$

Now, suppose we continue tossing the coin such that after a 100 total trials (including the 10 initial trials), we observe 60 heads. Again, we can compute the MLE as,
Now, suppose we continue tossing the coin such that after 100 total trials (including the 10 initial trials), we observe 60 heads. Again, we can compute the MLE as,

$$ \theta_{MLE} = \frac{num\_heads}{num\_heads + num\_tails} = 0.6 $$

In both the above situations, the maximum likelihood estimate does not change as we observe more data. This seems counterintuitive - our _confidence_ in predicting heads with probability 0.6 should be higher in the second setting where we have seen many more trials of the coin! The reason why MLE fails to distinguish the two settings is due to an implicit assumption we have been making all along. MLE assumes that the only source of uncertainty is due to the variables, $$X$$ and the quantification of this uncertainty is based on a fixed parameter $$\theta_{MLE}$$.
In both the above situations, the maximum likelihood estimate does not change as we observe more data. This seems counterintuitive - our _confidence_ in predicting heads with probability 0.6 should be higher in the second setting where we have seen many more trials of the coin! The key problem is that we represent our belief about the probability of heads $$\theta$$ as a single number $$\theta_{MLE}$$, so there is no way to represent whether we are more or less sure about $$\theta$$.
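As a quick check of the arithmetic above, here is a minimal Python sketch. The function name `bernoulli_mle` is an illustrative choice, not part of the notes. It shows that the MLE is the same number for both datasets, so the estimate alone cannot tell them apart.

```python
def bernoulli_mle(num_heads, num_tails):
    """Maximum likelihood estimate of P(heads) from the observed counts."""
    return num_heads / (num_heads + num_tails)

# 10 tosses with 6 heads, then 100 tosses with 60 heads.
theta_10 = bernoulli_mle(6, 4)
theta_100 = bernoulli_mle(60, 40)

# Both estimates are the same single number; nothing records
# that the second one is backed by ten times more data.
print(theta_10, theta_100)
```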

## Example 2

@@ -24,93 +24,95 @@ Consider a language model for sentences based on the bag-of-words assumption. In
For simplicity, assume that our language corpus consists of the two sentences "Probabilistic graphical models are fun. They are also powerful." We can estimate the probability of each individual word from its count. Our corpus contains 9 words: "are" appears twice and every other word appears once, so "are" is assigned a probability of $$2/9$$ and each of the other words a probability of $$1/9$$. Now, while testing the generalization of our model to the English language, we observe another sentence, "Probabilistic graphical models are hard." The probability of the sentence under our model is
$$\frac{1}{9} \times \frac{1}{9} \times \frac{1}{9} \times \frac{2}{9} \times 0 = 0$$. We did not observe one of the words ("hard") during training, which made our language model infer the sentence as impossible, even though it is a perfectly plausible sentence.

Out-of-vocabulary words are a common phenomenon even for language models trained on large corpora. One of the simplest ways to handle these words is to assign a prior probability of observing an out-of-vocabulary word such that the model will assign a low, but non-zero probability to test sentences containing such words. This mechanism of incorporating prior knowledge is a practical application of Bayesian learning, which we present next.
Out-of-vocabulary words are a common phenomenon even for language models trained on large corpora. One of the simplest ways to handle these words is to assign a prior probability of observing an out-of-vocabulary word such that the model will assign a low, but non-zero probability to test sentences containing such words.
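The zero-probability problem can be reproduced in a few lines of Python. The sketch below is only an illustration. The tokenizer is a deliberately crude assumption, and add-one (Laplace) smoothing with a single extra "unseen word" slot is one common way to give out-of-vocabulary words non-zero mass; the notes do not prescribe this particular scheme.

```python
from collections import Counter

def tokenize(text):
    """Crude tokenizer (an assumption): lowercase, drop periods, split on spaces."""
    return text.lower().replace(".", "").split()

train = tokenize("Probabilistic graphical models are fun. They are also powerful.")
test = tokenize("Probabilistic graphical models are hard.")

counts = Counter(train)
total = sum(counts.values())

def mle_prob(word):
    """Relative frequency; unseen words get probability exactly 0."""
    return counts[word] / total

def smoothed_prob(word, pseudo_count=1.0):
    """Add-one smoothing: every word, plus one slot for unseen words,
    receives `pseudo_count` imaginary observations."""
    vocab_size = len(counts) + 1  # +1 for the out-of-vocabulary slot
    return (counts[word] + pseudo_count) / (total + pseudo_count * vocab_size)

def sentence_prob(words, word_prob):
    """Bag-of-words probability: product of per-word probabilities."""
    prob = 1.0
    for word in words:
        prob *= word_prob(word)
    return prob

p_mle = sentence_prob(test, mle_prob)            # exactly 0 because of "hard"
p_smooth = sentence_prob(test, smoothed_prob)    # small but non-zero
print(p_mle, p_smooth)
```

The pseudo-counts play the role of prior observations: they matter a lot for a tiny corpus like this one and would matter less as the real counts grow.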

## Setup

In contrast to maximum likelihood learning, Bayesian learning explicitly models uncertainty over both the variables, $$X$$ and the parameters, $$\theta$$. In other words, the model parameters $$\theta$$ are random variables as well.
In contrast to maximum likelihood learning, Bayesian learning explicitly models uncertainty over both the observed variables $$X$$ and the parameters $$\theta$$. In other words, the parameters $$\theta$$ are random variables as well.

A _prior_ distribution over the parameters, $$p(\theta)$$ encodes our initial beliefs. These beliefs are subjective. For example, we can choose the prior over $$\theta$$ for a biased coin to be uniform between 0 and 1. If however we expect the coin to be fair, the prior distribution can be peaked around $$\theta = 0.5$$. We will discuss commonly used priors later in this chapter.

Observing data $$D$$ in the form of evidence allows us to update our beliefs using Bayes' rule,
If we observe the dataset $$\mathcal{D} = \lbrace X_1, \cdots, X_N \rbrace$$ (in the coin toss example, each $$X_i$$ is the outcome of one toss of the coin), we can update our beliefs using Bayes' rule,

$$
p(\theta \mid D) = \frac{p(D \mid \theta) \, p(\theta)}{p(D)} \propto p(D \mid \theta) \, p(\theta)
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \theta) \, p(\theta)
$$

$$
posterior \propto likelihood \times prior
$$

Hence, Bayesian learning provides a principled mechanism for incorporating prior knowledge into our model. This prior knowledge is useful in many situations such as when we want to provide uncertainty estimates about the model parameters (Example 1) or when the data available for learning a model is limited (Example 2).
Hence, Bayesian learning provides a principled mechanism for incorporating prior knowledge into our model. Bayesian learning is useful in many situations such as when we want to provide uncertainty estimates about the model parameters (Example 1) or when the data available for learning a model is limited (Example 2).
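To make "posterior $$\propto$$ likelihood $$\times$$ prior" concrete for the coin of Example 1, the following Python sketch evaluates the product on a grid of $$\theta$$ values and normalizes it. The grid approximation, the uniform prior, and all function names are assumptions chosen for illustration; this is not a method the notes rely on.

```python
def grid_posterior(num_heads, num_tails, prior, grid_size=1001):
    """Approximate p(theta | D) on an evenly spaced grid over [0, 1]."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # likelihood x prior, evaluated point by point
    unnormalized = [prior(t) * t**num_heads * (1 - t)**num_tails for t in thetas]
    normalizer = sum(unnormalized)  # stands in for p(D) on the grid
    return thetas, [u / normalizer for u in unnormalized]

def mean_and_std(thetas, weights):
    """Mean and standard deviation of a discrete distribution on the grid."""
    mean = sum(t * w for t, w in zip(thetas, weights))
    variance = sum(w * (t - mean) ** 2 for t, w in zip(thetas, weights))
    return mean, variance ** 0.5

def uniform_prior(theta):
    """Uniform prior on [0, 1]."""
    return 1.0

mean_10, std_10 = mean_and_std(*grid_posterior(6, 4, uniform_prior))
mean_100, std_100 = mean_and_std(*grid_posterior(60, 40, uniform_prior))

# Both posteriors are centred near 0.6, but the one based on
# 100 tosses is much narrower.
print(mean_10, std_10)
print(mean_100, std_100)
```

Unlike the MLE, the posterior distinguishes the two settings of Example 1 through its spread.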



## Conjugate Priors
When calculating the posterior distribution using Bayes' rule, as above, it should be pretty straightforward to calculate the numerator. But to calculate the denominator $$P(D)$$, we are required to compute an integral. This might cause us trouble, since for an arbitrary distribution, computing the integral is likely to be intractable.

To tackle this issue, we use a conjugate prior. A parametric family $$\varphi$$ is conjugate for the likelihood $$P(D \mid \theta)$$ if:

When calculating the posterior distribution using Bayes' rule, as above, it should be pretty straightforward to calculate the numerator. But to calculate the denominator $$p(\mathcal{D})$$, we are required to compute an integral
$$
P(\theta) \in \varphi \Longrightarrow P(\theta \mid D) \in \varphi
p(\mathcal{D}) = \int_\theta p(\mathcal{D} \mid \theta)p(\theta)d\theta
$$
This might cause us trouble, since integration is usually difficult. For this very simple example, we might be able to compute this integral, but as you may have seen many times in this class, if $$\theta$$ is high dimensional then computing integrals could be quite challenging.
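For the one-dimensional coin example the integral really is easy, which the sketch below illustrates. It assumes a uniform prior $$p(\theta) = 1$$ on $$[0, 1]$$ and a dataset with a fixed order of tosses, so that $$p(\mathcal{D} \mid \theta) = \theta^{N_H}(1-\theta)^{N_T}$$. The closed form $$N_H!\,N_T!/(N_H+N_T+1)!$$ is a standard identity for this integral that the notes do not derive, and the function names are illustrative.

```python
import math

def evidence_numeric(num_heads, num_tails, steps=10_000):
    """Midpoint-rule approximation of the integral of
    theta^H * (1 - theta)^T over [0, 1] (uniform prior assumed)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width
        total += theta**num_heads * (1 - theta)**num_tails
    return total * width

def evidence_exact(num_heads, num_tails):
    """Closed form of the same integral: H! T! / (H + T + 1)!"""
    return (math.factorial(num_heads) * math.factorial(num_tails)
            / math.factorial(num_heads + num_tails + 1))

numeric = evidence_numeric(6, 4)
exact = evidence_exact(6, 4)
print(numeric, exact)
```

A one-dimensional grid with ten thousand points is cheap. The same approach with ten thousand points per dimension becomes infeasible once $$\theta$$ has more than a handful of dimensions, which is the difficulty described above.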

This is convenient because if we know the normalizing constant of $$\varphi$$, then we get the denominator in Bayes' rule "for free". Thus it essentially reduces the computation of the posterior from a tricky numerical integral to some simple algebra.

To see conjugate prior in action, let's consider an example. Suppose we are given a sequence of $$N$$ coin tosses, $$D = \{X_{1},\ldots,X_{N}\}$$. We want to infer the probability of getting heads which we denote by $$\theta$$. Now, we can model this as a sequence of Bernoulli trials with parameter $$\theta$$. A natural conjugate prior in this case is the beta distribution with

To tackle this issue, people have observed that for some choices of prior $$p(\theta)$$, the posterior distribution $$p(\theta \mid \mathcal{D})$$ can be computed directly in closed form. Let us go back to our coin toss example, where we are given a sequence of $$N$$ coin tosses, $$\mathcal{D} = \{X_{1},\ldots,X_{N}\}$$, and we want to infer the probability of getting heads $$\theta$$ using Bayes' rule. Suppose we choose the prior $$p(\theta)$$ as the Beta distribution defined by
$$
P(\theta) = Beta(\theta \mid \alpha_H, \alpha_T) = \frac{\theta^{\alpha_H -1 }(1-\theta)^{\alpha_T -1 }}{B(\alpha_H,\alpha_T)}
$$

where the normalization constant $$B(\cdot)$$ is the beta function. Here $$\alpha = (\alpha_H,\alpha_T)$$ are called the hyperparameters of the prior. The expected value of $$\theta$$ is $$\frac{\alpha_H}{\alpha_H+\alpha_T}$$. Here the sum of the hyperparameters $$(\alpha_H+\alpha_T)$$ can be interpreted as a measure of confidence in the expectations they lead to. Intuitively, we can think of $$\alpha_H$$ as the number of heads we have observed before the current dataset.
where $$\alpha_H$$ and $$\alpha_T$$ are the two parameters that determine the shape of the distribution (similar to how the mean and variance determine a Gaussian distribution), and $$B(\alpha_H, \alpha_T)$$ is some normalization constant that ensures $$\int p(\theta)d\theta=1$$. We will go into more details about the Beta distribution later. What matters here is that the Beta distribution has a very special property: the posterior $$p(\theta \mid \mathcal{D})$$ is always another Beta distribution (but with different parameters). More concretely, out of $$N$$ coin tosses, if the number of heads and the number of tails are $$N_H$$ and $$N_T$$ respectively, then it can be shown that the posterior is:
$$
P(\theta \mid \mathcal{D}) = Beta(\theta \mid \alpha_H+N_H,\alpha_T+N_T) = \frac{\theta^{N_H+ \alpha_H -1 }(1-\theta)^{ N_T+ \alpha_T -1 }}{B(N_H+ \alpha_H,N_T+ \alpha_T)}
$$

which is another Beta distribution with parameters $$(\alpha_H+N_H, \alpha_T+N_T)$$. In other words, if the prior is a Beta distribution (we can represent it as two numbers $$\alpha_H,\alpha_T$$) then the posterior can be immediately computed by a simple addition $$\alpha_H+N_H, \alpha_T+N_T$$. There is no need to compute the complex integral $$p(\mathcal{D})$$.
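The update rule is simple enough to write as a one-line function. The Python sketch below is illustrative only: the function name and the choice of a $$Beta(1, 1)$$ prior are assumptions. It also checks that feeding the data in two batches, using the first posterior as the prior for the second batch, gives the same answer as a single update.

```python
def beta_posterior(alpha_h, alpha_t, num_heads, num_tails):
    """Conjugate update: add the observed counts to the prior parameters."""
    return alpha_h + num_heads, alpha_t + num_tails

prior = (1.0, 1.0)  # assumed prior; Beta(1, 1) is uniform on [0, 1]

# All 100 tosses at once (60 heads, 40 tails).
post_all = beta_posterior(*prior, 60, 40)

# The same data in two batches: the first 10 tosses, then the other 90.
post_first = beta_posterior(*prior, 6, 4)
post_both = beta_posterior(*post_first, 54, 36)

print(post_first)  # parameters after the first 10 tosses
print(post_all, post_both)
```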



{% include marginfigure.html id="beta" url="assets/img/beta.png" description="Here the exponents $$(3,2)$$ and $$(30,20)$$ can both be used to encode the belief that $$\theta$$ is $$0.6.$$ But the second set of exponents imply a stronger belief as they are based on a larger sample." %}

Out of $$N$$ coin tosses, if the number of heads and the number of tails are $$N_H$$
and $$N_T$$ respectively, then it can be shown that the posterior is:
Now we try to understand the Beta distribution better. If $$\theta$$ has distribution $$Beta(\theta \mid \alpha_H, \alpha_T)$$, then the expected value of $$\theta$$ is $$\frac{\alpha_H}{\alpha_H+\alpha_T}$$. Intuitively, $$\alpha_H$$ is larger than $$\alpha_T$$ if we believe that heads are more likely. The variance of the Beta distribution is the somewhat complex expression $$\frac{\alpha_H\alpha_T}{(\alpha_H+\alpha_T)^2(\alpha_H+\alpha_T+1)}$$, but we remark that (very roughly) the numerator is quadratic in $$\alpha_H,\alpha_T$$ while the denominator is cubic in $$\alpha_H,\alpha_T$$. Hence if $$\alpha_H$$ and $$\alpha_T$$ are bigger, the variance is smaller, so we are more certain about the value of $$\theta$$. We can use this observation to better understand the above posterior update rule: after observing more data $$\mathcal{D}$$, the prior parameters $$\alpha_H$$ and $$\alpha_T$$ increase by $$N_H$$ and $$N_T$$ respectively. Thus, the variance of $$p(\theta \mid \mathcal{D})$$ should be smaller than that of $$p(\theta)$$, i.e. we are more certain about the value of $$\theta$$ after observing data $$\mathcal{D}$$.
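The mean and variance formulas can be evaluated directly. The short Python sketch below uses illustrative function names and the parameter pairs $$(3, 2)$$ and $$(30, 20)$$ from the figure caption. Both pairs give the same mean, while the larger pair gives a much smaller variance.

```python
def beta_mean(alpha_h, alpha_t):
    """Expected value of theta under Beta(alpha_h, alpha_t)."""
    return alpha_h / (alpha_h + alpha_t)

def beta_variance(alpha_h, alpha_t):
    """Variance of theta under Beta(alpha_h, alpha_t)."""
    total = alpha_h + alpha_t
    return alpha_h * alpha_t / (total**2 * (total + 1))

for params in [(3, 2), (30, 20)]:
    print(params, beta_mean(*params), beta_variance(*params))
```

Scaling both parameters by ten leaves the mean at 0.6 but shrinks the variance by roughly a factor of 8.5, which matches the "quadratic over cubic" intuition.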


$$
P(\theta \mid N_H, N_T) = \frac{\theta^{N_H+ \alpha_H -1 }(1-\theta)^{ N_T+ \alpha_T -1 }}{B(N_H+ \alpha_H,N_T+ \alpha_T)}
$$

which is another Beta distribution with parameters $$(N_H + \alpha_H, N_T + \alpha_T)$$. We can use this posterior distribution as the prior for more samples with the hyperparameters simply adding each extra piece of information as it comes from additional coin tosses.
The idea we presented here is usually called "conjugacy". Using standard terminology, what we have shown here is that the Beta distribution family is a "conjugate prior" to the Bernoulli distribution family. When people say that distribution family A is a conjugate prior to distribution family B, they mean that if $$p(\theta)$$ belongs to distribution family A, and $$p(X \mid \theta)$$ belongs to distribution family B, then given a dataset of samples $$\mathcal{D} = (X_1, \cdots, X_N)$$ the posterior $$p(\theta \mid \mathcal{D})$$ is still in distribution family A. Relating this back to the example we have above, if $$p(\theta)$$ is a Beta distribution, and $$p(X \mid \theta)$$ is a Bernoulli distribution, then $$p(\theta \mid \mathcal{D})$$ is still a Beta distribution. In general we usually have a simple algebraic expression to compute $$p(\theta \mid \mathcal{D})$$ (such as computing $$\alpha_H+N_H, \alpha_T+N_T$$ in the example above).

### Categorical Generalization

We can now extend the binary model to its categorical generalization. Instead of being limited to binary outcomes, we can now consider the categorical dataset of a $$K$$-sided die rolled $$N$$ times. Let $$\mathcal{D} = \{ X_1 = k_1, \ldots, X_N = k_N \}$$, where $$X_j \in \{ 1, \ldots, K \}$$ for the $$j$$th outcome. The parameterization of the model is $$\theta = (P(X_j = 1), \ldots, P(X_j = K))$$, which denotes the probability of each outcome, and where $$\sum_{k = 1}^K P(X_j = k) = 1$$.

The likelihood of observing our dataset given a specific parameterization is
### Categorical Distribution

We give another example of a conjugate prior which generalizes the Bernoulli example above. Instead of being limited to binary outcomes, we can now consider the categorical distribution (think of a $$K$$-sided die). Let $$\mathcal{D} = \{ X_1, \ldots, X_N \}$$ be $$N$$ rolls of the die, where $$X_j \in \{ 1, \ldots, K \}$$ is the outcome of the $$j$$-th roll. The parameter of the categorical distribution is denoted by $$\theta$$
$$
P(\mathcal{D} \mid \theta) = \prod_{k=1}^K P(X_j = k)^{\sum_{j=1}^N 1\{ X_j = k \}}
\theta =(\theta_1, \cdots, \theta_K) := (P(X_j = 1), \ldots, P(X_j = K))
$$
where $$\sum_{k = 1}^K \theta_k = 1$$.

In the same manner as with the binary model, the conjugate prior for this categorical model is the Dirichlet distribution, which has hyperparameters $$\mathbf{\alpha} = (\alpha_1, \ldots, \alpha_K)$$, indicating the number of observations of each outcome. Letting $$\alpha$$ be the "virtual" counts of the $$K$$ outcomes before we observe the dataset, our prior is
We claim that the Dirichlet distribution is the conjugate prior for the categorical distribution. A Dirichlet distribution is defined by $$K$$ parameters $$\mathbf{\alpha} = (\alpha_1, \ldots, \alpha_K)$$, and its PDF is given by

$$
P(\theta) = \textsf{Dirichlet}(\theta \mid \mathbf{\alpha}) = \frac{1}{B(\alpha)} \prod_{k=1}^K P(X_j = k)^{\alpha_k - 1}
P(\theta) = \textsf{Dirichlet}(\theta \mid \mathbf{\alpha}) = \frac{1}{B(\alpha)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}
$$

where $$B(\cdot)$$ is still a normalization factor.
where $$B(\alpha)$$ is still a normalization constant.

Because we use a Dirichlet prior, the posterior is also a Dirichlet distribution, and is formulated as follows:
To show that the Dirichlet distribution is the conjugate prior for the categorical distribution, we need to show that the posterior is also a Dirichlet distribution. To calculate the posterior $$p(\theta \mid \mathcal{D})$$ with Bayes' rule we first calculate the likelihood $$p(\mathcal{D} \mid \theta)$$ as

$$
P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta) \propto \prod_{k=1}^K P(X_j = k)^{\sum_{j=1}^N 1\{ X_j = k \} + \alpha_k - 1}
P(\mathcal{D} \mid \theta) = \prod_{k=1}^K \theta_k^{\sum_{j=1}^N 1\{ X_j = k \}}
$$

We can see that this is equivalent to a Dirichlet distribution with updated counts $$\alpha'$$. Specifically,

To simplify the notation we denote $$N_k = \sum_{j=1}^N 1\lbrace X_j=k\rbrace$$ as the number of times we roll $$k$$, so $$p(\mathcal{D}\mid\theta)=\prod_{k=1}^K\theta_k^{N_k}$$. Using this new notation the posterior can be calculated as
$$
P(\theta \mid \mathcal{D}) \propto \mathsf{Dirichlet}(\theta \mid \alpha')
P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta) \propto \prod_{k=1}^K \theta_k^{N_k + \alpha_k - 1} \propto \textsf{Dirichlet}(\theta \mid \alpha_1+N_1,\cdots,\alpha_K+N_K)
$$

where the updated count $$\alpha'$$ is given by
In other words, if the prior is a Dirichlet distribution with parameters $$(\alpha_1, \cdots, \alpha_K)$$ then the posterior $$p(\theta \mid \mathcal{D})$$ is a Dirichlet distribution with parameters $$(\alpha_1+N_1, \cdots, \alpha_K+N_K)$$.
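As with the Beta case, the Dirichlet update amounts to adding counts. The Python sketch below is a minimal illustration. The rolls are made-up data, the all-ones prior is an assumption, and the posterior mean $$\alpha_k / \sum_{k'} \alpha_{k'}$$ is a standard property of the Dirichlet distribution that the notes do not state.

```python
from collections import Counter

def dirichlet_posterior(alpha, rolls):
    """Conjugate update: alpha_k + N_k for each face k = 1, ..., K."""
    counts = Counter(rolls)  # missing faces count as 0
    return [a + counts[k] for k, a in enumerate(alpha, start=1)]

def dirichlet_mean(alpha):
    """Expected value of each theta_k under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

alpha_prior = [1.0] * 6            # assumed prior for a 6-sided die
rolls = [3, 1, 6, 3, 3, 2, 6, 3]   # hypothetical observations

alpha_post = dirichlet_posterior(alpha_prior, rolls)
print(alpha_post)
print(dirichlet_mean(alpha_post))
```

Faces 4 and 5 were never rolled, yet their posterior mean is non-zero because the prior contributes one pseudo-count each. This is the same effect that rescues the out-of-vocabulary word in Example 2.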

### Some Concluding Remarks

Many distributions have conjugate priors. In fact, any exponential family distribution has a conjugate prior. Even though conjugacy seemingly solves the problem of computing Bayesian posteriors, there are two caveats: 1. Practitioners will usually want to choose the prior $$p(\theta)$$ that best captures their knowledge about the problem, and restricting the choice to conjugate priors is a strong limitation. 2. For more complex distributions, the posterior computation is not as easy as in our examples. There are distributions for which the posterior computation is still NP-hard.

Conjugate priors are a powerful tool used in many real-world applications such as topic modeling (e.g. latent Dirichlet allocation) and medical diagnosis. However, practitioners should be mindful of their shortcomings and compare them with other tools such as MCMC or variational inference (also covered in these lecture notes).


$$
\alpha'_k = \underbrace{\sum_{j=1}^N 1\{ X_j = k \}}_\text{observed data count} + \underbrace{\alpha_k}_\text{prior virtual count}
$$

<br/>
