Improve structure learning notes #207

Merged 2 commits on Jul 27, 2021
2 changes: 1 addition & 1 deletion _layouts/post.html
@@ -1,7 +1,7 @@
---
layout: default
---
<h1>{{ page.title | capitalize }}</h1>
<h1>{{ page.title }}</h1>
<p class="subtitle">{{ page.date | date: "%B %-d, %Y" }}</p>


56 changes: 28 additions & 28 deletions learning/structure/index.md
@@ -1,22 +1,21 @@
---
layout: post
title: Structure Learning for Bayesian Networks
title: Structure learning for Bayesian networks
---
## Structure learning for Bayesian networks

The task of structure learning for Bayesian networks refers to learn the structure of the directed acyclic graph (DAG) from data. There are two major approaches for the structure learning: score-based approach and constraint-based approach .
The task of structure learning for Bayesian networks refers to learning the structure of the directed acyclic graph (DAG) from data. There are two major approaches for structure learning: score-based and constraint-based.

### Score-based approach

The score-based approach first defines a criterion to evaluate how well the Bayesian network fits the data, then searches over the space of DAGs for a structure with maximal score. In this way, the score-based approach is essentially a search problem and consists of two parts: the definition of score metric and the search algorithm.
The score-based approach first defines a criterion to evaluate how well the Bayesian network fits the data, then searches over the space of DAGs for a structure achieving the maximal score. The score-based approach is essentially a search problem that consists of two parts: the definition of a score metric and the search algorithm.

### Score metrics

The score metrics for a structure $$\mathcal{G}$$ and data $$D$$ can be generally defined as:

$$ Score(G:D)= LL(G:D) - \phi(|D|) \|G\|. $$
$$ Score(G:D) = LL(G:D) - \phi(|D|) \|G\|. $$

Here $$LL(G:D)$$ refers to the log-likelihood of the data under the graph structure $$\mathcal{G}.$$ The parameters in Bayesian network $$G$$ are estimated based on MLE and the log-likelihood score is calculated based on the estimated parameters. If we consider only the log-likelihood in the score function, we will end up with an overfitting structure (namely, a complete graph.) That is why we have the second term in the scoring function. $$\lvert D \rvert$$ is the number of sample and $$\|G\|$$ is the number of parameters in the graph $$\mathcal{G}$$. With this extra term, we will penalize the over-complicated graph structure and avoid overfitting. For AIC the function $$\phi(t) = 1, $$ while for BIC $$\phi(t) = \log(t)/2.$$ It is important to note that in BIC, the influence of model complexity will decrease as M grows, allowing the log-likelihood term to eventually dominate the score.
Here $$LL(G:D)$$ refers to the log-likelihood of the data under the graph structure $$\mathcal{G}$$. The parameters in the Bayesian network $$G$$ are estimated by MLE, and the log-likelihood score is calculated using the estimated parameters. If the score function consisted only of the log-likelihood term, the optimal graph would be a complete graph, which would likely overfit the data. Instead, the second term $$\phi(|D|) \|G\|$$ in the scoring function serves as a regularization term, favoring simpler models. $$\lvert D \rvert$$ is the number of data samples, and $$\|G\|$$ is the number of parameters in the graph $$\mathcal{G}$$. When $$\phi(t) = 1$$, the score function is known as the Akaike Information Criterion (AIC). When $$\phi(t) = \log(t)/2$$, the score function is known as the Bayesian Information Criterion (BIC). With the BIC, the influence of model complexity decreases as $$\lvert D \rvert$$ grows, allowing the log-likelihood term to eventually dominate the score.
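As a concrete illustration, here is a minimal Python sketch of the BIC score for a candidate DAG over discrete variables. It assumes the samples sit in a pandas DataFrame with one column per variable and that the structure is passed as a `parents` dictionary; the function name and layout are illustrative only.

```python
import numpy as np
import pandas as pd

def bic_score(data: pd.DataFrame, parents: dict) -> float:
    """Score(G:D) = LL(G:D) - (log|D|/2) * ||G|| for discrete data.

    `data` has one column per variable; `parents` maps each variable name
    to the list of its parent names (the candidate DAG).  Sketch only.
    """
    n = len(data)
    score = 0.0
    for var, pa in parents.items():
        k = data[var].nunique()                                          # states of this variable
        q = int(np.prod([data[p].nunique() for p in pa])) if pa else 1   # parent configurations
        # Log-likelihood under MLE parameters: sum_j N_j * log(N_j / N_parent_config)
        groups = data.groupby(list(pa))[var] if pa else [(None, data[var])]
        for _, column in groups:
            counts = column.value_counts().to_numpy(dtype=float)
            score += float(np.sum(counts * np.log(counts / counts.sum())))
        # Penalty phi(|D|) * ||G||, with phi(t) = log(t)/2 (BIC); use phi(t) = 1 for AIC
        score -= 0.5 * np.log(n) * q * (k - 1)
    return score

# Hypothetical usage for a three-variable DAG A -> B, A -> C over a DataFrame df:
# bic_score(df, {"A": [], "B": ["A"], "C": ["A"]})
```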

There is another family of Bayesian score functions called the BD (Bayesian Dirichlet) score. The BD score first defines the probability of the data $$D$$ conditional on the graph structure $$\mathcal{G}$$ as

@@ -27,7 +26,8 @@ $$
where $$P(D \mid \mathcal{G}, \Theta_{\mathcal{G}})$$ is the probability of the data given the network structure and parameters, and $$P(\Theta_{\mathcal{G}} \mid \mathcal{G})$$ is the prior probability of the parameters. When the prior probability is specified as a Dirichlet distribution,

$$
P(D|\Theta_{\mathcal{G}}) = \prod_{i} \prod_{\pi_i} \left[ \frac{\Gamma(\sum_j N'_{i,\pi_i,j})}{\Gamma(\sum_j N'_{i,\pi_i,j} + N_{i,\pi_i,j} )} \prod_{j}\frac{\Gamma(N'_{i,\pi_i,j} + N_{i,\pi_i,j})}{\Gamma(N'_{i,\pi_i,j})}\right].
P(D \mid \mathcal{G})
= \prod_i \prod_{\pi_i} \left[ \frac{\Gamma(\sum_j N'_{i,\pi_i,j})}{\Gamma(\sum_j N'_{i,\pi_i,j} + N_{i,\pi_i,j} )} \prod_{j}\frac{\Gamma(N'_{i,\pi_i,j} + N_{i,\pi_i,j})}{\Gamma(N'_{i,\pi_i,j})}\right].
$$

Here $$\pi_i$$ refers to a parent configuration of the variable $$i$$, and $$N_{i,\pi_i,j}$$ is the count of variable $$i$$ taking value $$j$$ with parent configuration $$\pi_i$$. $$N'_{i,\pi_i,j}$$ is the corresponding pseudo-count from the Dirichlet prior.
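To make the bookkeeping concrete, the bracketed product can be evaluated per variable from a table of counts and pseudo-counts. Below is a small sketch using scipy's `gammaln` to work in log-space; the array layout, with one row per parent configuration $$\pi_i$$ and one column per value $$j$$, is an assumption made for illustration.

```python
import numpy as np
from scipy.special import gammaln

def log_bd_term(counts: np.ndarray, pseudo: np.ndarray) -> float:
    """Log of the BD product for a single variable i.

    counts[pi, j] = N_{i,pi,j}  (observed counts)
    pseudo[pi, j] = N'_{i,pi,j} (Dirichlet pseudo-counts)
    """
    # log Gamma(sum_j N') - log Gamma(sum_j (N' + N)) for each parent configuration
    per_config = gammaln(pseudo.sum(axis=1)) - gammaln((pseudo + counts).sum(axis=1))
    # sum_j [ log Gamma(N' + N) - log Gamma(N') ] for each parent configuration
    per_value = (gammaln(pseudo + counts) - gammaln(pseudo)).sum(axis=1)
    return float((per_config + per_value).sum())

# The log BD score of a graph sums this term over all variables and adds log P(G).
```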
@@ -40,51 +40,51 @@ Notice there is no penalty term appending to the BD score due to that it will pe

### Chow-Liu Algorithm

The Chow-Liu Algorithm is a specific type of score based approach. The Chow-Liu algorithm finds the maximum-likelihood tree structure where each node has at most one parent. Note that here our score is simply the maximum likelihood, we do not need to penalize the complexity since we are already limiting complexity by restricting ourselves to tree structures.
The Chow-Liu Algorithm is a specific type of score-based approach which finds the maximum-likelihood tree-structured graph (i.e., each node has exactly one parent, except for a parentless root node). The score is simply the log-likelihood; there is no penalty term for graph structure complexity since the algorithm only considers tree structures.

The algorithm has three steps:

1) Compute the mutual information for all pairs of variables $$X,U$$, and form the mutual information graph where the edge between variables $$X,U$$ has weight $$MI(X,U)$$:
1. Compute the mutual information for all pairs of variables $$X,U$$, and form a complete graph from the variables where the edge between variables $$X,U$$ has weight $$MI(X,U)$$:

$$
MI(X,U) =\sum_{x,u} \hat p(x,u)\log\left[\frac{\hat{p} (x,u)}{\hat p(x) \hat p(u)}\right]
$$
$$
MI(X,U) =\sum_{x,u} \hat p(x,u)\log\left[\frac{\hat p(x,u)}{\hat p(x) \hat p(u)}\right]
$$

This function measures how much information $$U$$ provides about $$X$$. The graph with computed MI edge weights might resemble:
{% include maincolumn_img.html src='assets/img/mi-graph.png' %}
Remember that from our empirical distribution $$\hat p(x,u) = \frac{Count(x,u)}{\# \text{ data points}}$$.
This function measures how much information $$U$$ provides about $$X$$. The graph with computed MI edge weights might resemble:

{% include maincolumn_img.html src='assets/img/mi-graph.png' %}

Remember that from our empirical distribution $$\hat p(x,u) = \frac{Count(x,u)}{\# \text{ data points}}$$.

2) Find the **maximum** weight spanning tree: the maximal-weight tree that connects all vertices in a graph. This can be found using Kruskal or Prim Algorithms.
2. Find the **maximum** weight spanning tree: the maximal-weight tree that connects all vertices in a graph. This can be found using Kruskal's or Prim's algorithm.

{% include maincolumn_img.html src='assets/img/max-spanning-tree.png' %}
{% include maincolumn_img.html src='assets/img/max-spanning-tree.png' %}

3) Pick any node to be the *root variable*, and assign directions radiating outward from this node (arrows go away from it). This step transforms the resulting undirected tree to a directed one.
3. Pick any node to be the *root variable*, and assign directions radiating outward from this node (arrows go away from it). This step transforms the resulting undirected tree to a directed one.

{% include maincolumn_img.html src='assets/img/chow-liu-tree.png' %}
{% include maincolumn_img.html src='assets/img/chow-liu-tree.png' %}

The Chow-Liu Algorithm has a complexity of order $$n^2$$, as it takes $$O(n^2)$$ to compute mutual information for all pairs, and $$O(n^2)$$ to compute the maximum spanning tree.
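Putting the three steps together, here is a compact Python sketch. It assumes discrete samples in a pandas DataFrame and uses networkx for the maximum spanning tree; the function names and the choice of root are illustrative.

```python
import numpy as np
import pandas as pd
import networkx as nx

def mutual_information(data: pd.DataFrame, x: str, u: str) -> float:
    """Empirical MI(X, U) computed from the joint count table."""
    joint = pd.crosstab(data[x], data[u]).to_numpy(dtype=float)
    p_xu = joint / joint.sum()                 # hat p(x, u)
    p_x = p_xu.sum(axis=1, keepdims=True)      # hat p(x)
    p_u = p_xu.sum(axis=0, keepdims=True)      # hat p(u)
    nz = p_xu > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(p_xu[nz] * np.log(p_xu[nz] / (p_x @ p_u)[nz])))

def chow_liu(data: pd.DataFrame, root: str) -> nx.DiGraph:
    # Step 1: complete graph with mutual-information edge weights
    g = nx.Graph()
    cols = list(data.columns)
    for i, x in enumerate(cols):
        for u in cols[i + 1:]:
            g.add_edge(x, u, weight=mutual_information(data, x, u))
    # Step 2: maximum-weight spanning tree
    tree = nx.maximum_spanning_tree(g)
    # Step 3: orient edges away from the chosen root
    return nx.DiGraph(list(nx.bfs_edges(tree, root)))

# Hypothetical usage: dag = chow_liu(df, root=df.columns[0])
```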

Now that we have described the algorithm, lets explain why this works. It turns out that the likelihood score decomposes into mutual information and entropy terms:
Having described the algorithm, let's explain why this works. It turns out that the likelihood score decomposes into mutual information and entropy terms:

$$
\log p(\mathcal D\mid \theta^{ML},G) = |\mathcal D| \sum_i MI_{\hat p}(X_i,X_{pa(i)}) - |\mathcal D| \sum_i H_{\hat p}(X_i)
\log p(\mathcal D \mid \theta^{ML}, G) = |\mathcal D| \sum_i MI_{\hat p}(X_i, X_{pa(i)}) - |\mathcal D| \sum_i H_{\hat p}(X_i).
$$
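To see where this comes from, note that with MLE parameters each factor is the empirical conditional $$\hat p(x_i \mid x_{pa(i)})$$, and each configuration $$(x_i, x_{pa(i)})$$ appears $$|\mathcal D|\, \hat p(x_i, x_{pa(i)})$$ times in the data, so

$$
\log p(\mathcal D \mid \theta^{ML}, G)
= |\mathcal D| \sum_i \sum_{x_i, x_{pa(i)}} \hat p(x_i, x_{pa(i)})
\left[ \log \frac{\hat p(x_i, x_{pa(i)})}{\hat p(x_i)\, \hat p(x_{pa(i)})} + \log \hat p(x_i) \right],
$$

where the first term inside the bracket sums to $$MI_{\hat p}(X_i, X_{pa(i)})$$ and the second sums to $$-H_{\hat p}(X_i)$$.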

We would like to find a graph $$G$$ that maximizes this log-likelihood. Since the entropies are independent of the dependency ordering in the tree, the only terms that change with choice of $$G$$ are the mutual information terms. So we want
We would like to find a graph $$G$$ that maximizes this log-likelihood. Since the entropies are independent of the dependency ordering in the tree, the only terms that change with the choice of $$G$$ are the mutual information terms. So we want

$$
\arg\max_G \log P(\mathcal D\mid \theta^{ML}(G),G) = \arg\max_G\sum_i MI(X_i,X_{pa(i)})
\arg\max_G \log P(\mathcal D \mid \theta^{ML}, G) = \arg\max_G \sum_i MI(X_i, X_{pa(i)}).
$$

Now if we assume $$G = (V,E)$$ is a tree where each node has at most one parent, we get
Now if we assume $$G = (V,E)$$ is a tree (where each node has at most one parent), then

$$
\arg\max_{G:G\text{ is tree}} \log P(\mathcal D\mid \theta^{ML}(G),G) = \arg\max_{G:G\text{ is tree}}\sum_{(i,j)\in E} MI(X_i,X_j)
\arg\max_{G:G\text{ is tree}} \log P(\mathcal D \mid \theta^{ML}, G) = \arg\max_{G:G\text{ is tree}} \sum_{(i,j)\in E} MI(X_i,X_j).
$$

Note that the orientations of edges do not matter because mutual information is symmetric. Thus we can see why the Chow-Liu algorithm finds the best approximate tree structure where nodes are restricted to have at most one parent.
The orientation of edges does not matter because mutual information is symmetric. Thus we can see why the Chow-Liu algorithm finds the tree-structured graph that maximizes the log-likelihood of the data.

### Search algorithms
