Merge pull request #207 from ermongroup/fix-struct-learn
Improve structure learning notes
chrisyeh96 authored Jul 27, 2021
Commit 49e87aa (2 parents: 8758f4a + a28507f)
Showing 2 changed files with 29 additions and 29 deletions.
_layouts/post.html (2 changes: 1 addition & 1 deletion)
---
layout: default
---
<h1>{{ page.title }}</h1>
<p class="subtitle">{{ page.date | date: "%B %-d, %Y" }}</p>


learning/structure/index.md (56 changes: 28 additions & 28 deletions)
---
layout: post
title: Structure learning for Bayesian networks
---
## Structure learning for Bayesian networks

The task of structure learning for Bayesian networks refers to learning the structure of the directed acyclic graph (DAG) from data. There are two major approaches for structure learning: score-based and constraint-based.

### Score-based approach

The score-based approach first defines a criterion to evaluate how well the Bayesian network fits the data, then searches over the space of DAGs for a structure achieving the maximal score. The score-based approach is essentially a search problem that consists of two parts: the definition of a score metric and the search algorithm.

### Score metrics

A score metric for a structure $$\mathcal{G}$$ and data $$D$$ can generally be defined as:

$$ Score(G:D) = LL(G:D) - \phi(|D|) \|G\|. $$

Here $$LL(G:D)$$ refers to the log-likelihood of the data under the graph structure $$\mathcal{G}$$, with the parameters of the Bayesian network $$G$$ set to their maximum-likelihood estimates (MLE). If the score function consisted only of the log-likelihood term, the optimal graph would be a complete graph, which would likely overfit the data. The second term $$\phi(|D|) \|G\|$$ therefore serves as a regularization term, favoring simpler models. Here $$\lvert D \rvert$$ is the number of data samples, and $$\|G\|$$ is the number of parameters in the graph $$\mathcal{G}$$. When $$\phi(t) = 1$$, the score function is known as the Akaike Information Criterion (AIC); when $$\phi(t) = \log(t)/2$$, it is known as the Bayesian Information Criterion (BIC). With the BIC, the influence of model complexity decreases as $$\lvert D \rvert$$ grows, allowing the log-likelihood term to eventually dominate the score.

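To make the score concrete, here is a minimal Python sketch of the BIC score for fully observed discrete data. The data layout (a list of dicts mapping variable names to values), the DAG representation (`{node: list of parents}`), and the helper names are our own illustrative choices, not code from these notes.

```python
import math
from collections import Counter


def family_log_likelihood(rows, child, parents):
    """Maximized log-likelihood contribution of `child` given its parents (discrete MLE)."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)  # N_{i, pi_i, j}
    parent_totals = Counter(tuple(r[p] for p in parents) for r in rows)      # sum_j N_{i, pi_i, j}
    return sum(n * math.log(n / parent_totals[pa]) for (pa, _), n in joint.items())


def num_free_parameters(rows, child, parents):
    """Free parameters in the CPD of `child`: (|Val(child)| - 1) * prod_p |Val(p)|."""
    def card(v):
        return len({r[v] for r in rows})
    k = 1
    for p in parents:
        k *= card(p)
    return (card(child) - 1) * k


def bic_score(rows, dag):
    """BIC score of a DAG given as {node: list_of_parents}; here phi(t) = log(t) / 2."""
    log_lik = sum(family_log_likelihood(rows, v, ps) for v, ps in dag.items())
    num_params = sum(num_free_parameters(rows, v, ps) for v, ps in dag.items())
    return log_lik - 0.5 * math.log(len(rows)) * num_params
```

A score-based search would then compare `bic_score` across candidate DAG structures and keep the highest-scoring one.
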
There is another family of Bayesian score functions, the BD (Bayesian Dirichlet) scores. The BD score first defines the probability of the data $$D$$ conditional on the graph structure $$\mathcal{G}$$ as

$$
P(D \mid \mathcal{G}) = \int P(D \mid \mathcal{G}, \Theta_{\mathcal{G}}) \, P(\Theta_{\mathcal{G}} \mid \mathcal{G}) \, d\Theta_{\mathcal{G}},
$$

where $$P(D \mid \mathcal{G}, \Theta_{\mathcal{G}})$$ is the probability of the data given the network structure and parameters, and $$P(\Theta_{\mathcal{G}} \mid \mathcal{G})$$ is the prior probability of the parameters. When the prior probability is specified as a Dirichlet distribution,

$$
P(D \mid \mathcal{G})
= \prod_i \prod_{\pi_i} \left[ \frac{\Gamma(\sum_j N'_{i,\pi_i,j})}{\Gamma(\sum_j N'_{i,\pi_i,j} + N_{i,\pi_i,j})} \prod_{j}\frac{\Gamma(N'_{i,\pi_i,j} + N_{i,\pi_i,j})}{\Gamma(N'_{i,\pi_i,j})}\right].
$$

Here $$\pi_i$$ refers to a parent configuration of variable $$i$$, and $$N_{i,\pi_i,j}$$ is the number of times variable $$i$$ takes value $$j$$ with parent configuration $$\pi_i$$ in the data. The $$N'_{i,\pi_i,j}$$ are the corresponding pseudo-counts specified by the Dirichlet prior.
Notice that no explicit penalty term is added to the BD score: because it integrates over the parameter space rather than plugging in maximum-likelihood estimates, it penalizes overly complex structures implicitly.

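In practice the BD score is evaluated in log space using the log-Gamma function. As an illustrative sketch (the dictionary-of-counts layout and the function name are our own assumptions), the log of the factor above for a single variable $$i$$ could be computed as:

```python
import numpy as np
from scipy.special import gammaln  # log-Gamma, keeps the products numerically stable


def log_bd_family_score(counts, pseudo_counts):
    """Log of the BD-score factor for one variable i.

    counts[pi] and pseudo_counts[pi] are the arrays [N_{i,pi,j}]_j and [N'_{i,pi,j}]_j
    for each parent configuration pi.
    """
    total = 0.0
    for pi, n in counts.items():
        n = np.asarray(n, dtype=float)
        n0 = np.asarray(pseudo_counts[pi], dtype=float)
        total += gammaln(n0.sum()) - gammaln((n0 + n).sum())  # Gamma(sum N') / Gamma(sum N' + N)
        total += np.sum(gammaln(n0 + n) - gammaln(n0))        # prod_j Gamma(N' + N) / Gamma(N')
    return total
```
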
### Chow-Liu Algorithm

The Chow-Liu algorithm is a specific type of score-based approach, which finds the maximum-likelihood tree-structured graph (i.e., each node has exactly one parent, except for a parentless root node). The score is simply the log-likelihood; there is no penalty term for graph structure complexity, since the algorithm only considers tree structures.

The algorithm has three steps (a short code sketch is given after the steps):

1. Compute the mutual information for all pairs of variables $$X,U$$, and form a complete graph from the variables where the edge between variables $$X,U$$ has weight $$MI(X,U)$$:

$$
MI(X,U) =\sum_{x,u} \hat p(x,u)\log\left[\frac{\hat p(x,u)}{\hat p(x) \hat p(u)}\right]
$$

This function measures how much information $$U$$ provides about $$X$$. The graph with computed MI edge weights might resemble:

{% include maincolumn_img.html src='assets/img/mi-graph.png' %}

Remember that $$\hat p$$ denotes the empirical distribution, i.e., $$\hat p(x,u) = \frac{Count(x,u)}{\# \text{ data points}}$$.

2. Find the **maximum** weight spanning tree: the maximal-weight tree that connects all vertices in a graph. This can be found using Kruskal's or Prim's algorithm.

{% include maincolumn_img.html src='assets/img/max-spanning-tree.png' %}

3. Pick any node to be the *root variable*, and assign directions radiating outward from this node (arrows point away from it). This step transforms the resulting undirected tree into a directed one.

{% include maincolumn_img.html src='assets/img/chow-liu-tree.png' %}

The Chow-Liu Algorithm has a complexity of order $$n^2$$, as it takes $$O(n^2)$$ to compute mutual information for all pairs, and $$O(n^2)$$ to compute the maximum spanning tree.

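The three steps above can be sketched end to end in a few lines of Python. This is only an illustration under our own assumptions: fully observed discrete data stored as a list of dicts, `networkx` used for the spanning-tree step, and function names of our choosing.

```python
import math
from collections import Counter
from itertools import combinations

import networkx as nx  # assumed available; used only for the spanning-tree step


def mutual_information(rows, x, u):
    """Empirical mutual information MI(X, U) for fully observed discrete data."""
    n = len(rows)
    p_xu = Counter((r[x], r[u]) for r in rows)
    p_x = Counter(r[x] for r in rows)
    p_u = Counter(r[u] for r in rows)
    return sum((c / n) * math.log((c / n) / ((p_x[a] / n) * (p_u[b] / n)))
               for (a, b), c in p_xu.items())


def chow_liu_tree(rows, variables, root):
    """Return the learned tree as a {child: parent} mapping (root maps to None)."""
    graph = nx.Graph()
    for x, u in combinations(variables, 2):          # step 1: complete graph with MI weights
        graph.add_edge(x, u, weight=mutual_information(rows, x, u))
    tree = nx.maximum_spanning_tree(graph)           # step 2: maximum-weight spanning tree
    parent = {root: None}
    for p, c in nx.bfs_edges(tree, root):            # step 3: orient edges away from the root
        parent[c] = p
    return parent
```

Any variable can be passed as `root`; the returned `{child: parent}` mapping describes the directed tree.
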
Having described the algorithm, let's explain why this works. It turns out that the likelihood score decomposes into mutual information and entropy terms:

$$
\log p(\mathcal D \mid \theta^{ML}, G) = |\mathcal D| \sum_i MI_{\hat p}(X_i, X_{pa(i)}) - |\mathcal D| \sum_i H_{\hat p}(X_i).
$$

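To see where this decomposition comes from, plug in the maximum-likelihood parameters, which are the empirical conditionals $$\hat p(x_i \mid x_{pa(i)})$$:

$$
\log p(\mathcal D \mid \theta^{ML}, G)
= |\mathcal D| \sum_i \sum_{x_i, x_{pa(i)}} \hat p(x_i, x_{pa(i)}) \log \hat p(x_i \mid x_{pa(i)})
= |\mathcal D| \sum_i \sum_{x_i, x_{pa(i)}} \hat p(x_i, x_{pa(i)}) \left[ \log \frac{\hat p(x_i, x_{pa(i)})}{\hat p(x_i)\, \hat p(x_{pa(i)})} + \log \hat p(x_i) \right].
$$

Summing the first bracketed term over $$x_i, x_{pa(i)}$$ gives $$MI_{\hat p}(X_i, X_{pa(i)})$$, and summing the second gives $$-H_{\hat p}(X_i)$$, which recovers the expression above.
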
We would like to find a graph $$G$$ that maximizes this log-likelihood. Since the entropies are independent of the dependency ordering in the tree, the only terms that change with the choice of $$G$$ are the mutual information terms. So we want

$$
\arg\max_G \log P(\mathcal D \mid \theta^{ML}, G) = \arg\max_G \sum_i MI(X_i, X_{pa(i)}).
$$

Now if we assume $$G = (V,E)$$ is a tree (where each node has at most one parent), then

$$
\arg\max_{G:G\text{ is tree}} \log P(\mathcal D \mid \theta^{ML}, G) = \arg\max_{G:G\text{ is tree}} \sum_{(i,j)\in E} MI(X_i,X_j).
$$

The orientation of edges does not matter because mutual information is symmetric. Thus we can see why the Chow-Liu algorithm finds a tree structure that maximizes the log-likelihood of the data.

### Search algorithms
