-
Notifications
You must be signed in to change notification settings - Fork 23
/
Copy path3_4_ST-MCMC.tex
1655 lines (921 loc) · 43.2 KB
/
3_4_ST-MCMC.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
%\documentclass[mathserif]{beamer}
\documentclass[handout]{beamer}
%\usetheme{Goettingen}
\usetheme{Warsaw}
%\usetheme{Singapore}
%\usetheme{Frankfurt}
%\usetheme{Copenhagen}
%\usetheme{Szeged}
%\usetheme{Montpellier}
%\usetheme{CambridgeUS}
%\usecolortheme{}
%\setbeamercovered{transparent}
\usepackage[english, activeacute]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath, amssymb}
\usepackage{dsfont}
\usepackage{graphics}
\usepackage{cases}
\usepackage{graphicx}
\usepackage{pgf}
\usepackage{epsfig}
\usepackage{amssymb}
\usepackage{multirow}
\usepackage{amstext}
\usepackage[ruled,vlined,lined]{algorithm2e}
\usepackage{amsmath}
\usepackage{epic}
\usepackage{epsfig}
\usepackage{fontenc}
\usepackage{framed,color}
\usepackage{palatino, url, multicol}
\usepackage{listings}
%\algsetup{indent=2em}
\vspace{-0.5cm}
\title{Markov Chain Monte Carlo}
\vspace{-0.5cm}
\author[Felipe Bravo Márquez]{\footnotesize
%\author{\footnotesize
\textcolor[rgb]{0.00,0.00,1.00}{Felipe José Bravo Márquez}}
\date{ \today }
\begin{document}
\begin{frame}
\titlepage
\end{frame}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{frame}{Markov Chain Monte Carlo}
\scriptsize{
\begin{itemize}
\item This class introduces estimation of posterior probability distributions using a stochastic process known
as \textbf{Markov chain Monte Carlo} (MCMC).
\item Here we'll produce samples from the joint posterior without maximizing anything.
\item We will be able to sample directly from the posterior without assuming a Gaussian, or any other, shape.
\item The cost of this power is that it may take much longer for our estimation to complete.
\item But the benefit is escaping multivariate normality assumption of the Laplace approximation.
\item This class is based on Chapter 9 of \cite{mcelreath2020statistical} and Chapter 7 of \cite{kruschke2014doing}.
\end{itemize}
}
\end{frame}
\begin{frame}{Markov Chain Monte Carlo}
\scriptsize{
\begin{itemize}
\item More advanced models such as generalized and multilevel models tend to produce non-Gaussian posterior distributions.
\item In most cases they cannot be estimated at all with the techniques of earlier classes.
\item The essence of MCMC is to produce samples from the posterior $f(\theta|d)$ by only accessing a function that is proportial to it.
\item This proportial function is the product of the likelihood and the prior $f(d|\theta)*f(\theta)$, which is always available in a Bayesian model.
\end{itemize}
}
\end{frame}
\begin{frame}{Markov Chain Monte Carlo}
\scriptsize{
\begin{itemize}
\item So, merely by evaluating $f(d|\theta)*f(\theta)$, without normalizing it by $f(d)$, MCMC allows us to generate random representative values from the posterior distribution.
\item This property is wonderful because the method obviates direct computation of the evidence $f(d)$.
\item Which is one of the most difficult aspects of Bayesian inference.
\item It has only been with the development of MCMC algorithms an software that Bayesian inference is applicable to complex data analysis.
\item And it has only been with the production of fast and cheap computer hardware that Bayesian inference is accessible to a wide audience.
\item The question then becomes this: How does MCMC work?
\item For an answer, let's ask a politician.
\end{itemize}
}
\end{frame}
\begin{frame}{A politician stumbles upon the Metropolis algorithm}
\scriptsize{
\begin{itemize}
\item Suppose an elected politician lives on a long chain of islands.
\item He is constantly traveling from island to island, wanting to stay in the public eye.
\item At the end of a day he has to decide whether to:
\begin{enumerate}
\scriptsize{
\item stay on the current island
\item move to the adjacent island to the left
\item move to the adjacent island to right }
\end{enumerate}
\item His goal is to visit all the islands \textbf{proportionally} to their \textbf{relative population}.
\item But, he doesn't know the total population of all the islands.
\item He only knows the population of the current island where he is located.
\item He can also ask about the population of an adjacent island to which he plans to move.
\end{itemize}
}
\end{frame}
\begin{frame}{The Metropolis algorithm}
\scriptsize{
\begin{itemize}
\item The politician has a simple heuristic for travelling accross the islands called the \textbf{Metropolis} algorithm \cite{metropolis1953equation}.
\item First, he flips a (fair) coin to decide whether to propose the adjacent island to the left or the adjacent island to the right.
\item If the proposed island has a larger population than
the current island ($P_{proposed}>P_{current}$), then he goes to the proposed island.
\item If the proposed island has a smaller population than the current island ($P_{proposed}<P_{current}$), then he goes to the proposed island with probability $\mathbb{P}_{move}=P_{proposed}/P_{current}$.
\item In the long run, the probability that the politician is on any one of the islands exactly matches the relative population of the island!
\end{itemize}
}
\end{frame}
\begin{frame}{The Metropolis algorithm}
\scriptsize{
\begin{itemize}
\item Let's analyze the Metropolis algorithm in more detail.
\item Suppose there are 10 islands in total.
\item Each island is neighbored by two others, and the
entire archipelago forms a ring.
\item The islands are of different sizes, and so had different sized populations living on them.
\item The second island is twice as populous as the first,
the third three times as populous as the first.
\item And so on, up to the largest island, which is 10 times as populous as the smallest.
\end{itemize}
}
\end{frame}
\begin{frame}{The Metropolis algorithm}
\begin{figure}[h!]
\centering
\includegraphics[scale=0.5]{pics/islands.png}
\end{figure}
\end{frame}
\begin{frame}{The Metropolis algorithm}
\scriptsize{
\begin{itemize}
\item We are going to show an implementation of this algorithm in R.
\item But before that, we will combine combine the two possibilities for the probability of moving into a single expression: the proposed island having a 1) higher or 2) lower population than the current island.
\begin{equation}
\mathbb{P}_{move}=\min(1,P_{proposed}/P_{current}).
\end{equation}
\item So, if $P_{proposed}>P_{current}$, $P_{proposed}/P_{current}>1$ and $\mathbb{P}_{move}=1$.
\item For example, $current=4$ and $proposed=5$, $5/4>1$ so we move to the proposed island (with probability 1).
\item On the other hand, if $P_{proposed}<P_{current}$, $P_{proposed}/P_{current}<1$, and $\mathbb{P}_{move}=P_{proposed}/P_{current}$.
\item For example, $current=4$ and $proposed=3$, $3/4<1$ so we move to the proposed island with probability $3/4$.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{The Metropolis algorithm}
\scriptsize{
\begin{verbatim}
num_days <- 1e5
positions <- rep(0,num_days)
current <- 10
for ( i in 1:num_days ) {
# record current position
positions[i] <- current
# flip coin to generate proposal
proposal <- current + sample( c(-1,1) , size=1 )
# now make sure he loops around the archipelago
if ( proposal < 1 ) proposal <- 10
if ( proposal > 10 ) proposal <- 1
# move?
prob_move <- min(proposal/current,1)
decision <- rbinom(1,1,prob_move)
current <- ifelse( decision == 1 , proposal , current )
}
library(rethinking)
simplehist(positions,xlab="island",ylab="number of days")
\end{verbatim}
}
\end{frame}
\begin{frame}{The Metropolis algorithm}
\begin{figure}[h!]
\centering
\includegraphics[scale=0.92]{pics/mcmc_islands.pdf}
\end{figure}
\scriptsize{
The time spent on each island is proportional to its population size.}
\end{frame}
\begin{frame}{The Metropolis algorithm}
\scriptsize{
\begin{itemize}
\item The first three lines of the method just define
the number of days to simulate, an empty history vector, and a starting island position (the biggest island, number 10).
\item Then the for loop steps through the days.
\item Each day, it records the politician's current position.
\item Then it simulates a coin flip to nominate a proposal island.
\item The only trick here lies in making sure that a proposal of ``11'' loops around to island 1 and a proposal of “0” loops around to island 10.
\item Finally, a random binary number is generated with a Bernoulli distribution (Binomial with 1 trial) with probability of success (or moving)$=\min(1,P_{proposed}/P_{current})$.
\item If this random number is 1 we move, otherwise we stay.
\end{itemize}
}
\end{frame}
\begin{frame}{The Metropolis algorithm}
\scriptsize{
\begin{itemize}
\item In real applications, the goal is not to help a politician, but instead to draw samples from an unknown and usually complex posterior probability distribution.
\item The ``islands'' in our objective are parameter values $\theta$, and they need not be discrete, but can instead take on a continuous range of values as usual.
\item The ``population sizes'' in our objective are the posterior probabilities (or densities) at each parameter value: $f(\theta|d)$
\item The ``days'' in our objective are samples taken from the posterior distribution.
\item The coin flip represents the \textbf{proposal distribution} $q(\theta)$, which for continuous parameters is usually replaced by a Gaussian distribution.
\item The Metropolis algorithm will eventually give us a collection of samples from the posterior.
\item We can then use these samples just like all the samples we have already used in this course.
\end{itemize}
}
\end{frame}
\begin{frame}{Why it works}
\scriptsize{
\begin{itemize}
\item Now, let's try to understand why the algorithm works.
\item We are going to denote our target probability density from which we want to draw samples as $p(\theta)$.
\item Bear in mind that $p(\theta)$ is usually a posterior density $f(\theta|d)$.
\item Consider two adjacent positions and the probabilities of moving from one to the other.
\item We'll see that the relative transition probabilities, between adjacent positions, exactly match the relative values of the target density $p(\theta)$.
\item Extrapolate that result across all the positions, and you can see that, in the long run, each position will be visited proportionally to its target value.
\item Suppose we are at position $\theta$.
\end{itemize}
}
\end{frame}
\begin{frame}{Why it works}
\scriptsize{
\begin{itemize}
\item The probability of moving to $\theta+ 1$, denoted
$\mathbb{P}(\theta \rightarrow \theta + 1)$, is the probability of proposing that move times the probability of
accepting it if proposed, which is:
\begin{displaymath}
\mathbb{P}(\theta \rightarrow \theta + 1) = 0.5 \times \min(p(\theta+ 1)/p(\theta),1)
\end{displaymath}
\item On the other hand, if we are presently at position $\theta+ 1$, the probability of moving to $\theta$ is:
\begin{displaymath}
\mathbb{P}(\theta+1 \rightarrow \theta) = 0.5 \times \min(p(\theta)/p(\theta+1),1)
\end{displaymath}
\item The ratio of the transition probabilities is:
\begin{align}
\frac{\mathbb{P}(\theta \rightarrow \theta+1)}{\mathbb{P}(\theta +1 \rightarrow \theta)} & = \frac{0.5 \times \min(p(\theta+1)/p(\theta),1) }{0.5 \times \min(p(\theta)/p(\theta+1),1) }\\
& = \begin{cases}
\frac{1}{p(\theta)/p(\theta+1)} & \text{if} p(\theta+1) > p(\theta)\\ \\
\frac{p(\theta+1)/p(\theta)}{1} & \text{if} p(\theta+1) < p(\theta)
\end{cases} \\
& = \frac{p(\theta+1)}{p(\theta)}
\end{align}
\end{itemize}
}
\end{frame}
\begin{frame}{Why it works}
\scriptsize{
\begin{itemize}
\item The last equation tells us that during transitions back and forth between adjacent positions, the relative probability of the transitions exactly matches the relative values of the target distribution.
\item That might be enough to get the intuition that, in the
long run, adjacent positions will be visited proportionally to their relative values in the target distribution.
\item If that's true for adjacent positions, then, by extrapolating from one position to the next, it must be true for the whole range of positions.
\item In more mathematical terms, this means that the transition probabilities form a Markov chain that has the target distribution as its equilibrium or stationary distribution. \cite{wiki:Markov_chain_Monte_Carlo}
\item Hence, one can obtain a sample of the desired distribution by recording states from the chain.
\end{itemize}
}
\end{frame}
\begin{frame}{The Metropolis Algorithm more Generally}
\scriptsize{
\begin{itemize}
\item So far, we have only considered the case with a single discrete parameter $\theta$ that can only move to the left or right.
\item The general Metropolis algorithm allows working with multiple continuous parameters $\vec{\theta}=\{\theta_1,\theta_2,\dots,\theta_m\}$ and more general proposal distributions.
\item The essentials of the general method are the same as for the simple case.
\item First, we have some target density $p(\theta)$ ($\theta$ can be a vector of parameters $\vec{\theta}$) from which we would like to generate representative sample values:
\begin{displaymath}
\theta^{(1)},\theta^{(2)},\dots,\theta^{(t)}, \dots, \theta^{(k)}
\end{displaymath}
\item Note that each parameter sample $\theta^{(t)}$ can be a multidimensional vector $\vec{\theta}^{(t)}$ if we have multiple parameters.
\item We must be able to compute the value of $p(\theta)$ for any candidate value of $\theta$.
\item The density $p(\theta)$, does not have to be normalized, however.
\item Just needs needs to be nonnegative.
\end{itemize}
}
\end{frame}
\begin{frame}{The Metropolis Algorithm more Generally}
\scriptsize{
\begin{itemize}
\item In our Bayesian inference application $p(\theta)$ is the unnormalized posterior density on $\theta$, which is the product of the likelihood and the prior: $f(d|\theta)*f(\theta)$.
\item This is a very important property of MCMC, as it allows us to draw samples from the posterior without having to calculate the evidence $f(d)$.
\item Sample values from the target distribution $\theta^{(1)},\theta^{(2)},\dots,\theta^{(k)}$ are generated by taking a random walk through the parameter space.
\item The proposal distribution $q(\theta)$ aims to explore $p(\theta)$ allowing us to visit any parameter value on the continuum.
\item The proposals from $q(\theta)$ should go a reasonable distance in the parameter space (otherwise the random walk moves too slowly).
\item They should also not be rejected too frequently (otherwise the random walk wastes too much time standing still) \cite{gelman2013bayesian}.
\end{itemize}
}
\end{frame}
\begin{frame}{The Metropolis Algorithm more Generally}
\scriptsize{
\begin{itemize}
\item The generic case for $q(\theta)$ is using a Gaussian distribution centered at the current position $\theta^{(t-1)}$.
\begin{displaymath}
q(\theta^{(t)}|\theta^{(t-1)})=N(\theta^{(t-1)},\sigma^2)
\end{displaymath}
where $\sigma$ is a user-specified parameter.
\item So the proposed move will typically be near the current position, with the probability of proposing a more distant position dropping off according to the normal curve.
\item For multivariate target distributions $p(\vec{\theta})$, we can use a Multi-variate Gaussian to propose multi-dimensional points in each step.
\begin{displaymath}
q(\vec{\theta}^{(t)}|\vec{\theta}^{(t-1)})=N(\vec{\theta}^{(t-1)},\Sigma)
\end{displaymath}
\end{itemize}
}
\end{frame}
\begin{frame}{The Metropolis Algorithm more Generally}
\scriptsize{
\begin{itemize}
\item A general form of the Metropolis algorithm as described in \cite{gelman2013bayesian} is given below:
\end{itemize}
\begin{block}{The Metropolis Algorithm}
\begin{enumerate}
\item Draw a starting point $\theta^0$ for which $p(\theta^0)>0$.
\item For $t=1,2,\dots,k:$
\begin{enumerate}
\scriptsize{
\item Sample a proposal $\theta^{*}$ from the proposal distribution at time $t$: $q(\theta^{*}|\theta^{(t-1)})$
\item Calculate the ratio of the densities:
\begin{equation}\label{eq:ratio}
r = \frac{p(\theta^*)}{p(\theta^{t-1})}
\end{equation}
\item Set
\begin{equation}
\theta^{(t)} =
\begin{cases}
\theta^{*} & \text{with probability} \min(r,1)\\
\theta^{(t-1)} & \text{otherwise}
\end{cases}
\end{equation}
\item Return $\theta^{(1)},\dots,\theta^{(k)}$.
}
\end{enumerate}
\end{enumerate}
\end{block}
If the proposed value $\theta^{*}$ happens to land outside the permissible bounds of $\theta$ (e.g., when $0\leq \theta \leq1$), $p(\theta^*)$ is set to zero, hence $\theta^{(t)}=\theta^{(t-1)}$.
}
\end{frame}
\begin{frame}[fragile]{The Metropolis-Hastings Algorithm}
\scriptsize{
\begin{itemize}
\item The Metropolis algorithm works whenever the probability of proposing a jump to B from A is equal to the probability of proposing A from B, when the proposal distribution is symmetric (such as a Gaussian distribution).
\begin{displaymath}
q(\theta^{(a)}|\theta^{(b)}) = q(\theta^{(b)}|\theta^{(a)})
\end{displaymath}
\item We can easily check this using R:
\begin{verbatim}
> a<-0.5
> b<-0.3
> dnorm(x=a,mean=b)
[1] 0.3910427
> dnorm(x=b,mean=a)
[1] 0.3910427
\end{verbatim}
\item There is a more general method, known as Metropolis-Hastings, that allows asymmetric proposals (i.e., $q(\theta^{(a)}|\theta^{(b)}) \neq q(\theta^{(b)}|\theta^{(a)}$).
\item This would mean, that the politician's coin were biased to lead him clockwise on average.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{The Metropolis-Hastings Algorithm}
\scriptsize{
\begin{itemize}
\item To correct for the asymmetry in the proposals, the ratio $r$ in Equation~\ref{eq:ratio} is replaced by a ratio of ratios:
\begin{equation}\label{eq:r_hastings}
r = \frac{p(\theta^*)/q(\theta^{*}|\theta^{(t-1)})}{p(\theta^{t-1})/q(\theta^{(t-1)}|\theta^{*})}
\end{equation}
\item Asymmetric proposal distributions allows us to explore the posterior distribution more efficiently (i.e., acquire a good image of the posterior distribution in fewer steps).
\item The goal is that regions with high probability density in $p(\theta)$ to be proposed more frequently than regions with low probability density.
\item The ideal proposal distribution $q(\theta^{*}|\theta^{(t-1)})$ would be the target distribution $p(\theta^{*})$.
\item Then the ratio $r$ in Equation~\ref{eq:r_hastings} is always exactly 1.
\item However, in practical settings is not possible so sample directly from $p(\theta^{*})$, which indeed is the reason why we use MCMC methods.
\end{itemize}
}
\end{frame}
\begin{frame}{Gibbs Sampling}
\scriptsize{
\begin{itemize}
\item Gibbs sampling is a variant of the Metropolis-Hastings algorithm that uses clever proposals and is therefore more efficient.
\item The improvement arises from adaptive proposals in which the distribution of proposed parameter values adjusts itself intelligently, depending upon the parameter values at the moment.
\item How Gibbs sampling computes these adaptive proposals depends upon using conjugate combinations of priors and likelihoods (such as the Beta and the Binomial).
\item As previously presented, conjugate combinations have analytical solutions for the posterior distribution of an individual parameter.
\item And these solutions are what allow Gibbs sampling to make smart jumps around the joint posterior distribution of all parameters.
\end{itemize}
}
\end{frame}
\begin{frame}{Gibbs Sampling}
\scriptsize{
\begin{itemize}
\item The algorithm works as follows:
\item First, we assume that our target parameter $\theta$ is multi-dimensional: $\vec{\theta}=\{\theta_1,\theta_2,\dots,\theta_m\}$
\item At each point in the walk, the parameters are selected in an iterative cycle: $\theta_1, \theta_2, \theta_3 , \dots \theta_1, \theta_2, \theta_3, \dots.$
\item Suppose that parameter $\theta_j$ has been selected.
\item Gibbs sampling then chooses a new value for that parameter by generating a random value directly from
the conditional probability distribution of that parameter given all the others:
\begin{displaymath}
q(\theta^{*}_j|\theta^{(t-1)}) = p(\theta^{*}_j | \theta^{(t-1)}_{-j})= p(\theta^{*}_j | \theta^{(t-1)}_1,\dots, \theta^{(t-1)}_{j-1}, \theta^{(t-1)}_{j+1}, \dots,\theta^{(t-1)}_m)
\end{displaymath}
\item Since we are using conjugate combinations, this conditional distribution has a closed form that facilitates the sampling of random numbers from it.
\end{itemize}
}
\end{frame}
\begin{frame}{Gibbs Sampling}
\scriptsize{
\begin{itemize}
\item The new value for $\theta_j^{(t)}$ , combined with the unchanged values of $\theta^{(t-1)}_1,\dots, \theta^{(t-1)}_{j-1}, \theta^{(t-1)}_{j+1}, \dots,\theta^{(t-1)}_m$, constitutes the new position in the random walk.
\begin{displaymath}
\{\theta^{(t)}_1,\dots, \theta^{(t)}_{j-1}, \theta^{(t)}_{j}, \theta^{(t)}_{j+1}, \dots,\theta^{(t)}_m\} = \{\theta^{(t-1)}_1,\dots, \theta^{(t-1)}_{j-1}, q(\theta^{*}_j|\theta^{(t-1)}), \theta^{(t-1)}_{j+1}, \dots,\theta^{(t-1)}_m\}
\end{displaymath}
\item The process then repeats: select the next parameter $\theta_{j+1}$ and select a new value for that parameter from its conditional posterior distribution.
\item Let's illustrate this process for a two-parameter example: $\theta_1,\theta_2$.
\item In the first step, we want to select a new value for $\theta_1^{(t)}$ .
\item We conditionalize on the values of all the other
parameters from the previous step in the chain.
\item In this example, there is only one other parameter, namely $\theta_2$.
\end{itemize}
}
\end{frame}
\begin{frame}{Gibbs Sampling}
\scriptsize{
\begin{figure}[h!]
\centering
\includegraphics[scale=0.38]{pics/gibbs1.png}
\end{figure}
\begin{itemize}
\item The figure shows a slice through the target joint distribution $p(\theta)$ at the current value of $\theta_2^{(t-1)}$.
\item The heavy curve is the target distribution conditional on this value of $\theta_2^{(t-1)}$.
\item This corresponds to $p(\theta_1|\theta_2^{(t-1)})$ in this case because there is only one other parameter.
\end{itemize}
}
\end{frame}
\begin{frame}{Gibbs Sampling}
\scriptsize{
\begin{itemize}
\item Because we are using conjugate distributions a computer can directly generate a random value of $\theta_1^{(t)}$ from $p(\theta_1|\theta_2^{(t-1)})$.
\item Having generated $\theta_1^{(t)}$, we then conditionalize on it and determine the conditional distribution of the next parameter, $\theta_2^{(t+1)}$ using $p(\theta_2|\theta_1^{(t)})$ as shown below:
\begin{figure}[h!]
\centering
\includegraphics[scale=0.4]{pics/gibbs2.png}
\end{figure}
\item We generate a new value $\theta_2^{(t+1)}$ , and the cycle repeats.
\end{itemize}
}
\end{frame}
\begin{frame}{Gibbs Sampling}
\scriptsize{
\begin{itemize}
\item Because the proposal distribution exactly mirrors the posterior probability for that parameter, the proposed move is always accepted.
\item The proof of this can be found on page on page 281 of \cite{gelman2013bayesian}, but essentially since $q(\theta^{*}_j|\theta^{(t-1)}) = p(\theta^{*}_j | \theta^{(t-1)}_{-j})$, the ratio $r$ from Equation~\ref{eq:r_hastings} becomes 1.
\item Hence, the algorithm is more efficient than the standard Metropolis algorithm in which proposals are rejected in many cases.
\item But there are some limitations to Gibbs sampling.
\item First, there are cases when we don't want to use conjugate priors.
\item Second, it can become inefficient with complex models containing hundreds, thousands or tens of thousands of parameters.
\end{itemize}
}
\end{frame}
\begin{frame}{Hamiltonian Monte Carlo}
\scriptsize{
\begin{itemize}
\item The Metropolis algorithm and Gibbs sampling are both highly random procedures.
\item They try out new parameter values and see how good they are, compared to the current values.
\item But Gibbs sampling gains efficiency by reducing this randomness and exploiting knowledge of the target distribution.
\item Hamiltonian Monte Carlo (or Hybrid Monte Carlo, HMC) is another sampling method that is much more computationally costly than the others but its proposals are much more efficient.
\item As a result, it doesn't need as many samples to describe the posterior distribution.
\item And as models become more complex (thousands or tens of thousands of parameters) HMC can really outshine other algorithms.
\end{itemize}
}
\end{frame}
\begin{frame}{Hamiltonian Monte Carlo}
\scriptsize{
\begin{itemize}
\item HMC is a very complex algorithm and we won't get into the details of its inner workings
\item Let's try to understand it in a very superficial way by using again the politician's tale.
\item Suppose the politician has moved to the mainland now.
\item Now, instead of moving over a set of discrete islands, it has to move through a continuous territory stretched out along a narrow valley, running north-south.
\item The obligations are the same: to visit his citizens in proportion to their local density.
\item And again, the politician doesn't know the population of each area in advance.
\end{itemize}
}
\end{frame}
\begin{frame}{Hamiltonian Monte Carlo}
\scriptsize{
\begin{itemize}
\item The strategy of the politician is the following:
\item He drives his car across the narrow valley back and forth along its length.
\item In order to spend more time in densely settled areas, he slows down his vehicle when houses grow
more dense.
\item Likewise, he speeds up when houses grow more sparse.
\item This strategy requires knowing how quickly population density is changing, at their current location.
\item But it doesn't require remembering where they've been or knowing the population distribution anyplace else.
\item This story is analogous to how Hamiltonian Monte Carlo works.
\end{itemize}
}
\end{frame}
\begin{frame}{Hamiltonian Monte Carlo}
\scriptsize{
\begin{itemize}
\item In statistical applications, the politician's vehicle is the current vector of parameter values.
\item HMC really does run a physics simulation, pretending the vector of parameters gives the position of a little frictionless particle.
\item The log-posterior provides a surface for this particle to glide across.
\item Then the job is to sweep across this surface, adjusting speed (momentum) in proportion to how high up we are.
\item When the log-posterior is very flat, then the particle can glide for a long time before the slope (gradient) makes it turn around.
\item When instead the log-posterior is very steep, then the particle doesn’t get far before turning around.
\end{itemize}
}
\end{frame}
\begin{frame}{Hamiltonian Monte Carlo}
\begin{figure}[h!]
\centering
\includegraphics[scale=0.25]{pics/HMC.png}
\end{figure}
source: \cite{izmailov2021bayesian}
\end{frame}
\begin{frame}{Hamiltonian Monte Carlo}
\scriptsize{
\begin{itemize}
\item A big limitation of HMC is that it needs to be tuned to a particular model and its data.
\item Stan\footnote{\url{https://mc-stan.org/}} is a very popular platform for statistical modeling and high-performance statistical computation.
\item Stan automates much of that tuning.
\item Next, we will see how to use Stan to fit a linear model with an interaction.
\item Make sure that the \textbf{rstan} package is installed to access Stan from within R.
\end{itemize}
\begin{figure}[h!]
\centering
\includegraphics[scale=0.2]{pics/stan.png}
\end{figure}
}
\end{frame}
\begin{frame}{Easy HMC: ulam}
\scriptsize{
\begin{itemize}
\item Stan provides its own language to delcare models.
\item The rethinking package provides a convenient interface, \textbf{ulam}, to compile lists of formulas, like the lists we've been using so far to construct quap estimates, into Stan HMC
code.
\item All that ulam does is translate formulas into Stan models, and then Stan defines
the sampler and does the hard part.
\item Stan models look very similar, but require some more explicit definitions.
\item We will show them later.
\item Before using ulam we must preprocess any variable transformations (log, exp, etc).
\item We also must construct a clean data.frame with only the variables we will use.
\end{itemize}
}
\end{frame}
\begin{frame}{A model of ruggedness}
\scriptsize{
\begin{itemize}
\item We are are going to work with the \textbf{rugged} dataset from the rethinking package.
\item Each row in these data is a country, and the various columns are economic, geographic, and
historical features.
\item The variable rugged is a Terrain Ruggedness Index that quantifies the topographic heterogeneity of a landscape.
\item The outcome variable of our regression model will be the logarithm of real gross domestic product per capita, from the year 2000, \textbf{rgdppc\_2000}.