---
engine: julia
---
# Linear algebra for linear models {#sec-linalg}
\newcommand\bbA{{\mathbf{A}}}
\newcommand\bbb{{\mathbf{b}}}
\newcommand\bbI{{\mathbf{I}}}
\newcommand\bbR{{\mathbf{R}}}
\newcommand\bbX{{\mathbf{X}}}
\newcommand\bbx{{\mathbf{x}}}
\newcommand\bby{{\mathbf{y}}}
\newcommand\bbbeta{{\boldsymbol{\beta}}}
\newcommand\bbeta{{\boldsymbol{\eta}}}
\newcommand\bbLambda{{\boldsymbol{\Lambda}}}
\newcommand\bbOmega{{\boldsymbol{\Omega}}}
\newcommand\bbmu{{\boldsymbol{\mu}}}
\newcommand\bbSigma{{\boldsymbol{\Sigma}}}
\newcommand\bbtheta{{\boldsymbol{\theta}}}
\newcommand\mcN{{\mathcal{N}}}
\newcommand\mcB{{\mathcal{B}}}
\newcommand\mcU{{\mathcal{U}}}
\newcommand\mcX{{\mathcal{X}}}
\newcommand\mcY{{\mathcal{Y}}}
\newcommand\mcZ{{\mathcal{Z}}}
Attach the packages to be used in this appendix
```{julia}
#| code-fold: true
using EmbraceUncertainty: dataset
using LinearAlgebra
using MixedModels
using TypedTables: Table   # provides the Table constructor used below
```
In this appendix we describe properties of the multivariate Gaussian (or "normal") distribution and how linear models and linear mixed models can be formulated in terms of this distribution.
We also describe some methods in numerical linear algebra that are particularly useful in working with linear models.
One of the strengths of the Julia language is the [LinearAlgebra](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/) package in the standard library.
The implementation of multi-dimensional *arrays*, including one-dimensional *vectors* and two-dimensional *matrices*, is part of the base Julia language.
The added value of the *LinearAlgebra* package is compact representations of special types of matrices and methods for computing and working with matrix decompositions or [factorizations](https://nhigham.com/2022/05/18/the-big-six-matrix-factorizations/).
The purpose of these descriptions is to motivate a representation of a linear mixed model that allows for fast and stable estimation of the parameters.
The estimation process requires iterative optimization of some of the parameters in the model to minimize an objective function.
Often this optimization requires hundreds or thousands of evaluations of the objective at different values of the parameters and these evaluations dominate the overall estimation time.
Thus, a fast, efficient method for evaluating the objective is crucial to making the whole process fast.
## Matrix-vector representation of linear models
A linear statistical model is often written as an expression for each element of the $n$-dimensional response vector, $\bby$, e.g.
$$
y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} + \epsilon_i, \quad i=1,\dots, n
$$ {#eq-elementlinmod}
and some additional description like "where the $\epsilon_i,i=1,\dots,n$ are independently and identically distributed as $\mcN(0, \sigma^2)$".
An alternative is to write the model in terms of the $n$-dimensional *response vector*, $\bby$, an $n\times p$ *model matrix*, $\bbX$, and a $p$-dimensional *coefficient vector*, $\bbbeta$, as
$$
\mcY\sim\mcN\left(\bbX\bbbeta,\sigma^2\bbI\right),
$$ {#eq-mvnlinmod}
where $\mcN$ denotes the multivariate Gaussian distribution with mean $\bbmu=\bbX\bbbeta$ and variance-covariance matrix $\bbSigma=\sigma^2\bbI$.
(In what follows we will refer to the *variance-covariance matrix* as simply the *covariance matrix*.)
Before considering properties of and computational methods for the model @eq-mvnlinmod we will describe some of the properties of the [multivariate Gaussian distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution).
## The multivariate Gaussian distribution {#sec-multivariateGaussian}
Just as a univariate Gaussian distribution can be written by specifying the (scalar) mean, $\mu$, and the variance, $\sigma^2$, as $\mcN(\mu, \sigma^2)$, a multivariate Gaussian distribution is characterized by its $n$-dimensional mean vector, $\bbmu$, and its $n\times n$ variance-covariance matrix, $\bbSigma$, as $\mcN(\bbmu, \bbSigma)$.
The density function for a univariate Gaussian distribution is the familiar "bell curve"
$$
f(x; \mu, \sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-\left(x-\mu\right)^2}{2\sigma^2}\right)
$$ {#eq-univariateGaussianpdf}
and probabilities defined by this density are most easily evaluated by *standardizing* the deviation, $x-\mu$, as $z=\frac{x-\mu}{\sigma}$, which is why $\sigma$ is called the *standard deviation*.
To be able to evaluate $\sigma$, the variance, $\sigma^2$, must be positive, or at least non-negative.
If $\sigma^2=0$ then all the probability is concentrated at a single point, $x=\mu$, and we no longer have a probability density, in the usual way of thinking of one.
The density shrinks to a point mass and the distribution is said to be [degenerate](https://en.wikipedia.org/wiki/Degenerate_distribution).
Similar constraints apply to a covariance matrix, $\bbSigma$.
Because the covariance of the i'th and j'th elements does not depend upon the order in which we write them, $\bbSigma$ must be symmetric.
That is,
$$
\bbSigma' = \bbSigma .
$$ {#eq-Sigmasym}
Furthermore, to define a proper multivariate density, $\bbSigma$ must be [positive definite](https://en.wikipedia.org/wiki/Definite_matrix), which means that for any non-zero vector, $\bbx$, the *quadratic form* defined by $\bbSigma$ must be positive.
That is
$$
\bbx'\bbSigma\bbx>0,\quad\forall\,\bbx\ne\mathbf{0} .
$$ {#eq-positiveDef}
(the symbol $\forall$ means "for all").
Positive definiteness implies that the *precision matrix*, $\bbSigma^{-1}$, exists and is also positive definite.
It also implies that there are "matrix square roots" of $\bbSigma$ in the sense that there are matrices $\mathbf{A}$ such that $\mathbf{A}'\mathbf{A}=\bbSigma$.
(The reason for writing $\mathbf{A}'\mathbf{A}$ and not simply the square of $\mathbf{A}$ is that $\mathbf{A}$ is not required to be symmetric but $\mathbf{A}'\mathbf{A}$ will be symmetric, even if $\mathbf{A}$ is not.)
One such "square root" of a positive definite $\bbSigma$ is the [Cholesky factor](https://en.wikipedia.org/wiki/Cholesky_decomposition), which corresponds to an $n\times n$ upper-triangular matrix, $\bbR$, such that
$$
\bbSigma=\bbR'\bbR .
$$ {#eq-uppercholesky}
This factor is usually called $\bbR$ because it appears without the transpose as the right-hand factor in @eq-uppercholesky.
An alternative expression is written with the lower-triangular $\mathbf{L}$ on the left as
$$
\bbSigma=\mathbf{L}\mathbf{L}',
$$ {#eq-lowerCholesky}
with the obvious relationship that $\mathbf{L}=\bbR'$.
To add to the confusion, the `cholesky` function in the *LinearAlgebra* package produces a factorization where the lower-triangular factor on the left is called `L` and the upper-triangular factor on the right is called `U`.
The factor $\bbR$ or $\mathbf{L}$ can be evaluated directly from the elements of $\bbSigma$.
For example, the non-zeros in the first two rows of $\mathbf{L}$ are evaluated as
$$
\begin{aligned}
\mathbf{L}_{1,1}&=\sqrt{\bbSigma_{1,1}}\\
\mathbf{L}_{2,1}&=\bbSigma_{2,1}/\mathbf{L}_{1,1}\\
\mathbf{L}_{2,2}&=\sqrt{\bbSigma_{2,2}-\mathbf{L}_{2,1}^2}
\end{aligned}
$$ {#eq-lowerCholtworows}
Evaluating the diagonal elements involves taking a square root.
By convention we choose the positive square root for the Cholesky factor with the result that the diagonal elements of $\mathbf{L}$ are all positive.
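The following chunk is a minimal illustration of @eq-lowerCholtworows, using a made-up $2\times 2$ covariance matrix and comparing the hand-computed elements to the factor returned by `cholesky` from *LinearAlgebra*.
```{julia}
Σ = [4.0 2.0; 2.0 3.0]        # a made-up positive definite covariance matrix
L11 = sqrt(Σ[1, 1])           # L[1,1] = √Σ[1,1]
L21 = Σ[2, 1] / L11           # L[2,1] = Σ[2,1] / L[1,1]
L22 = sqrt(Σ[2, 2] - L21^2)   # L[2,2] = √(Σ[2,2] - L[2,1]²)
[L11 0.0; L21 L22] ≈ cholesky(Σ).L   # agrees with the library routine
```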
### Some properties of triangular matrices
A triangular matrix with non-zero diagonal elements is non-singular.
One way to show this is to note that its [determinant](https://en.wikipedia.org/wiki/Determinant), written $\left|\mathbf{L}\right|$, which for a triangular matrix is the product of the diagonal elements, is non-zero.
In the case of a Cholesky factor the determinant will be positive because all the diagonal elements are positive.
A more straightforward way of showing that such a matrix is non-singular is to show how a triangular system of equations, like
$$
\mathbf{Lx}=\mathbf{b}
$$ {#eq-triangularsystem}
can be solved.
In the case of a lower-triangular system the method is called *forward solution*, with the sequence of scalar equations
$$
\begin{aligned}
x_1&=b_1/\mathbf{L}_{1,1}\\
x_2&=\left(b_2-x_1\mathbf{L}_{2,1}\right)/\mathbf{L}_{2,2}\\
x_3&=\left(b_3-x_1\mathbf{L}_{3,1}-x_2\mathbf{L}_{3,2}\right)/\mathbf{L}_{3,3}
\end{aligned}
$$ {#eq-forwardsolve}
and so on.
One point to note here is that $b_1$ is not needed after $x_1$ is evaluated, $b_2$ is not needed after $x_2$ is evaluated, and so on.
That is, the forward solution can be carried out *in place* with each element of $\mathbf{b}$ overwriting the corresponding element of $\bbx$.
This property is useful for avoiding allocation of storage in, for example, evaluation of an objective function during optimization.
The corresponding method of solving an upper-triangular system of equations is called *backward solution*, where $x_n$ is evaluated first, then $x_{n-1}$, and so on.
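As a concrete sketch of @eq-forwardsolve, the following chunk solves a small, made-up lower-triangular system in place, overwriting a copy of $\mathbf{b}$ with the solution, and checks the result against the `\` operator.
```{julia}
L = LowerTriangular([2.0 0.0 0.0; 1.0 3.0 0.0; 4.0 5.0 6.0])
b = [2.0, 7.0, 32.0]
x = copy(b)                      # the solution overwrites this copy of b
for i in eachindex(x)
    for j in 1:(i - 1)
        x[i] -= L[i, j] * x[j]   # subtract contributions of known components
    end
    x[i] /= L[i, i]
end
x ≈ L \ b                        # agrees with the library's triangular solve
```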
Repeated forward solution (or backward solution for upper triangular) can be used to evaluate the inverse, $\mathbf{L}^{-1}$, of a lower triangular matrix, $\mathbf{L}$.
However, a general rule in numerical linear algebra is that [you rarely need to evaluate the full inverse of a matrix](https://nhigham.com/2022/10/11/seven-sins-of-numerical-linear-algebra/).
Solving a triangular system like @eq-triangularsystem by evaluating $\mathbf{L}^{-1}$ and forming the product
$$
\bbx = \mathbf{L}^{-1}\mathbf{b}
$$ {#eq-inversesolve}
involves doing roughly $n$ times as much work as solving the system directly, as in @eq-forwardsolve.
Requiring that the inverse of a matrix be evaluated to solve a linear system is like insisting that a quotient, $a/b$, be evaluated by first calculating $b^{-1}$, the reciprocal of $b$, and then forming the product $b^{-1}a$, instead of evaluating the quotient directly.
In a derivation we may write an expression like $\mathbf{L}^{-1}\mathbf{b}$ but the evaluation is performed by solving a system like @eq-forwardsolve.
### Positive definiteness and the Cholesky factor {#sec-cholesky}
It turns out that the ability to form the Cholesky factor (that is, to have all the quantities like $\bbSigma_{2,2}-\mathbf{L}_{2,1}^2$, whose square roots form the diagonal of $\mathbf{L}$, evaluate to positive numbers) is equivalent to $\bbSigma$ being positive definite.
It is straightforward to show that having a Cholesky factor implies that $\bbSigma$ is positive definite, because
$$
\bbx'\bbSigma\bbx = \bbx'\bbR'\bbR\bbx=\left(\mathbf{Rx}\right)'\mathbf{Rx}=\left\|\mathbf{Rx}\right\|^2
$$ {#eq-cholimpliesposdef}
where $\left\|\mathbf{v}\right\|^2$ is the squared length of the vector $\mathbf{v}$.
Because $\bbR$ is non-singular, $\bbx\ne\mathbf{0}\implies\mathbf{Rx}\ne\mathbf{0}$ and the squared length in @eq-cholimpliesposdef is greater than zero.
The other direction is a bit more complicated to prove but essentially it amounts to showing that if the process of generating the Cholesky factor requires the square root of a non-positive number to obtain a diagonal element then there is a direction in which the quadratic form gives a non-positive result.
In practice, the easiest way to check a symmetric matrix to see if it is positive definite is to attempt to evaluate the Cholesky factor and check whether that succeeds.
This is exactly what the `isposdef` methods in the `LinearAlgebra` package do.
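A small, made-up example: `isposdef` succeeds for a positive definite matrix and fails for a symmetric but indefinite one, and the same information is available from `cholesky` with `check=false` via `issuccess`.
```{julia}
S1 = [2.0 1.0; 1.0 2.0]    # positive definite
S2 = [1.0 2.0; 2.0 1.0]    # symmetric but indefinite
ch2 = cholesky(Symmetric(S2); check=false)   # attempt the factorization without throwing
isposdef(S1), isposdef(S2), issuccess(ch2)
```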
### Density of the multivariate Gaussian {#sec-mvgaussian}
For the general multivariate normal distribution, $\mcN(\bbmu,\bbSigma)$, where $\bbSigma$ is positive definite with lower Cholesky factor $\mathbf{L}$, the probability density function is
$$
\begin{aligned}
f(\bbx;\bbmu,\bbSigma)&=
\frac{1}{\sqrt{(2\pi)^n\left|\bbSigma\right|}}
\exp\left(\frac{-[\bbx-\bbmu]'\bbSigma^{-1}[\bbx-\bbmu]}{2}\right)\\
&=\frac{1}{\sqrt{(2\pi)^n}\left|\mathbf{L}\right|}
\exp\left(\frac{-[\bbx-\bbmu]'{\mathbf{L}'}^{-1}\mathbf{L}^{-1}[\bbx-\bbmu]}{2}\right)\\
&=\frac{1}{\sqrt{(2\pi)^n}\left|\mathbf{L}\right|}
\exp\left(\frac{-\left\|\mathbf{L}^{-1}[\bbx-\bbmu]\right\|^2}{2}\right)\\
\end{aligned}
$$ {#eq-mvndensity}
and the standardizing transformation becomes
$$
\mathbf{z}=\mathbf{L}^{-1}[\bbx-\bbmu] ,
$$ {#eq-mvnstandardizing}
which, in practice, means using forward solution on the lower-triangular system of equations
$$
\mathbf{Lz}=\bbx-\bbmu .
$$ {#eq-mvnstandardizingsol}
Note that the standardizing transformation gives us a way to simulate values from a general $n$-dimensional multivariate Gaussian, $\mcX\sim\mcN(\bbmu,\bbSigma)$ as
$$
\bbx=\bbmu+\mathbf{L}\mathbf{z}
$$ {#eq-simulatemvn}
where $\mathbf{z}$ is simulated from the $n$-dimensional *standard multivariate Gaussian*, $\mcZ\sim\mcN(\mathbf{0},\bbI)$, which is $n$ independent univariate standard normal distributions.
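As a brief illustration of @eq-simulatemvn with a made-up mean vector and covariance matrix, we simulate a large sample and check that its sample mean and covariance are close to $\bbmu$ and $\bbSigma$.
```{julia}
using Statistics: cov, mean     # sample mean and covariance
μ = [1.0, -2.0]
Σ = [4.0 2.0; 2.0 3.0]
L = cholesky(Σ).L
sims = reduce(hcat, [μ + L * randn(2) for _ in 1:100_000])'   # one draw per row
mean(sims; dims=1), cov(sims)   # should be close to μ' and Σ
```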
### Linear functions of a multivariate Gaussian
In general, if $\mcX$ is an $n$-dimensional random variable with mean $\bbmu$ and covariance matrix $\bbSigma$, and $\mathbf{A}$ is a matrix with $n$ columns then the mean and variance of $\mcU=\mathbf{A}\mcX$ are given by
$$
\require{unicode}
𝔼\left[\mcU\right] =
𝔼\left[\mathbf{A}\mcX\right] =
\mathbf{A}𝔼\left[\mcX\right] =
\mathbf{A}\bbmu
$$ {#eq-mvexpectedlin}
and
$$
\begin{aligned}
\text{Var}\left(\mcU\right)
&=𝔼\left[\left(\mcU-𝔼\left[\mcU\right]\right)\left(\mcU-𝔼\left[\mcU\right]\right)'\right]\\
&=𝔼\left[\left(\mathbf{A}\mcX-\mathbf{A}\bbmu\right)\left(\mathbf{A}\mcX-\mathbf{A}\bbmu\right)'\right]\\
&=𝔼\left[\mathbf{A}\left(\mcX-\bbmu\right)\left(\mcX-\bbmu\right)'\mathbf{A}'\right]\\
&=\mathbf{A}\,𝔼\left[\left(\mcX-\bbmu\right)\left(\mcX-\bbmu\right)'\right]\mathbf{A}'\\
&=\mathbf{A}\text{Var}(\mcX)\mathbf{A}'\\
&=\mathbf{A}\bbSigma\mathbf{A}'
\end{aligned}
$$ {#eq-mvvarlin}
A linear function, $\mcU=\mathbf{A}\mcX$, of a multivariate Gaussian distribution, $\mcX\sim\mcN(\bbmu,\bbSigma)$, is also Gaussian and these relationships imply that
$$
\mcU\sim\mcN(\mathbf{A}\bbmu, \mathbf{A}\bbSigma\mathbf{A}')
$$ {#eq-mvnlinfunc}
For the special case of $\mathbf{A}$ being of dimension $1\times n$ (i.e. a *row vector*), the expression for the $1\times 1$ covariance matrix is the quadratic form defined by $\bbSigma$, which is why $\bbSigma$ must be positive definite for the distribution of any non-trivial linear combination of the elements to be non-degenerate.
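A quick numerical illustration of @eq-mvnlinfunc with made-up $\mathbf{A}$, $\bbmu$, and $\bbSigma$: the mean vector and covariance matrix of $\mcU=\mathbf{A}\mcX$ are evaluated directly from the formulas.
```{julia}
A = [1.0 2.0; 0.0 1.0; 1.0 -1.0]   # a made-up 3 × 2 matrix of coefficients
μ = [1.0, -2.0]
Σ = [4.0 2.0; 2.0 3.0]
A * μ, A * Σ * A'                  # mean and covariance of U = A * X
```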
## Back at the linear model
The probability density function for the linear model, @eq-mvnlinmod, is
$$
\begin{aligned}
f(\bby; \bbbeta, \sigma^2)&=
\frac{1}{\sqrt{2\pi\left|\sigma^2\bbI\right|}}
\exp\left(\frac{-[\bby-\bbX\bbbeta]'
\left(\sigma^2\bbI\right)^{-1}[\bby-\bbX\bbbeta]}{2}\right)\\
&=\left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\left\|\bby-\bbX\bbbeta\right\|^2/\left(2\sigma^2\right)\right)
\end{aligned}
$$ {#eq-linmoddensity}
@eq-linmoddensity describes the density of the random variable, $\mcY$, representing the observations, given the values of the parameters, $\bbbeta$ and $\sigma^2$.
For parameter estimation we use the *likelihood function*, which is the same expression as @eq-linmoddensity but regarded as a function of the parameters, $\bbbeta$ and $\sigma^2$, with the observed response, $\bby$, fixed.
$$
L(\bbbeta,\sigma^2;\bby)=
\left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\left\|\bby-\bbX\bbbeta\right\|^2/\left(2\sigma^2\right)\right)
$$ {#eq-linmodlikelihood}
The [maximum likelihood](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) estimates of the parameters are, as the name implies, the values of $\bbbeta$ and $\sigma^2$ that maximize the expression on the right of @eq-linmodlikelihood .
Because the logarithm is a [monotone increasing](https://en.wikipedia.org/wiki/Monotonic_function) function, the maximum likelihood estimates will also maximize the *log-likelihood*
$$
\begin{aligned}
\ell(\bbbeta,\sigma^2;\bby)
&=\log L(\bbbeta,\sigma^2;\bby)\\
&=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{\left\|\bby-\bbX\bbbeta\right\|^2}{2\sigma^2}
\end{aligned}
$$ {#eq-linmodloglike}
Usually the log-likelihood is easier to optimize, either algebraically or numerically, than the likelihood itself.
To avoid the negative signs and the factors of 2 in the denominator, we often convert the log-likelihood to the *deviance scale*, which is negative twice the log-likelihood,
$$
\begin{aligned}
d(\bbbeta,\sigma^2;\bby)
&=-2\ell(\bbbeta,\sigma^2; \bby)\\
&=n\log(2\pi\sigma^2)+\frac{\left\|\bby-\bbX\bbbeta\right\|^2}{\sigma^2} .
\end{aligned}
$$ {#eq-linmoddevscale}
Because of the negative sign, the maximum likelihood estimates are those that *minimize* $d(\bbbeta,\sigma^2;\bby)$.
(The term *deviance scale* is used for $d(\bbbeta,\sigma^2;\bby)$ rather than [deviance](https://en.wikipedia.org/wiki/Deviance_(statistics)) because the deviance involves an additive shift, a correction for the saturated model; see the link.
It is obvious what the saturated model should be for the linear model but not for the linear mixed model, so, to avoid confusion, we refer to the log-likelihood on the deviance scale as the *objective*.)
The form of @eq-linmoddevscale makes it easy to determine the maximum likelihood estimates.
Because $\bbbeta$ appears only in the sum of squared residuals expression, $\|\bby-\bbX\bbbeta\|^2$, we minimize that with respect to $\bbbeta$
$$
\widehat{\bbbeta}=
\arg\min_{\bbbeta}\|\bby-\bbX\bbbeta\|^2 ,
$$ {#eq-leastsquaresest}
where $\arg\min_{\bbbeta}$ means the value of $\bbbeta$ that minimizes the expression that follows.
Let $r^2(\widehat{\bbbeta}) = \left\|\bby-\bbX\widehat{\bbbeta}\right\|^2$ be the minimum sum of squared residuals.
Substituting this value into @eq-linmoddevscale, differentiating with respect to $\sigma^2$, and setting this derivative to zero gives
$$
\widehat{\sigma^2}=\frac{r^2(\widehat{\bbbeta})}{n} .
$$
## Minimizing the sum of squared residuals
A condition for $\widehat{\bbbeta}$ to minimize the sum of squared residuals is that the *gradient*
$$
\nabla r^2(\bbbeta)=-2\bbX'(\bby-\bbX\bbbeta)
$$ {#eq-sumsqgrad}
be zero at $\widehat{\bbbeta}$.
This condition can be rewritten as
$$
\bbX'\bbX\widehat{\bbbeta}=\bbX'\bby ,
$$ {#eq-normaleq}
which are called the *normal equations*.
The term *normal* in this expression comes from the fact that requiring the gradient, @eq-sumsqgrad, to be zero is equivalent to requiring that the *residual vector*, $\bby-\bbX\widehat{\bbbeta}$, be perpendicular, or *normal*, to the columns of $\bbX$.
When the model matrix, $\bbX$, is of *full column rank*, which means
$$
\bbX\mathbf{\bbbeta}\ne\mathbf{0}\quad\forall\bbbeta\ne\mathbf{0} ,
$$ {#eq-fullcolumnrank}
then the quadratic form defined by $\bbX'\bbX$ is positive definite and has a Cholesky factor, say $\bbR_{XX}$, and the normal equations can be solved in two stages.
First, solve
$$
\bbR_{XX}'\mathbf{r}_{Xy}=\bbX'\bby
$$ {#eq-rXydef}
for $\mathbf{r}_{Xy}$ using forward solution, then solve
$$
\bbR_{XX}\widehat{\bbbeta}=\mathbf{r}_{Xy}
$$ {#eq-betahatchol}
for $\widehat{\bbbeta}$ using backward solution.
An alternative approach is to write the residual sum of squares as a quadratic form
$$
\begin{aligned}
r^2(\bbbeta)&=\|\bby-\bbX\bbbeta\|^2\\
&=(\bby-\bbX\bbbeta)'(\bby-\bbX\bbbeta)\\
&=(\bbX\bbbeta-\bby)'(\bbX\bbbeta-\bby)\\
&=\begin{bmatrix}\bbbeta&-1\end{bmatrix}
\begin{bmatrix}
\bbX'\bbX & \bbX'\bby\\
\bby'\bbX & \bby'\bby
\end{bmatrix}
\begin{bmatrix}
\bbbeta\\
-1
\end{bmatrix}\\
&=\begin{bmatrix}\bbbeta&-1\end{bmatrix}
\begin{bmatrix}
\bbR_{XX}' & \mathbf{0}\\
\mathbf{r}_{Xy}' & r_{yy}
\end{bmatrix}
\begin{bmatrix}
\bbR_{XX} & \mathbf{r}_{Xy}\\
\mathbf{0} & r_{yy}
\end{bmatrix}
\begin{bmatrix}
\bbbeta\\
-1
\end{bmatrix}\\
&=\left\|
\begin{bmatrix}
\bbR_{XX} & \mathbf{r}_{Xy}\\
\mathbf{0} & r_{yy}
\end{bmatrix}
\begin{bmatrix}
\bbbeta\\
-1
\end{bmatrix}\right\|^2\\
&=\left\|\bbR_{XX}\bbbeta-\mathbf{r}_{Xy}\right\|^2+r_{yy}^2
\end{aligned}
$$ {#eq-extendedqf}
The first term, $\left\|\bbR_{XX}\bbbeta-\mathbf{r}_{Xy}\right\|^2$, is non-negative and can be made zero by solving @eq-betahatchol for $\widehat{\bbbeta}$.
Thus, the minimum sum of squared residuals is $r_{yy}^2$.
One consequence of this derivation is that the minimum sum of squared residuals can be evaluated directly from the extended Cholesky factor
$$
\begin{bmatrix}
\bbR_{XX} & \mathbf{r}_{Xy}\\
\mathbf{0} & r_{yy}
\end{bmatrix}
$$ {#eq-extendedcholfac}
without needing to solve for $\widehat{\bbbeta}$ first.
This is not terribly important for a linear model where the evaluation of $\widehat{\bbbeta}$ and the residual is typically done only once.
However, for the linear mixed model, a similar calculation must be done for every evaluation of the objective in the iterative optimization, and being able to evaluate the minimum penalized sum of squared residuals without solving for parameter values and without needing to evaluate the residual saves a non-negligible amount of time and effort.
## Numerical example
Suppose we wish to fit a simple linear regression model of reaction time as a function of days of sleep deprivation to the data from subject `S372` in the `sleepstudy` dataset.
```{julia}
sleepstudy = Table(dataset(:sleepstudy))
S372 = sleepstudy[sleepstudy.subj .== "S372"]
```
The model matrix and the response vector can be constructed as
```{julia}
X = hcat(ones(length(S372)), S372.days)
```
and
```{julia}
y = S372.reaction
show(y)
```
from which we obtain the Cholesky factor
```{julia}
chfac = cholesky!(X'X)
```
(Recall that the upper triangular Cholesky factor is the `U` property of the `Cholesky` type.)
The `\` operator with a `Cholesky` factor on the left performs both the forward and backward solutions to obtain the least squares estimates
```{julia}
β̂ = chfac \ (X'y)
```
Alternatively, we could carry out the two solutions of the triangular systems explicitly by first solving for $\mathbf{r}_{Xy}$
```{julia}
rXy = ldiv!(chfac.L, X'y)
```
then solving in-place to obtain $\widehat{\bbbeta}$
```{julia}
ldiv!(chfac.U, rXy)
```
The residual vector, $\bby-\bbX\widehat{\bbbeta}$, is
```{julia}
r = y - X * β̂
```
with geometric length or "norm",
```{julia}
norm(r)
```
For the extended Cholesky factor, create the extended matrix of sums of squares and cross products
```{julia}
crprod = let x = S372.days
Symmetric(
[
length(x) sum(x) sum(y)
0.0 sum(abs2, x) dot(x, y)
0.0 0.0 sum(abs2, y)
],
:U,
)
end
```
The call to `Symmetric` with the second argument the symbol `:U` indicates that the matrix should be treated as symmetric but only the upper triangle is given.
The Cholesky factor of `crprod` reproduces $\bbR_{XX}$, $\mathbf{r}_{Xy}$, and the norm of the residual, $r_{yy}$.
```{julia}
extchfac = cholesky(crprod)
```
It also provides the information from which the parameter estimates can be evaluated.
```{julia}
β̂ ≈ ldiv!(
UpperTriangular(view(extchfac.U, 1:2, 1:2)),
copy(view(extchfac.U, 1:2, 3)),
)
```
The operator `≈` is a check of approximate equality of floating point numbers or arrays.
Exact equality of floating point results from "equivalent" calculations cannot be relied upon.
Similarly we check that the value of $r_{yy}$ is approximately equal to the norm of the residual vector.
```{julia}
norm(r) ≈ extchfac.U[3, 3]
```
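As a final step in this example, the estimate $\widehat{\sigma^2}=r^2(\widehat{\bbbeta})/n$ and the deviance-scale objective from @eq-linmoddevscale can be evaluated from $r_{yy}$ in the extended Cholesky factor alone, without forming the residual vector.
```{julia}
n = length(y)
σ̂² = abs2(extchfac.U[3, 3]) / n      # r²(β̂)/n, using r_yy from the factor
dev = n * (1 + log(2π * σ̂²))         # the objective evaluated at β̂ and σ̂²
(σ̂², dev, σ̂² ≈ sum(abs2, r) / n)     # agrees with the explicit residual vector
```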
## Alternative decompositions of X {#sec-matrixdecomp}
There are two other decompositions of the model matrix $\bbX$ or the augmented model matrix $[\mathbf{X,y}]$ that can be used to evaluate the least squares estimates; the [QR decomposition](https://en.wikipedia.org/wiki/QR_decomposition) and the [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition).
The QR decomposition expresses $\bbX$ as the product of an *orthogonal* matrix, $\mathbf{Q}$, and an upper triangular matrix $\bbR$.
The upper triangular $\bbR$ is related to the upper triangular Cholesky factor in that the numerical values are the same but the signs can be different.
In particular, the usual way of creating $\mathbf{Q}$ and $\bbR$ using [Householder transformations](https://en.wikipedia.org/wiki/Householder_transformation) typically results in the first row of $\bbR$ from the `qr` function being the negative of the first row of the upper Cholesky factor.
```{julia}
qrfac = qr(X)
qrfac.R
```
Just as the Cholesky factor can be used on the left of the `\` operator, so can the `qr` factor but with `y` on the right.
```{julia}
b3 = qrfac \ y
```
The matrix $\bbR$ is returned as a square matrix with the same number of columns as $\bbX$.
That is, if $\bbX$ is of size $n\times p$ where $n>p$, as in the example, then $\bbR$ is $p\times p$, as shown above.
The matrix $\mathbf{Q}$ is usually considered to be an $n\times n$ orthogonal matrix, which means that its transpose is its inverse
$$
\mathbf{Q'Q}=\mathbf{QQ'}=\bbI
$$ {#eq-orthogonalQ}
To form the product $\mathbf{QR}$ the matrix $\bbR$ is treated as if it were $n\times p$ with zeros below the main diagonal.
The $n\times n$ matrix $\mathbf{Q}$ can be very large if $n$, the number of observations, is large but it does not need to be explicitly evaluated.
In practice $\mathbf{Q}$ is a "virtual" matrix represented as a product of Householder reflections that only require storage of the size of $\bbX$.
The effect of multiplying a vector or matrix by $\mathbf{Q}$ or by $\mathbf{Q}'$ is achieved by applying the Householder reflections in a particular order.
```{julia}
rXy2 = qrfac.Q'y
```
```{julia}
b4 = ldiv!(UpperTriangular(qrfac.R), rXy2[1:2])
```
Forming the QR decomposition is a direct, non-iterative, calculation, like forming the Cholesky factor.
Forming the SVD, by contrast, is usually an iterative calculation.
(It should be noted that modern methods for evaluating the SVD are very fast for an iterative calculation.)
The SVD consists of two orthogonal matrices, the $n\times n$ $\mathbf{U}$ and the $p\times p$ $\mathbf{V}$, and an $n\times p$ matrix $\mathbf{S}$ that is zero off the main diagonal, where
$$
\bbX=\mathbf{USV'} .
$$
Unlike the $\mathbf{Q}$ in the QR decomposition, the orthogonal matrices $\mathbf{U}$ and $\mathbf{V}$ are explicitly evaluated.
Because of this, the default for the `svd` function is to produce a compact form where $\mathbf{U}$ is $n\times p$ and only the diagonal of $\mathbf{S}$ is returned.
```{julia}
Xsvd = svd(X)
```
(In general, the `Vt` factor from this decomposition is $\mathbf{V}'$, the transpose of $\mathbf{V}$.
The distinction is not important in this case because $\mathbf{V}$ is symmetric.)
If all the singular values are non-zero, as is the case here, the least squares solution $\widehat{\bbbeta}$ can be obtained as
$$
\mathbf{V}\mathbf{S}^{-1}\mathbf{U}'\bby
$$ {#eq-pseudoinv}
where $\mathbf{S}^{-1}$ is the diagonal matrix of reciprocals of the singular values.
```{julia}
b5 = Xsvd.V * (Xsvd.U'y ./ Xsvd.S)
```
In the extensions to linear mixed-effects models we will emphasize the Cholesky factorization over the QR decomposition or the SVD.
## Linear mixed-effects models {#sec-lmmtheory}
As described in @bates.maechler.etal:2015 , a linear mixed-effects model is based on two vector-valued random variables: the $q$-dimensional vector of random effects, $\mcB$, and the $n$-dimensional response vector, $\mcY$. @eq-lmmdist defines the unconditional distribution of $\mcB$ and the conditional distribution of $\mcY$, given $\mcB=\mathbf{b}$, as multivariate Gaussian distributions of the form
$$
\begin{aligned}
(\mcY|\mcB=\mathbf{b})&\sim\mcN(\bbX\bbbeta+\mathbf{Z}\mathbf{b},\sigma^2\bbI)\\
\mcB&\sim\mcN(\mathbf{0},\bbSigma_\theta) .
\end{aligned}
$$
The $q\times q$, symmetric, variance-covariance matrix, $\mathrm{Var}(\mcB)=\bbSigma_\theta$, depends on the *variance-component parameter vector*, $\bbtheta$, through a lower triangular *relative covariance factor*, $\bbLambda_\theta$, as
$$
\bbSigma_\theta=\sigma^2\bbLambda_\theta\bbLambda_\theta' .
$$ {#eq-relcovfac}
(Recall that the lower Cholesky factor is generally written $\mathbf{L}$.
In this case the lower Cholesky factor contains parameters and is named with the corresponding Greek letter, $\bbLambda$.)
In many descriptions of linear mixed models, computational formulas are written in terms of the *precision matrix*, $\bbSigma_\theta^{-1}$.
Such formulas will become unstable as $\bbSigma_\theta$ approaches singularity.
And it can do so: singular (i.e. non-invertible) values of $\bbSigma_\theta$ can and do occur in practice, as we have seen in some of the examples in earlier chapters.
Moreover, during the course of the numerical optimization by which the parameter estimates are determined, it is frequently the case that the deviance or the REML criterion will need to be evaluated at values of $\bbtheta$ that produce a singular $\bbSigma_\theta$.
Because of this we will take care to use computational methods that can be applied even when $\bbSigma_\theta$ is singular and are stable as $\bbSigma_\theta$ approaches singularity.
According to @eq-relcovfac, $\bbSigma$ depends on both $\sigma$ and $\theta$, and we should write it as $\bbSigma_{\sigma,\theta}$.
However, we will blur that distinction and continue to write $\text{Var}(\mcB)=\bbSigma_\theta$.
Another technicality is that the *common scale parameter*, $\sigma$, could, in theory, be zero.
However, the only way for its estimate, $\widehat{\sigma}$, to be zero is for the fitted values from the fixed-effects only, $\bbX\widehat{\bbbeta}$, to be exactly equal to the observed data.
This occurs only with data that have been (incorrectly) simulated without error.
In practice we can safely assume that $\sigma>0$.
However, $\bbLambda_\theta$, like $\bbSigma_\theta$, can be singular.
The computational methods in the *MixedModels* package are based on $\bbLambda_\theta$ and do not require evaluation of $\bbSigma_\theta$.
In fact, $\bbSigma_\theta$ is explicitly evaluated only at the converged parameter estimates.
The spherical random effects, $\mcU\sim\mcN(\mathbf{0},\sigma^2\bbI_q)$, determine $\mcB$ as
$$
\mcB=\bbLambda_\theta\mcU .
$$ {#eq-sphericalre}
Although it may seem more intuitive to write $\mcU$ as a linear transformation of $\mcB$, we cannot do that when $\bbLambda_\theta$ is singular, which is why @eq-sphericalre is in the form shown.
We can easily verify that @eq-sphericalre provides the desired distribution for $\mcB$.
As a linear transformation of a multivariate Gaussian random variable, $\mcB$ will also be multivariate Gaussian with mean
$$
𝔼\left[\mcB\right]=
𝔼\left[\bbLambda_\theta\mcU\right]=
\bbLambda_\theta\,𝔼\left[\mcU\right]=
\bbLambda_\theta\mathbf{0}=\mathbf{0}
$$
and covariance matrix
$$
\text{Var}(\mcB)=
\bbLambda_\theta\text{Var}(\mcU)\bbLambda_\theta'=
\sigma^2\bbLambda_\theta\bbLambda_\theta'=\bbSigma_\theta
$$
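The following chunk is a small numerical check of @eq-sphericalre with a made-up, deliberately singular, relative covariance factor: the sample covariance of simulated $\mathbf{b}=\bbLambda_\theta\mathbf{u}$ values should be close to $\sigma^2\bbLambda_\theta\bbLambda_\theta'$, even though that matrix is singular.
```{julia}
using Statistics: cov
λ = LowerTriangular([1.0 0.0; 0.5 0.0])   # made-up and singular: second column is zero
σ = 2.0
usims = σ .* randn(2, 100_000)            # columns are draws from N(0, σ²I₂)
bsims = λ * usims                         # corresponding draws of B = λu
cov(bsims; dims=2), σ^2 * λ * λ'          # close for a large sample
```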
Just as we concentrate on how $\bbtheta$ determines $\bbLambda_\theta$, not $\bbSigma_\theta$, we will concentrate on properties of $\mcU$ rather than $\mcB$.
In particular, we now define the model according to the distributions
$$
\begin{aligned}
(\mcY|\mcU=\mathbf{u})&\sim\mcN(\mathbf{Z}\bbLambda_\theta\mathbf{u}+\bbX\bbbeta,\sigma^2\bbI_n)\\
\mcU&\sim\mcN(\mathbf{0},\sigma^2\bbI_q) .
\end{aligned}
$$ {#eq-condygivenu}
The joint density for $\mcY$ and $\mcU$ is the product of densities of the two distributions in @eq-condygivenu.
That is
$$
f_{\mcY,\mcU}(\bby,\mathbf{u})=
\frac{1}{\left(2\pi\sigma^2\right)^{(n+q)/2}}\exp
\left(\frac{\left\|\bby-\bbX\bbbeta
-\mathbf{Z}\bbLambda_\theta\mathbf{u}\right\|^2+
\left\|\mathbf{u}\right\|^2}{-2\sigma^2}\right) .
$$ {#eq-yujointdensity}
To evaluate the likelihood for the parameters, $\bbtheta$, $\bbbeta$, and $\sigma^2$, given the observed response, $\bby$, we must evaluate the marginal distribution of $\mcY$, which is the integral of $f_{\mcY,\mcU}(\bby,\mathbf{u})$ with respect to $\mathbf{u}$.
This is much simpler if we rewrite the *penalized sum of squared residuals*, $\left\|\bby-\bbX\bbbeta
-\mathbf{Z}\bbLambda_\theta\mathbf{u}\right\|^2+
\left\|\mathbf{u}\right\|^2$, from @eq-yujointdensity, as a quadratic form in $\mathbf{u}$, to isolate the dependence on $\mathbf{u}$
$$
\begin{aligned}
r^2_\theta(\mathbf{u},\bbbeta)
&=
\|\bby-\bbX\bbbeta-\mathbf{Z}\bbLambda_\theta\mathbf{u}\|^2+\|\mathbf{u}\|^2 \\
&=
\left\|
\begin{bmatrix}
\mathbf{Z}\bbLambda_\theta & \bbX & \bby \\
-\bbI_q & \mathbf{0} & \mathbf{0}
\end{bmatrix}
\begin{bmatrix}
-\mathbf{u} \\
-\bbbeta \\
1
\end{bmatrix}
\right\|^2 \\
&=
\begin{bmatrix}
-\mathbf{u} \\
-\bbbeta \\
1
\end{bmatrix} '
\begin{bmatrix}
\bbLambda'\mathbf{Z}'\mathbf{Z}\bbLambda+\bbI & \bbLambda'\mathbf{Z}'\bbX & \bbLambda'\mathbf{Z}'\bby \\
\bbX'\mathbf{Z}\bbLambda & \bbX'\bbX & \bbX'\bby \\
\bby'\mathbf{Z}\bbLambda & \bby'\bbX & \bby'\bby
\end{bmatrix}
\begin{bmatrix}
-\mathbf{u} \\
-\bbbeta \\
1
\end{bmatrix} \\
&= \left\|
\begin{bmatrix}
\bbR_{ZZ} & \bbR_{ZX} & \mathbf{r}_{Zy}\\
\mathbf{0} & \bbR_{XX} & \mathbf{r}_{Xy}\\
\mathbf{0} & \mathbf{0} & r_{yy}
\end{bmatrix}
\begin{bmatrix}
-\mathbf{u} \\
-\bbbeta \\
1
\end{bmatrix}
\right\|^2\\
&= \| \mathbf{r}_{Zy}-\bbR_{ZX}\bbbeta-\bbR_{ZZ}\mathbf{u} \|^2 +
\| \mathbf{r}_{Xy}-\bbR_{XX}\bbbeta\|^2 + r_{yy}^2 ,
\end{aligned}
$$ {#eq-penalized-rss}
using the Cholesky factor of the blocked matrix,
$$
\begin{aligned}
\bbOmega_\theta&=
\begin{bmatrix}
\bbLambda_\theta'\mathbf{Z'Z}\bbLambda_\theta+\bbI &
\bbLambda_\theta'\mathbf{Z'X} & \bbLambda_\theta'\mathbf{Z'y} \\
\mathbf{X'Z}\bbLambda_\theta & \mathbf{X'X} & \mathbf{X'y} \\
\mathbf{y'Z}\bbLambda_\theta & \mathbf{y'X} & \mathbf{y'y}
\end{bmatrix}\\
& =
\begin{bmatrix}
\bbR_{ZZ}' & \mathbf{0} & \mathbf{0} \\
\bbR_{ZX}' & \bbR'_{XX} & \mathbf{0} \\
\mathbf{r}_{Zy}' & \mathbf{r}'_{Xy} & r_{yy}
\end{bmatrix}
\begin{bmatrix}
\bbR_{ZZ} & \bbR_{ZX} & \mathbf{r}_{Zy} \\
\mathbf{0} & \bbR_{XX} & \mathbf{r}_{Xy} \\
\mathbf{0} & \mathbf{0} & r_{yy}
\end{bmatrix} .
\end{aligned}
$$ {#eq-bigCholfac}
Note that the block in the upper left, $\bbLambda_\theta'\mathbf{Z'Z}\bbLambda_\theta+\bbI$, is positive definite even when $\bbLambda_\theta$ is singular, because
$$
\mathbf{u}'\left(\bbLambda_\theta'\mathbf{Z'Z}\bbLambda_\theta+\bbI\right)\mathbf{u} = \left\|\mathbf{Z}\bbLambda_\theta\mathbf{u}\right\|^2
+\left\|\mathbf{u}\right\|^2
$$ {#eq-Cholfacupperleft}
and the first term is non-negative while the second is positive if $\mathbf{u}\ne\mathbf{0}$.
Thus $\bbR_{ZZ}$, with positive diagonal elements, can be evaluated and its determinant, $\left|\bbR_{ZZ}\right|$, is positive.
This determinant appears in the marginal density of $\mcY$, from which the likelihood of the parameters is evaluated.
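To make the structure of @eq-bigCholfac concrete, the following chunk builds $\bbOmega_\theta$ as a dense matrix from small, made-up $\mathbf{Z}$, $\bbX$, $\bby$, and $\bbLambda_\theta$, takes its Cholesky factor, and reads off the two quantities needed for the objective.
(The *MixedModels* package exploits the blocked, sparse structure rather than forming this dense matrix; the chunk only illustrates the algebra.)
```{julia}
nobs, q, p = 8, 2, 2
Z = rand(nobs, q)                       # made-up random-effects model matrix
Xm = hcat(ones(nobs), 1:nobs)           # made-up fixed-effects model matrix
yv = rand(nobs)                         # made-up response
Λ = LowerTriangular([1.2 0.0; 0.3 0.5]) # made-up relative covariance factor
W = hcat(Z * Λ, Xm, yv)                 # the columns [ZΛ X y]
Ω = Symmetric(W'W + Diagonal([ones(q); zeros(p + 1)]))  # the blocked matrix Ωθ
R = cholesky(Ω).U                       # its upper-triangular Cholesky factor
RZZ = R[1:q, 1:q]                       # the R_ZZ block
ryy = R[end, end]                       # r_yy
2log(det(RZZ)), abs2(ryy)               # the quantities needed for the objective
```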
To evaluate the likelihood,
$$
L(\bbtheta,\bbbeta,\sigma|\bby) = \int_\mathbf{u} f_{\mcY,\mcU}(\bby,\mathbf{u})\, d\mathbf{u}
$$ {#eq-likelihood-abstract}
we isolate the part of the joint density that depends on $\mathbf{u}$ and perform a change of variable
$$
\mathbf{v}=\bbR_{ZZ}\mathbf{u}+\bbR_{ZX}\bbbeta-\mathbf{r}_{Zy} .
$$ {#eq-u-system}
From the properties of the multivariate Gaussian distribution
$$
\begin{multline*}
\int_{\mathbf{u}}\frac{1}{(2\pi\sigma^2)^{q/2}}
\exp\left(-\frac{\|\bbR_{ZZ}\mathbf{u}+\bbR_{ZX}\bbbeta-\mathbf{r}_{Zy}\|^2}{2\sigma^2}\right)
\,d\mathbf{u}\\
\begin{aligned}
&= \int_{\mathbf{v}}\frac{1}{(2\pi\sigma^2)^{q/2}}
\exp\left(-\frac{\|\mathbf{v}\|^2}{2\sigma^2}\right)|\bbR_{ZZ}|^{-1}\,d\mathbf{v}\\
&=|\bbR_{ZZ}|^{-1}
\end{aligned}
\end{multline*}
$$ {#eq-likelihood-integral}
from which we obtain the likelihood as
$$
L(\bbtheta,\bbbeta,\sigma;\bby)=
\frac{|\bbR_{ZZ}|^{-1}}{(2\pi\sigma^2)^{n/2}}
\exp\left(-\frac{r_{yy}^2 + \|\bbR_{XX}(\bbbeta-\widehat{\bbbeta})\|^2}{2\sigma^2}\right) ,
$$ {#eq-likelihood}
where the conditional estimate, $\widehat{\bbbeta}$, given $\bbtheta$, satisfies
$$
\bbR_{XX}\widehat{\bbbeta} = \mathbf{r}_{Xy} .
$$
Setting $\bbbeta=\widehat{\bbbeta}$, taking the logarithm, and maximizing with respect to $\sigma^2$ provides the estimate of $\sigma^2$, given $\bbtheta$, as
$$
\widehat{\sigma^2}=\frac{r_{yy}^2}{n}
$$ {#eq-sigma-hat}
which gives the *profiled log-likelihood*,
$\ell(\bbtheta|\bby)=\log L(\bbtheta,\widehat{\bbbeta},\widehat{\sigma})$,
on the deviance scale, as
$$
-2\ell(\bbtheta|\bby)=2\log(|\bbR_{ZZ}|) +
n\left(1+\log\left(\frac{2\pi r_{yy}^2(\bbtheta)}{n}\right)\right)
$$ {#eq-profiled-log-likelihood}
One of the interesting aspects of this formulation is that it is not necessary to solve for the conditional estimate of $\bbbeta$ or the conditional modes of the random effects when evaluating the log-likelihood.
The two values needed for the log-likelihood evaluation, $2\log(|\bbR_{ZZ}|)$ and $r_{yy}^2$, are obtained directly from the diagonal elements of the Cholesky factor.
Furthermore, $\bbOmega_{\theta}$ and, from that, the Cholesky factor, $\bbR_{\theta}$, and the objective to be optimized can be evaluated for a given value of $\bbtheta$ from
$$
\mathbf{A} = \begin{bmatrix}
\mathbf{Z}^\prime\mathbf{Z} & \mathbf{Z}^\prime\bbX & \mathbf{Z}^\prime\bby \\
\bbX^\prime\mathbf{Z} & \bbX^\prime\bbX & \bbX^\prime\bby \\
\bby^\prime\mathbf{Z} & \bby^\prime\bbX & \bby^\prime\bby
\end{bmatrix}
$$ {#eq-A}
and $\bbLambda_{\theta}$.
In the `MixedModels` package the `LinearMixedModel` struct contains a symmetric blocked array in the `A` field and a similarly structured lower-triangular blocked array in the `L` field.
Evaluation of the objective simply involves updating the template matrices, $\bbLambda_i, i=1,\dots,k$, in the `ReMat` structures, then updating `L` from `A` and the $\bbLambda_i$.
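As a sketch of how this works in practice, the chunk below fits a model to the `sleepstudy` data loaded above and then re-evaluates the objective at a trial value of $\bbtheta$.
It relies on `MixedModels.setθ!` and `MixedModels.updateL!`, lower-level functions that, as we understand them, install a new value of $\bbtheta$ and update `L` from `A`; the exact calls are an assumption here, and only `objective` is part of the user-facing interface shown elsewhere in this book.
```{julia}
m = fit(MixedModel, @formula(reaction ~ 1 + days + (1 + days | subj)), sleepstudy)
θ̂ = copy(m.θ)                       # converged value of θ
obj_opt = objective(m)               # objective at θ̂
MixedModels.updateL!(MixedModels.setθ!(m, 0.8 .* θ̂))   # install a trial θ and update L
obj_trial = objective(m)             # objective at the trial θ; no smaller than obj_opt
MixedModels.updateL!(MixedModels.setθ!(m, θ̂))          # restore the converged value
(obj_opt, obj_trial)
```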
## The REML criterion {#sec-REML}
The so-called REML estimates of variance components are often preferred to the maximum likelihood estimates.
("REML" can be considered to be an acronym for "restricted" or "residual" maximum likelihood, although neither term is completely accurate because these estimates do not maximize a likelihood.)
We can motivate the use of the REML criterion by considering a linear regression model,
$$
\mcY\sim\mcN(\bbX\bbbeta,\sigma^2\bbI_n),
$$ {#eq-20}
in which we typically estimate $\sigma^2$ as
$$
\widehat{\sigma^2_R}=\frac{\|\bby-\bbX\widehat{\bbbeta}\|^2}{n-p}
$$ {#eq-21}
even though the maximum likelihood estimate of $\sigma^2$ is
$$
\widehat{\sigma^2_{L}}=\frac{\|\bby-\bbX\widehat{\bbbeta}\|^2}{n} .
$$ {#eq-22}
The argument for preferring $\widehat{\sigma^2_R}$ to $\widehat{\sigma^2_{L}}$ as an estimate of $\sigma^2$ is that the numerator in both estimates is the sum of squared residuals at $\widehat{\bbbeta}$ and, although the residual vector, $\bby-\bbX\widehat{\bbbeta}$, is an $n$-dimensional vector, it satisfies $p$ linearly independent constraints, $\bbX'(\bby-\bbX\widehat{\bbbeta})=\mathbf{0}$.
That is, the residual at $\widehat{\bbbeta}$ is the projection of the observed response vector, $\bby$, into an $(n-p)$-dimensional linear subspace of the $n$-dimensional response space.
The estimate $\widehat{\sigma^2_R}$ takes into account the fact that $\sigma^2$ is estimated from residuals that have only $n-p$ *degrees of freedom*.
Another argument often put forward for REML estimation is that $\widehat{\sigma^2_R}$ is an *unbiased* estimate of $\sigma^2$, in the sense that the expected value of the estimator is equal to the value of the parameter.
However, determining the expected value of an estimator involves integrating with respect to the density of the estimator and we have seen that densities of estimators of variances will be skewed, often highly skewed.
It is not clear why we should be interested in the expected value of a highly skewed estimator.
If we were to transform to a more symmetric scale, such as the estimator of the standard deviation or the estimator of the logarithm of the standard deviation, the REML estimator would no longer be unbiased.
Furthermore, this property of unbiasedness of variance estimators does not generalize from the linear regression model to linear mixed models.
This is all to say that the distinction between REML and ML estimates of variances and variance components is probably less important than many people believe.
Nevertheless it is worthwhile seeing how the computational techniques described in this chapter apply to the REML criterion because the REML parameter estimates $\widehat{\bbtheta}_R$ and $\widehat{\sigma_R^2}$ for a linear mixed model have the property that they would specialize to $\widehat{\sigma^2_R}$ from @eq-21 for a linear regression model, as seen in @sec-dyestuff2lmm.
Although not usually derived in this way, the REML criterion (on the deviance scale) can be expressed as
$$
d_R(\bbtheta,\sigma|\bby)=-2\log
\int_{\mathbb{R}^p}L(\bbtheta,\bbbeta,\sigma|\bby)\,d\bbbeta .
$$ {#eq-23}
The REML estimates $\widehat{\bbtheta}_R$ and $\widehat{\sigma_R^2}$
minimize $d_R(\bbtheta,\sigma|\bby)$.
To evaluate this integral we form an expansion, similar to @eq-likelihood, of $r^2_{\theta,\beta}$ about $\widehat{\bbbeta}_\theta$
$$
r^2_{\theta,\beta}=r^2_\theta+\|\bbR_{XX}(\bbbeta-\widehat{\bbbeta}_\theta)\|^2 ,
$$ {#eq-rsqbetathetaexp}
from which we can derive
$$
\int_{\mathbb{R}^p}\frac{\exp\left(-\frac{r^2_{\theta,\beta}}{2\sigma^2}\right)}
{(2\pi\sigma^2)^{n/2}|\bbR_{ZZ}|} \,d\bbbeta=
\frac{\exp\left(-\frac{r^2_\theta}{2\sigma^2}\right)}
{(2\pi\sigma^2)^{(n-p)/2}|\bbR_{ZZ}||\bbR_{XX}|}
$$ {#eq-betaintegral}
corresponding to a REML criterion on the deviance scale of
$$
d_R(\bbtheta,\sigma|\bby)=(n-p)\log(2\pi\sigma^2)+
2\log\left(|\bbR_{ZZ}||\bbR_{XX}|\right)+\frac{r^2_\theta}{\sigma^2} .
$$ {#eq-remldev}
Plugging in the conditional REML estimate, $\widehat{\sigma^2}_R=r^2_\theta/(n-p)$, provides the profiled REML criterion
$$
\tilde{d}_R(\bbtheta|\bby)=
2\log\left(|\bbR_{ZZ}||\bbR_{XX}|\right)+(n-p)
\left[1+\log\left(\frac{2\pi r^2_\theta}{n-p}\right)\right].
$$ {#eq-24}
The REML estimate of $\bbtheta$ is
$$
\widehat{\bbtheta}_R=\arg\min_{\bbtheta}\tilde{d}_R(\bbtheta|\bby) ,
$$ {#eq-31}
and the REML estimate of $\sigma^2$ is the conditional REML estimate of $\sigma^2$ at $\widehat{\bbtheta}_R$,
$$
\widehat{\sigma^2_R}=r^2_{\widehat\theta_R}/(n-p) .
$$ {#eq-remlsigmasq}
It is not entirely clear how one would define a "REML estimate" of $\bbbeta$
because the REML criterion, $d_R(\bbtheta,\sigma|\bby)$, defined in @eq-remldev, does not depend on $\bbbeta$.
However, it is customary (and not unreasonable) to use
$\widehat{\bbbeta}_R=\widehat{\bbbeta}_{\widehat{\bbtheta}_R}$ as
the REML estimate of $\bbbeta$.
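In the *MixedModels* package the REML criterion can be selected with the `REML` keyword argument to `fit`.
As a brief illustration with the `sleepstudy` data loaded above, we can compare the ML and REML estimates of the variance components.
```{julia}
fml = @formula(reaction ~ 1 + days + (1 + days | subj))
m_ml = fit(MixedModel, fml, sleepstudy)                # maximum likelihood fit
m_reml = fit(MixedModel, fml, sleepstudy; REML=true)   # REML fit
VarCorr(m_ml), VarCorr(m_reml)                         # compare the variance-component estimates
```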