Metrum Research Group
- Swiss Fertility and Socioeconomic Indicators
- Optimization the R way
- Use other optimizers
- Maximum likelihood estimation
- Extended least squares
- Get standard error of estimate
- Plot predicted and observed values
- Let’s try it
library(tidyverse)
library(broom)
library(ggplot2)
theme_set(theme_bw())
We’ll get started on our introduction to optimization with an easy data set and an easy model.
Data were collected from 47 French-speaking provinces of Switzerland around 1888.
data(swiss)
glimpse(swiss, width = 60, strict.width="cut")
. Rows: 47
. Columns: 6
. $ Fertility <dbl> 80.2, 83.1, 92.5, 85.8, 76.9, 76…
. $ Agriculture <dbl> 17.0, 45.1, 39.7, 36.5, 43.5, 35…
. $ Examination <int> 15, 6, 5, 12, 17, 9, 16, 14, 12,…
. $ Education <int> 12, 9, 5, 7, 15, 7, 7, 8, 7, 13,…
. $ Catholic <dbl> 9.96, 84.84, 93.40, 33.77, 5.16,…
. $ Infant.Mortality <dbl> 22.2, 22.2, 20.2, 20.3, 20.6, 26…
ggplot(swiss, aes(Examination,Fertility)) + geom_point() + geom_smooth()
We’ll work on a regression model for the standardized fertility measure (the dependent variable) as a function of the Examination predictor (percent of draftees receiving the highest mark on the army examination).
Usually we’d fit this model in R like this:
fit <- lm(Fertility ~ Examination, swiss)
tidy(fit)
. # A tibble: 2 x 5
. term estimate std.error statistic p.value
. <chr> <dbl> <dbl> <dbl> <dbl>
. 1 (Intercept) 86.8 3.26 26.7 3.35e-29
. 2 Examination -1.01 0.178 -5.68 9.45e- 7
Using lm is the “right” way to model this data. We’re going to write some code that also gets parameters for this data, but we’ll put all of the pieces together ourselves. So this isn’t the proper way to get these parameter estimates. But this simple example will help us better understand the mechanics of parameter optimization in R.
We’ll need
- A model
  - The model generates the data (dependent variable) based on parameters and predictors
- Some data
  - Using swiss for now
- An objective function
  - To quantify how consistent a set of parameters is with the observed data
- An optimizer
  - To search the parameter space for the optimal parameter values
The parameters are intercept and slope, and the predictor is ex … in our example, Examination.
This is pretty simple stuff. But we’ll wrap it up in a function to call like this:
linear_model <- function(intercept, slope, ex) {
intercept + slope * ex
}
So we can get the predicted Fertility, E(y|x), by passing in the intercept, slope, and the Examination value:
linear_model(intercept = 90, slope = -2, ex = 20)
. [1] 50
How you want to set this up is a bit dependent on your application. I’m going to get vectors called ex for the Examination value (the predictor) and fer for Fertility (the dependent variable).
ex <- swiss[["Examination"]]
fer <- swiss[["Fertility"]]
We’ll write this function so that
- The first argument is par, the parameters we want to evaluate
- par will be a vector of length 2: the intercept and the slope
- We will also pass in the predictor (ex) and the data (fer), which we’ll need to calculate the objective function
ofv <- function(par, ex, fer) {
  fer_hat <- linear_model(par[1], par[2], ex)  # predicted Fertility at these parameters
  sum((fer - fer_hat)^2)                       # sum of squared residuals
}
This is an ordinary least squares (OLS) objective function.
Working backward:
- We return the sum of squared differences between the predicted values (fer_hat) and the data
- We generate the predicted values from our linear model function, the proposed parameters, and the predictor
- The optimizer will propose a set of parameters for us to evaluate
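As a quick sanity check (my addition, not part of the original walk-through): evaluated at the coefficients from the lm fit above, this objective function should reproduce lm’s residual sum of squares.
ofv(coef(fit), ex, fer)  # our objective at lm's coefficient estimates (fit is the lm object from above)
sum(residuals(fit)^2)    # residual sum of squares reported by lm; should match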
Let’s test the objective function
theta <- c(70, -2)
ofv(theta, ex, fer)
. [1] 58605.31
Good or bad? Looking back at the data, the intercept doesn’t look like it is 70 … more like 80. Let’s try that:
theta <- c(80,-2)
ofv(theta, ex, fer)
. [1] 32171.31
Ok the objective function is lower now. The second set of parameters we tried looks better than the first set.
What about slope?
theta <- c(80,-1.5)
ofv(theta, ex, fer)
. [1] 15284.46
This is even better. But we can’t keep going like this.
Let’s do this for a big batch of parameters
- intercept from 75 to 95
- slope from -2 to 0
test <- expand.grid(intercept = seq(75,95,1), slope = seq(-2,0,0.1))
head(test)
. intercept slope
. 1 75 -2
. 2 76 -2
. 3 77 -2
. 4 78 -2
. 5 79 -2
. 6 80 -2
Now calculate the value of the objective function for each parameter set
test <- mutate(
test,
value = pmap_dbl(test, .f=function(intercept,slope) {
ofv(c(intercept,slope), ex = ex, fer = fer)
})
)
arrange(test,value) %>% head
. intercept slope value
. 1 87 -1.0 4190.31
. 2 86 -1.0 4202.71
. 3 88 -1.1 4210.30
. 4 85 -0.9 4219.86
. 5 89 -1.1 4230.90
. 6 84 -0.9 4265.26
ggplot(test) + geom_contour(aes(intercept,slope,z=value),bins=80)
We know there is a set of parameters that gives us the smallest value of the objective function and is therefore the “optimal” set of parameters.
We invoke an optimizer in R to search the parameter space and find that set of parameters.
Start with an optimizer that comes with R in the stats package. By default, optim uses the Nelder-Mead optimization algorithm.
When we call optim, we have to give an initial guess (par) and the function to minimize (ofv). We also pass in the predictor and the vector of observed data so we can calculate the sum of squares.
fit <- optim(c(100,1), ofv, ex = ex, fer = fer)
fit$par
. [1] 86.822915 -1.011682
lm(Fertility~Examination, swiss)
.
. Call:
. lm(formula = Fertility ~ Examination, data = swiss)
.
. Coefficients:
. (Intercept) Examination
. 86.819 -1.011
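Beyond the parameter estimates, it can be worth inspecting a couple of other elements that optim returns (my addition): value is the objective function at the optimum, and convergence should be 0 when the algorithm reports success.
fit$value        # minimized sum of squares
fit$convergence  # 0 indicates successful convergence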
Now let’s try another optimizer: newuoa from the minqa package, a derivative-free algorithm. The call looks much the same; the iprint control option turns on printed progress from the optimizer.
library(minqa)
fit <- newuoa(theta, ofv, ex = ex, fer = fer, control = list(iprint=20))
. npt = 4 , n = 2
. rhobeg = 0.95 , rhoend = 9.5e-07
. start par. = 80 -1.5 fn = 15284.46
. rho: 0.095 eval: 5 fn: 4835.84 par: 80.0000 -0.550000
. 20: 4319.4571: 83.2325 -0.872836
. rho: 0.0095 eval: 24 fn: 4250.71 par: 84.0677 -0.868220
. 40: 4188.6677: 86.0555 -0.973636
. rho: 0.00095 eval: 55 fn: 4184.30 par: 86.6095 -1.00569
. 60: 4183.8628: 86.6348 -1.00216
. rho: 9.5e-05 eval: 74 fn: 4183.57 par: 86.8189 -1.01133
. rho: 9.5e-06 eval: 78 fn: 4183.57 par: 86.8185 -1.01131
. 80: 4183.5671: 86.8185 -1.01132
. rho: 9.5e-07 eval: 81 fn: 4183.57 par: 86.8185 -1.01131
. At return
. eval: 89 fn: 4183.5671 par: 86.8185 -1.01132
fit$par
. [1] 86.818529 -1.011317
Differential evolution algorithm
The DEoptim package implements differential evolution, a global, population-based search. Rather than a single starting guess, we give lower and upper bounds for each parameter and let the algorithm explore within them.
library(DEoptim)
lower <- c(intercept=0, slope=-100)
upper <- c(intercept = 1000, slope=100)
con <- DEoptim.control(itermax = 80, trace = 2)
set.seed(112233)
fit <- DEoptim(ofv, lower, upper, ex = ex, fer = fer, control = con)
. Iteration: 2 bestvalit: 318953.193160 bestmemit: 265.976666 -8.389177
. Iteration: 4 bestvalit: 150708.467549 bestmemit: 213.881420 -7.849656
. Iteration: 6 bestvalit: 67109.936741 bestmemit: 150.706030 -2.848407
. Iteration: 8 bestvalit: 20714.032107 bestmemit: 90.987682 -0.195885
. Iteration: 10 bestvalit: 20714.032107 bestmemit: 90.987682 -0.195885
. Iteration: 12 bestvalit: 4444.885030 bestmemit: 90.987682 -1.133708
. Iteration: 14 bestvalit: 4444.885030 bestmemit: 90.987682 -1.133708
. Iteration: 16 bestvalit: 4444.885030 bestmemit: 90.987682 -1.133708
. Iteration: 18 bestvalit: 4444.885030 bestmemit: 90.987682 -1.133708
. Iteration: 20 bestvalit: 4214.123010 bestmemit: 85.878019 -0.926806
. Iteration: 22 bestvalit: 4214.123010 bestmemit: 85.878019 -0.926806
. Iteration: 24 bestvalit: 4214.123010 bestmemit: 85.878019 -0.926806
. Iteration: 26 bestvalit: 4214.123010 bestmemit: 85.878019 -0.926806
. Iteration: 28 bestvalit: 4214.123010 bestmemit: 85.878019 -0.926806
. Iteration: 30 bestvalit: 4194.444474 bestmemit: 87.819083 -1.072268
. Iteration: 32 bestvalit: 4183.716589 bestmemit: 86.949012 -1.017892
. Iteration: 34 bestvalit: 4183.716589 bestmemit: 86.949012 -1.017892
. Iteration: 36 bestvalit: 4183.716589 bestmemit: 86.949012 -1.017892
. Iteration: 38 bestvalit: 4183.716589 bestmemit: 86.949012 -1.017892
. Iteration: 40 bestvalit: 4183.716589 bestmemit: 86.949012 -1.017892
. Iteration: 42 bestvalit: 4183.687877 bestmemit: 86.935280 -1.017368
. Iteration: 44 bestvalit: 4183.599647 bestmemit: 86.759307 -1.008059
. Iteration: 46 bestvalit: 4183.594865 bestmemit: 86.768295 -1.008241
. Iteration: 48 bestvalit: 4183.594865 bestmemit: 86.768295 -1.008241
. Iteration: 50 bestvalit: 4183.567802 bestmemit: 86.823236 -1.011722
. Iteration: 52 bestvalit: 4183.567474 bestmemit: 86.813777 -1.011176
. Iteration: 54 bestvalit: 4183.567474 bestmemit: 86.813777 -1.011176
. Iteration: 56 bestvalit: 4183.567474 bestmemit: 86.813777 -1.011176
. Iteration: 58 bestvalit: 4183.567347 bestmemit: 86.817505 -1.011155
. Iteration: 60 bestvalit: 4183.567254 bestmemit: 86.814953 -1.011134
. Iteration: 62 bestvalit: 4183.567149 bestmemit: 86.819223 -1.011336
. Iteration: 64 bestvalit: 4183.567149 bestmemit: 86.819223 -1.011336
. Iteration: 66 bestvalit: 4183.567149 bestmemit: 86.819223 -1.011336
. Iteration: 68 bestvalit: 4183.567149 bestmemit: 86.819223 -1.011336
. Iteration: 70 bestvalit: 4183.567146 bestmemit: 86.817748 -1.011278
. Iteration: 72 bestvalit: 4183.567146 bestmemit: 86.817748 -1.011278
. Iteration: 74 bestvalit: 4183.567144 bestmemit: 86.818997 -1.011346
. Iteration: 76 bestvalit: 4183.567143 bestmemit: 86.818129 -1.011302
. Iteration: 78 bestvalit: 4183.567141 bestmemit: 86.818502 -1.011316
. Iteration: 80 bestvalit: 4183.567141 bestmemit: 86.818502 -1.011316
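DEoptim packages its results a little differently from optim and newuoa; as best I recall (this is my addition, worth checking against the package documentation), the best parameter set and objective value live under the optim element of the returned object.
fit$optim$bestmem  # best parameter set found by the search
fit$optim$bestval  # objective function value at that parameter set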
Let’s write a new (R) function where we optimize based on a normal likelihood function.
The arguments are the same as for the OLS function. Now, rather than comparing predictions against data using the sum of squares, we compare them based on the normal likelihood function.
ml <- function(p, ex, fer) {
fer_hat <- linear_model(p[1], p[2], ex)
like <- dnorm(fer, fer_hat, p[3], log = TRUE)
-1*sum(like)
}
Note
- We have an extra parameter now … the standard deviation for the likelihood function
- We use log=TRUE to get the log-likelihood; then the joint log-likelihood of all the data is the sum of the individual log-likelihoods
- We return minus 1 times the log-likelihood; we are doing maximum likelihood, but the optimizers find the minimum of a function
Test the function now
theta <- c(intercept = 10, slope = 1, sd = 2)
ml(theta, ex, fer)
. [1] 13274.61
And we get the same answer for the intercept and slope
fit <- newuoa(theta, ml, ex = ex, fer = fer)
fit$par
. [1] 86.818530 -1.011317 9.434621
fit$fval
. [1] 172.1763
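As an informal cross-check (my addition): for a linear model with normal errors, logLik() on the lm fit plugs in the maximum likelihood estimate of the residual variance, so minus its value should agree closely with the minimum we just found.
-as.numeric(logLik(lm(Fertility ~ Examination, data = swiss)))  # compare with fit$fval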
Now let’s try an extended least squares (ELS) objective function. It looks similar to the maximum likelihood version, but here the third parameter is the residual variance rather than the standard deviation.
els <- function(p, ex, fer) {
  fer_hat <- linear_model(p[1], p[2], ex)
  0.5 * sum((fer - fer_hat)^2/p[3] + log(p[3]))  # p[3] is the variance
}
fit.els <- newuoa(theta, els, ex = ex, fer = fer)
fit.els$par
. [1] 86.818523 -1.011317 89.012071
fit.els$fval
. [1] 128.9861
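A couple of quick comparisons against the maximum likelihood fit (my addition, under my reading that this ELS objective is the negative normal log-likelihood with the constant (n/2) * log(2 * pi) dropped, and with the variance rather than the standard deviation as the third parameter):
fit.els$par[3]               # ELS estimate of the residual variance
fit$par[3]^2                 # square of the ML standard deviation estimate; should be similar
fit$fval - fit.els$fval      # difference between the two minimum objective values ...
length(fer)/2 * log(2 * pi)  # ... should be close to the dropped constant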
We use numDeriv::hessian to get the Hessian of the (negative log) likelihood function at the final parameter estimates
library(numDeriv)
he <- hessian(ml, fit$par, ex = ex, fer = fer)
he
. [,1] [,2] [,3]
. [1,] 5.280182e-01 8.706684e+00 -4.878745e-09
. [2,] 8.706684e+00 1.764592e+02 2.751098e-07
. [3,] -4.878745e-09 2.751098e-07 1.056036e+00
To derive the standard errors
- Invert the Hessian matrix
- Get the diagonal elements
- Take the square root
he %>% solve() %>% diag() %>% sqrt()
. [1] 3.1875394 0.1743644 0.9731070
And compare against the answer we got from lm
lm(Fertility ~ Examination, data = swiss) %>%
tidy() %>%
pull(std.error)
. [1] 3.2576034 0.1781971
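The small difference is expected (my note, not from the original): lm scales the residual variance by n - 2, while the maximum likelihood estimate scales by n, so rescaling the likelihood-based standard errors by sqrt(n / (n - 2)) should bring them close to the lm values (se_ml and n below are just scratch names).
se_ml <- he %>% solve() %>% diag() %>% sqrt()  # likelihood-based standard errors from above
n <- length(fer)
se_ml[1:2] * sqrt(n / (n - 2))                 # rescaled to the lm convention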
You can also try nlme::fdHess
library(nlme)
he <- fdHess(pars = fit$par, fun = ml, ex = ex, fer = fer)
he$Hessian %>% solve() %>% diag() %>% sqrt()
. [1] 3.1875681 0.1743662 0.9731027
In my experience, it is frequently necessary to just bootstrap the data set to get standard errors. We will look at likelihood profiling in a separate vignette.
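Here is a minimal sketch of what that bootstrap might look like (my code, not the author’s); the 200 replicates, the seed, and the choice to restart newuoa from the fitted parameters are all assumptions, and a real analysis would likely use more replicates.
set.seed(20221)  # arbitrary seed, for reproducibility only
boot_par <- map(seq_len(200), function(i) {
  idx <- sample(seq_along(ex), replace = TRUE)            # resample rows with replacement
  newuoa(fit$par, ml, ex = ex[idx], fer = fer[idx])$par   # refit to the resampled data
})
do.call(rbind, boot_par) %>% apply(2, sd)                 # bootstrap standard errors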
Take the final parameter estimates
fit$par
. [1] 86.818530 -1.011317 9.434621
and pass them into our linear_model
to generate predicted values.
data <- tibble(
ex = ex,
fer = fer,
pred = linear_model(fit$par[1], fit$par[2], ex)
)
ggplot(data = data) +
geom_point(aes(x = ex, y = fer)) +
geom_line(aes(x = ex, y = pred), lwd = 2, col="red3")
Now let’s try it with another data set, with a response measured over a range of auc values.
data <- readRDS("data/pdfit.RDS")
head(data)
. # A tibble: 6 x 2
. auc response
. <dbl> <dbl>
. 1 66.3 109.
. 2 22.9 66.3
. 3 22.2 75.9
. 4 81.4 116.
. 5 109. 129.
. 6 98.6 119.
ggplot(data, aes(auc,response)) + geom_point()