How can I improve code that estimates parameters using maximum likelihood, Optim, differential equations, and automatic differentiation?

I’m trying to estimate the parameters of a model given as a differential equation. I created three series of data, and I want the initial condition of the differential equation for each series to be an additional parameter to be estimated.

I wrote code that does this in Julia, and I want to use automatic differentiation because I’ll need it when I have many more series (~100 instead of 3). I’ve done this in R (without autodiff), and I was hoping Julia would be faster by providing the gradient. Nevertheless, I can still do this faster in R, even without using the gradient. Why is my code not efficient?

This is the reproducible example:

using DifferentialEquations 
using Optim 
using ForwardDiff
using DataFrames
using Distributions
using SparseArrays

#Data

AGE = [1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0,
       1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.2,
       1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.5]
HD  = [ 0.04,  1.51,  3.19,  6.55,  7.92, 10.33, 12.66, 14.91, 15.41, 18.81,
        0.04,  1.81,  3.39,  6.05,  7.02, 10.63, 12.86, 15.51, 15.91, 18.02,
        0.04,  1.31,  3.09,  7.15,  8.20, 11.33, 12.36, 14.21, 15.01, 19.21]

df = DataFrame(Age = AGE, HDom = HD, Plot = repeat([1,2,3], inner=10))

I create the following dummy (indicator) matrix so I can assign the initial value to each series (called Plot) later:

function make_dummy(data::AbstractVector)
    udata = unique(data)
    inds  = map(x -> findfirst(==(x), udata), data)           # column index of each observation's group
    sparse(1:length(data), inds, ones(Int, length(data)))     # one row per observation, one column per group
end
nplot         = length(unique(df.Plot))
Mtx           = make_dummy(df.Plot)
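
For the three plots here, Mtx is a 30×3 indicator matrix, so multiplying it by a vector of three plot-level values repeats each value over the rows of its plot. A quick check with arbitrary values:

# each row of Mtx has a single 1 in the column of its plot
Mtx * [0.1, 0.2, 0.3]   # 30-element vector: 0.1 for plot 1, 0.2 for plot 2, 0.3 for plot 3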

Then, the differential equation:

function f(h,p,t)
    # dh/dt = p1 * h / t^(1 + p2)
    p[1]*h[1]*(1/t^(1+p[2]))
end

function DEsol_l(Age,hini,parms)
    time_int = (Age[1], Age[end])                   # integrate from the first to the last age
    prob     = ODEProblem(f, hini, time_int, parms)
    sol      = solve(prob, saveat = Age)            # save the solution at the observed ages
    return sol.u
end
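
As a standalone check, the solver can be called for a single plot, using that plot's smallest observed height as the initial condition and the same starting guesses for the two rate parameters that I use below (these are just guesses, not fitted values):

age1 = df.Age[df.Plot .== 1]        # ages of plot 1
DEsol_l(age1, 0.04, [8.0, 0.9])     # predicted heights at each age of plot 1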

Then, the likelihood function:

function loglik3l(θ)

    sigma         = exp(θ[nplot+3])                 # residual standard deviation (estimated on the log scale)
    df[!, "hini"] = Mtx*θ[1:nplot]                  # expand the per-plot initial heights to every row
    hpred         = combine(groupby(df, [:Plot]), df -> DEsol_l(df.Age, df.hini[1], θ[(nplot+1):(nplot+2)]))
    residual      = hpred[:,2] .- df.HDom
    result        = sum(logpdf.(Normal(0, sigma), residual))

    return -result

end

Then I create a vector of initial values from the data:

inits3l = combine(groupby(df, :Plot), :HDom => minimum)
inits3l = inits3l[!,2]
inits3l = [inits3l; 8; 0.9; -0.5]
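
A quick sanity check (not part of the fit) is to evaluate the objective once at the starting values and confirm it returns a finite number:

loglik3l(inits3l)   # negative log-likelihood at the starting values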

Use automatic differentiation to provide the gradient of the likelihood function:

od3l             = OnceDifferentiable(loglik3l, inits3l; autodiff= :forward);
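
Since ForwardDiff has to push dual numbers through the entire ODE solve, it can also be worth evaluating the gradient once at the start; if this call errors or returns NaNs, the optimizer will struggle (just a check, not needed for the fit):

ForwardDiff.gradient(loglik3l, inits3l)   # gradient of the negative log-likelihood at the starting values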

And run it:

result_ad3l = optimize(od3l, inits3l, Newton())

I get good results here:

0.344979 seconds (1.20 M allocations: 93.444 MiB)
 * Status: success

 * Candidate solution
    Final objective value:     2.606498e+01

 * Found with
    Algorithm:     Newton's Method

But when I use more data (my real data, with ~80 series), it takes 1.3 hours to run, while I can do the same in R in ~20 minutes.

What can be improved?

#----R Code below for comparison

library(data.table)   # setDT() and data.table syntax
library(deSolve)      # rk4()
library(optimx)       # optimr()

AGE <- c(1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0,
       1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.2,
       1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.5)

HD  <- c( 0.04,  1.51,  3.19,  6.55,  7.92,10.33, 12.66, 14.91, 15.41, 18.81,
        0.04,  1.81,  3.39,  6.05,  7.02, 10.63, 12.86, 15.51, 15.91, 18.02,
        0.04,  1.31,  3.09,  7.15,  8.2, 11.33, 12.36, 14.21, 15.01, 19.21)

df <- data.frame(Age = AGE, HDom = HD, Plot =rep(c(1,2,3), each=10))
setDT(df)

DummyMtxT <- function(TRT)
{
  uplot <- unique(TRT)
  do.call("rbind", lapply(TRT, function(x) as.numeric(uplot == x)))
}

nplot <- length(unique(df$Plot))
Mtx   <- DummyMtxT(df$Plot)

fx1 <- function(t,h,p)
{
  p1  <- p['p1'] 
  p2  <- p['p2']

  hdot = p1*h*(1/t^(1+p2))
    
  return(list(H = hdot))
}

solveDE1 <- function(y, times, parms)
{
  res <- rk4(y     = y,
             times = times,
             parms = parms,
             func  = fx1)
  return(list(time = res[,1], Height = res[,2]))
}


log.lik1 <- function(params, data)
{
  data$hinits <- Mtx %*% params[1:ncol(Mtx)]

  model <- data[, solveDE1(y     = min(hinits),
                           times = Age,
                           parms = params),
                by = Plot]

  value <- sum(dnorm(model[[3]], data$HDom, exp(params["sds"]), log = TRUE))

  return(-value)
}


h0        <- df[,min(HDom), by =  Plot]
inits     <- c(h0 = h0[[2]],p1=8,p2=0.9,sds=-0.5)


mod1      <- with(df,
                  optimr(par      = inits,
                         fn       = log.lik1,
                         data     = df,
                         method   = "Nelder-Mead",
                         control  = list(maxit=10000)))

mod1
mod1$par

Can you post the R code? It’s often much easier to see what’s going wrong when you have the faster code to compare.

I added the R code for the same example; I hadn’t included it before. I also noticed that I get slightly different values for the parameters. I know I changed the optimization method, but that doesn’t seem to be the problem.

Hessian-based optimization is expensive on large systems, especially Newton. Did you try BFGS instead?

The R code is using a derivative-free method, so its convergence is slow, but running 10,000 iterations is cheap since there are no derivatives to calculate. You may want to start with Nelder-Mead in Julia as well, and then switch to BFGS.
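
On your toy example that could look roughly like this (an untested sketch; the variable names are just placeholders):

res_nm   = optimize(loglik3l, inits3l, NelderMead(), Optim.Options(iterations = 10_000))
res_bfgs = optimize(od3l, Optim.minimizer(res_nm), BFGS())   # refine the Nelder-Mead solution using the gradient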

I’ve tried several methods in both languages; I often run several of them to see whether I get consistent estimates of the parameters. Doesn’t BFGS also use the Hessian? I used Nelder-Mead for the real data, and that was when it took 1.3 hours.

If I use BFGS I get this error message:

“Warning: Automatic dt set the starting dt as NaN, causing instability. Exiting.”

It uses an approximate inverse Hessian, which isn’t as expensive to compute as the full Hessian.

That’s a warning, so it shouldn’t be an issue in itself; it just means that the current parameter values cause problems for the differential equation solver. Does the optimization not continue after that?
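
If it does stop, one common workaround (just a sketch, not tested on your data) is to return a large penalty from the solver wrapper whenever the solve fails, so the optimizer backs away from those parameter values instead of crashing:

# hypothetical replacement for DEsol_l that guards against failed solves
function DEsol_safe(Age, hini, parms)
    prob = ODEProblem(f, hini, (Age[1], Age[end]), parms)
    sol  = solve(prob, saveat = Age)
    if sol.retcode != :Success
        # solver bailed out for these parameters: return huge "predictions"
        # so the log-likelihood is terrible and the optimizer moves away
        return fill(oftype(hini, 1e6), length(Age))
    end
    return sol.u
end

and then call DEsol_safe instead of DEsol_l inside loglik3l.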

No, it doesn’t continue.

I just created another question. I modified the code and it is now more efficient in both languages, although I still have the same issue of faster convergence in R.

See:
Similar question