Gradient not updating in call to optimize (Optim.jl)

I’m trying to use the optimize function as part of the Optim.jl package with gradients that I provide. The problem I am having is that the function isn’t actually iterating–it simply computes the objective function once and then claims success, i.e.,

res = optimize( Optim.only_fg!((F,G,𝜹)->fg!(F,G,𝜹,storage)), zeros(J),
          LBFGS(), Optim.Options(show_trace = true, iterations = 3))

results in

 Iter     Function value   Gradient norm
     0     6.931472e+00              NaN

So it appears that the G (the gradient) isn’t getting updated in the call to the fg! function I defined. I double-checked that I can successfully manually call the gradient function I have defined and get the correct values of the gradient vector returned. I’m new to Julia (coming from R) so the concept of values from variables getting updated inside functions without explicitly reassigning new values to the variable is a bit like black magic.

Below is my code for full reproducibility. Any help would be appreciated

using DataFrames
using Optim


function m(df::DataFrame, 𝜹::Vector{Float64})
        # Inputs
        #   df - DataFrame with each row a pairwise comparisons.
        #           jlo (Int64) - indexes the lower-indexed item in the comparison,
        #           jhi (Int64) - indexes the higher-indexed item in the comparison,
        #           y (Float64) - indicates whether the lower-indexed item was passed,
        #                          while the higher-indexed item was not passed.
        #   𝜹 - Vector{Float64} of item difficulties.
        # Output: ŷ the predicted probability that
        # Extract the relevant input/response information from the DataFrame
        jlo = df[:,:jlo]; jhi = df[:,:jhi];

        # For model identification, the first parameter is constrained to take on a value of zero
        𝜹c = [0; 𝜹]

        #Complute fij (like yhat) function (see Zwinderman, 1995, Eqns 1-3)
        ŷ = exp.(𝜹c[jlo]) ./ (exp.(𝜹c[jlo]) .+ exp.(𝜹c[jhi]))

        return ŷ # \hat{y}
end

function loss(ŷ::Vector{Float64},y::Vector{Float64})
    # Inputs: 
    #       - ŷ the predicted probabilities
    #       - y actual responses (binary)
    # Outputs: the negative psuedologlikelihood loss value
    𝓁oss = -1.0*sum(y.*log.(ŷ) + y.*log.(1.0 .- ŷ)) #negative to make minimization problem
    return 𝓁oss
end

function gradloss(ŷ::Vector{Float64}, y::Vector{Float64}, jlo::Vector{Int64}, jhi::Vector{Int64}, J::Int64)
    ∂𝜹c = zeros(J+1)
    for i = 1:length(y)
        jilo = jlo[i]
        jihi = jhi[i]

        ∂i = y[i]*(1.0-ŷ[i]) -  (1.0-y[i])*ŷ[i]

        ∂𝜹c[jilo] -= ∂i #flipped sign to make a minimization problem
        ∂𝜹c[jihi] += ∂i #flipped sign to make a minimization problem
    end
    return ∂𝜹c[2:(J+1)]
end

function fg!(F,G,𝜹,storage)
    #Retreive quantities out of storage 
    J = storage["J"]
    x = storage["x"]
    y = x[:,:y]
    jlo = x[:,:jlo]
    jhi = x[:,:jhi]

    #Do common calculations
    ŷ = m(x,𝜹)

    if G != nothing
        G = gradloss(ŷ,y,jlo,jhi,J)
    end
    if F != nothing
        return loss(ŷ,y)
    end
end


# Make toy data
x = DataFrame(y = [1.,1.,1.,0.,1.,1.],
              jlo = [1,1,1,2,2,3],
              jhi = [2,3,4,3,4,4])

J = 3
storage = Dict("x" => x, "J" => J)

Gtest = gradloss(m(x,zeros(J)),x[:,:y],x[:,:jlo],x[:,:jhi],storage["J"])
println(string("Gradient: ",Gtest)) #[-0.5,1.5,0]

res = optimize( Optim.only_fg!((F,G,𝜹)->fg!(F,G,𝜹,storage)), zeros(J),
          LBFGS(), Optim.Options(show_trace = true, iterations = 3))

The problem is you need to fill G with the gradient.
Basically, your function will either receive G = nothing, or G as a vector.
Think of a vector as like a bucket, a container that holds contents.

Think of it like this. If Optim wants gradients, it hands fg! a bucket (G) and expects fg! to fill that bucket with gradient values.
What your fg! is doing instead upon receiving the bucket is filling a different bucket instead.

Observe:

julia> g = [100.0, 200.0, 300.0];

julia> fg!(0,g,y,storage)
7.253886037615595

julia> g'
1×3 adjoint(::Vector{Float64}) with eltype Float64:
 100.0  200.0  300.0

The smallest change to your program to make it work is replacing G = gradloss(...), which replaces your local G with another one, is to instead copy the values gradloss returns into G with a .= (notice the dot):

julia> function fg!(F,G,𝜹,storage)
           #Retreive quantities out of storage
           J = storage["J"]
           x = storage["x"]
           y = x[:,:y]
           jlo = x[:,:jlo]
           jhi = x[:,:jhi]

           #Do common calculations
           ŷ = m(x,𝜹)

           if G != nothing
               G .= gradloss(ŷ,y,jlo,jhi,J)
           end
           if F != nothing
               return loss(ŷ,y)
           end
       end
fg! (generic function with 1 method)

julia> fg!(0,g,y,storage)
7.253886037615595

julia> g'
1×3 adjoint(::Vector{Float64}) with eltype Float64:
 1.06837  -0.730916  1.44107

Anyway, your gradients are wrong so this still won’t work.
Removing unnecessary type annotations, and then

julia> using ForwardDiff

julia> ForwardDiff.gradient(x -> fg!(0,nothing,x,storage), y)
3-element Vector{Float64}:
  0.7388941121629948
 -0.06398338695769656
 -0.11786820640342854

julia> g
3-element Vector{Float64}:
  1.0683716323419565
 -0.7309162697393076
  1.4410658967982857
5 Likes

Thank you so much @Elrod. The updating process makes much more sense.

There was, indeed, a mistake in my code. I improperly defined the cross-entropy loss function as

y[i]*(1.0-ŷ[i]) -  y[i]*ŷ[i]

It should be

y[i]*(1.0-ŷ[i]) -  (1.0-y[i])*ŷ[i]

Now the forward gradients match the analytic gradients:

println(string("Analytic gradient: ",Gtest))
# Analytic gradient: [0.5, -0.5, 1.5]

using ForwardDiff
Gforward = ForwardDiff.gradient(x -> fg!(0,nothing,x,storage), zeros(nparams))
println(string("AutoDiff (forward) gradient: ", Gforward ))
#AutoDiff (forward) gradient: [0.5, -0.5, 1.5]
1 Like