Gradient-based optimization when gradient is only a function of y = f(x)

Hi all, I am new to Julia, coming from MATLAB. I am writing optimization code that finds parameters (β) by minimizing, with Optim.jl, the squared difference between some already computed probabilities (s) and model-estimated probabilities (σ), which are a function of the parameters. The main part of the code (complete reproducible code further below) is:

f(β) = sum((compute_σ(β) - s).^2)
function g!(G, β)
    σ = compute_σ(β)
    Δ = compute_Δ(σ)
    G .= 2 .* Δ' * (σ - s)
    return G
end
res = optimize(f, g!, zeros(K), LBFGS())

Note, however, that compute_σ(β) is called in both f and g!. In practice, this means the gradient depends on β only through σ = compute_σ(β), and not on β directly, so it can be computed from quantities that f has already evaluated. In the real code, compute_σ(β) is a very expensive computation that I would prefer not to run twice. Is there a way to write the same code while avoiding computing it twice?
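To spell the structure out (same notation as the code below; this is just the chain rule for this objective):

$$
f(\beta) = \sum_j \big(\sigma_j(\beta) - s_j\big)^2, \qquad
\nabla f(\beta) = 2\,\Delta(\sigma)^\top (\sigma - s), \qquad
\Delta(\sigma) = \frac{\partial \sigma}{\partial \beta},
$$

and for the logit, Δ can be written purely in terms of σ and the data X (this is what compute_Δ does), so once σ is known the gradient requires no further call to compute_σ.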

The full code is below. For those interested, it implements a multinomial logit as in BLP (1995).

using Optim, Random, ForwardDiff, BenchmarkTools
Random.seed!(1234)

# Sum `val` into bins given by `ids` (like MATLAB's accumarray)
function accumarray!(out, ids, val)
    for i in eachindex(ids, val)
        out[ids[i]] += val[i]
    end
    return out
end
# Assumes `ids` is sorted, so ids[end] equals the number of bins
accumarray(ids, val) = accumarray!(zeros(eltype(val), ids[end]), ids, val)

# Multinomial logit choice probabilities, normalized within each market
function compute_σ(β)
    Xβ = X*β
    exp_Xβ = exp.(Xβ)
    sum_Xβ = accumarray(id_mkt, exp_Xβ)
    σ = exp_Xβ ./ sum_Xβ[id_mkt]
    return σ
end

# Jacobian Δ = ∂σ/∂β, written purely in terms of σ and the data X
function compute_Δ(σ)
    mult_mat = σ .* X

    deriv_wide = zeros(eltype(σ), maximum(id_mkt), size(X, 2))
    for k in axes(X, 2)
        deriv_wide[:, k] = accumarray(id_mkt, view(mult_mat, :, k))
    end

    Δ = mult_mat .- σ .* deriv_wide[id_mkt, :]
    return Δ
end

N = 1000; # number of mkts
J = 30; # number of products per market
K = 4; # number of product characteristics

# Define product characteristics (and normalize one product to 0)
X = randn(Float64, N*(J+1), K);
X[1:J+1:N*(J+1), :] .= 0.0;

# Define market ids (one entry per product row, J+1 products per market)
id_mkt = [i for i in 1:N for j in 1:(J+1)];

# Define truth to estimate
β_true = randn(K)
s = compute_σ(β_true)

# Define objective function
f(β) = sum((compute_σ(β) - s).^2)
function g!(G, β)
    σ = compute_σ(β)
    Δ = compute_Δ(σ)
    G .= 2 .* Δ' * (σ - s)
    return G
end

# Optimize
res = optimize(f, g!, zeros(K), LBFGS())

You can avoid calling compute_σ multiple times using memoization:

julia> function tricky_function(x)
           println("calling tricky function: $x")
           return exp.(x)
       end
tricky_function (generic function with 1 method)

julia> f(x) = sum(tricky_function(x).^2)
f (generic function with 1 method)

julia> function g!(G, x)
           G .= 2 .* tricky_function(x).^2
           return G
       end
g! (generic function with 1 method)

julia> cache = Dict{Vector{Float64},Vector{Float64}}()
Dict{Vector{Float64}, Vector{Float64}}()

julia> function cached_tricky_function(x)
           return get!(cache, x) do
               return tricky_function(x)
           end
       end
cached_tricky_function (generic function with 1 method)

julia> cached_f(x) = sum(cached_tricky_function(x).^2)
cached_f (generic function with 1 method)

julia> function cached_g!(G, x)
           G .= 2 .* cached_tricky_function(x).^2
           return G
       end
cached_g! (generic function with 1 method)

julia> empty!(cache)
Dict{Vector{Float64}, Vector{Float64}}()

julia> x = [1.0, 2.0]
2-element Vector{Float64}:
 1.0
 2.0

julia> G = [NaN, NaN]
2-element Vector{Float64}:
 NaN
 NaN

julia> for i in 1:2
           f(x)
           g!(G, x)
       end
calling tricky function: [1.0, 2.0]
calling tricky function: [1.0, 2.0]
calling tricky function: [1.0, 2.0]
calling tricky function: [1.0, 2.0]

julia> for _ in 1:2
           cached_f(x)
           cached_g!(G, x)
       end
calling tricky function: [1.0, 2.0]

julia> x = [3.0, 4.0]
2-element Vector{Float64}:
 3.0
 4.0

julia> for i in 1:2
           f(x)
           g!(G, x)
       end
calling tricky function: [3.0, 4.0]
calling tricky function: [3.0, 4.0]
calling tricky function: [3.0, 4.0]
calling tricky function: [3.0, 4.0]

julia> for _ in 1:2
           cached_f(x)
           cached_g!(G, x)
       end
calling tricky function: [3.0, 4.0]

julia> cached_f([1.0, 2.0])
61.987206132074895

julia> f([1.0, 2.0])
calling tricky function: [1.0, 2.0]
61.987206132074895

See also: Memoize.jl (https://github.com/JuliaCollections/Memoize.jl), which provides an @memoize macro for Julia.
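For instance, a minimal sketch (the wrapper name cached_compute_σ is made up here; @memoize Dict keys the cache on argument value, like the manual Dict above):

using Memoize

# Hypothetical wrapper around the expensive call; results cached per argument value
@memoize Dict function cached_compute_σ(β)
    return compute_σ(β)
end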

It’s a bit easier to use the Optim.only_fg! wrapper:
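A minimal sketch for the code above (following the fg!(F, G, x) convention from the Optim.jl documentation; untested):

function fg!(F, G, β)
    σ = compute_σ(β)              # expensive part, evaluated once per iterate
    if G !== nothing
        Δ = compute_Δ(σ)
        G .= 2 .* Δ' * (σ - s)
    end
    if F !== nothing
        return sum((σ - s).^2)
    end
end

res = optimize(Optim.only_fg!(fg!), zeros(K), LBFGS())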


Thanks for this! It will be useful for a different part of the project.