Improve performance of a code with very large % of Garbage Collection

jmcastro2109 · January 31, 2022, 10:28pm

Hello, I am trying to squeeze performance of the following code as much as possible. When I benchmark it, I get a very large number for the garbage collection statistic (see below). Does anyone have any recommendation for addressing this issue?

using JuMP, Ipopt, LinearAlgebra, Random, Distributions, BenchmarkTools, MosekTools, NLopt, Plots, FLoops

Threads.nthreads()

const N = 100
const Ι = 250
const J = 50
const σ = 5
const α = 0.15
const S= 100


Random.seed!(2109)
zbar = rand(Ι)
variance = 2.5

Random.seed!(12345)
z= exp.(rand(Normal(-variance/2,sqrt(variance)),Ι,S)) .+ zbar


Random.seed!(123)

β = ones(J)


function distmat(J,Ι)
    Distances = zeros(J,Ι)
    Random.seed!(7777)
    coordinates_market = hcat([x for x = 1:J],fill(0.0,J))
    coordinates_plant = hcat([x for x = 1:Ι].*J/Ι,fill(0.0,Ι))
    for j = 1:J, l=1:Ι
        Distances[j,l] = sqrt((coordinates_market[j,1]-coordinates_plant[l,1])^2+(coordinates_market[j,2]-coordinates_plant[l,2])^2)
    end
    return 1 .+ Distances
end

const τ = distmat(J,Ι)


function f2(Cap,β,τ,z)
    (J,Ι) = size(τ)
    opt = JuMP.Model(optimizer_with_attributes(Mosek.Optimizer, "QUIET" => false, "INTPNT_CO_TOL_DFEAS" => 1e-7))
    set_silent(opt)
    mktsize = ((σ-1)/σ)^(σ-1)/(σ).*β.^(σ)
    @variable(opt, ν[1:Ι] >= 0)
    @variable(opt, μ[1:J]>= 0)
    @variable(opt, t[1:J]>= 0)
    @objective(opt, Min, ν'*Cap .+ mktsize'*t)

    @inbounds for i=1:Ι, j=1:J
		@constraint(opt, τ[j,i]/z[i] + τ[j,i]*ν[i] - μ[j] >= 0)
	end   

    @inbounds for j = 1 : J
        @constraint(opt, [t[j],μ[j],1] in MOI.PowerCone(1/σ))
    end    

    JuMP.optimize!(opt)

    #return objective_value(opt), value.(μ), value.(ν)

    return objective_value(opt), value.(ν)[1:Ι], value.(μ)[1:J]

end 

@views function f1(Cap,β,τ,z)
    (J,Ι) = size(τ)
    Simul = size(z)[2]
        @inbounds @floop for s = 1 : Simul
            temp = f2(Cap,β,τ,z[:,s])./Simul
            @reduce( Eπ = zero(eltype(temp[1])) + temp[1] )
            @reduce( Eν = zeros(Ι) + temp[2] )
        end  
    func = Eπ .- α*sum(Cap) 
    foc = Eν .- α
    return func, foc

end    


function myfunc_floops(x::Vector, grad::Vector)
    temp = f1(x,β,τ,z)
    if length(grad) > 0
        grad[:]=-temp[2]
    end
    return -temp[1]
end

function fopt(guess)
    opt = Opt(:LD_MMA, Ι)
    opt.lower_bounds = zeros(Ι)
    opt.xtol_rel = 1e-3
    opt.ftol_rel = 1e-8

    opt.min_objective = myfunc_floops

    (minf,minx,ret) = NLopt.optimize(opt, guess)
    numevals = opt.numevals # the number of function evaluations
    println("got $minf at $minx after $numevals iterations in (returned $ret)")

    return minx

end  

sol = @benchmark fopt(ones(Ι))
sol

with the following benchmark results

Single result which took 147.664 s (18.66% GC) to evaluate,
 with a memory estimate of 172.62 GiB, over 2295693135 allocations.

odow · February 1, 2022, 12:37am

Just to clarify:

You want to optimize using NLopt:
- A function myfunc_floops
  - That calls f1 which repeatedly calls f2
    - Which formulates and solves an optimization model using Mosek.

That’s going to involve a lot of calls to f1, which creates and destroys a lot of JuMP models. That’s the reason for your GC issue.

As recommended here:

did you try creating a single JuMP model and calling fix on it?

Have you tried creating a single large JuMP model instead of the loop over z[:,s]? You could potentially find all solutions for Simul simultaneously a single large Mosek model.

jmcastro2109 · February 1, 2022, 1:14am

Hi @odow , yes you are correct with what I am trying to do.

Answering your questions:

Yes, I tested your solution but got discouraged by the fact that I cannot parallelize it as @mtanneau pointed out.
That is interesting. Let me try to see if I understand your suggestion clearly.

I fix the vector Cap and I have S independent subproblems.
Solve for a matrix I x S and J x S of \nu and \mu, enforcing the constraints of each s.
Here, is where I am a bit lost. Since all the problems are independent of each other, maximizing them independently is the same as maximizing a linear function of them, in this case, the average.

Is this so? If yes, what is the correct way to parallelize this?

odow · February 1, 2022, 1:17am

Since all the problems are independent of each other, maximizing them independently is the same as maximizing a linear function of them, in this case, the average.

Yes. The objective would just be the sum of the individual objectives.

jmcastro2109 · February 1, 2022, 1:17am

I’ll give this a shot. Can I parallelize something here? Or how can I exploit multithreading?

odow · February 1, 2022, 1:19am

Ah, I didn’t realize your @floop. Did parallelizing make a difference? How much memory do you have?

jmcastro2109 · February 1, 2022, 1:20am

At this point, it reduces computation time in around 60%. I can run this in a big HPC server, so memory should not be a constraint.

odow · February 1, 2022, 1:21am

Follow-up questions:

Is this the final model?
Why does it need to be fast?
Are you going bigger? How much bigger?

jmcastro2109 · February 1, 2022, 1:23am

Yes this is the final model!
Because I need to solve it many times to estimate the parameters of the model.
I do not think I am going bigger at least on I and J. I do not think I am going to use a much larger S either in the bulk of the computations. Certainly S won’t be bigger than 1000.

tbeason · February 1, 2022, 1:34am

you shush with that kind of talk around here!

jmcastro2109 · February 1, 2022, 1:36am

Here is my take at your proposed solution!


function f2_alt(Cap,β,τ,z)
    (J,Ι) = size(τ)
    S = size(z)[2]
    opt = JuMP.Model(optimizer_with_attributes(Mosek.Optimizer, "QUIET" => false, "INTPNT_CO_TOL_DFEAS" => 1e-7))
    set_silent(opt)
    mktsize = ((σ-1)/σ)^(σ-1)/(σ).*β.^(σ)
    @variable(opt, ν[1:Ι,1:S] >= 0)
    @variable(opt, μ[1:J,1:S]>= 0)
    @variable(opt, t[1:J,1:S]>= 0)

    @objective(opt, 
					Min, 
					1/S*sum(ν[:,s]'*Cap .+ mktsize'*t[:,s] for s=1:S))

    @inbounds for s=1:S, i=1:Ι, j=1:J
		@constraint(opt, τ[j,i]/z[i,s] + τ[j,i]*ν[i,s] - μ[j,s] >= 0)
	end   

    @inbounds for s=1:S,j = 1 : J
        @constraint(opt, [t[j,s],μ[j,s],1] in MOI.PowerCone(1/σ))
    end    

    JuMP.optimize!(opt)

        return objective_value(opt) - α*sum(Cap) , mean(value.(ν)[1:Ι],dims=2).- α, mean(value.(μ)[1:J],dims=2)

end

It is a just a bit slower that the current code

@btime f1(ones(Ι),β,τ,z)
6.471 s (49906817 allocations: 3.76 GiB)

@btime f2_alt(ones(Ι),β,τ,z)
10.247 s (49637416 allocations: 3.52 GiB)

I am sure that if it was possible to parallelize something. Your proposal would be lightning fast.

odow · February 1, 2022, 2:16am

Have you considered doing something simpler like this? I get a nice speed-up as S increases. (I might have made a typo somewhere, so check for bugs.)

You can start Julia with multiple procs using julia -p N. You should update the /tmp/discourse directory as appropriate. In addition, I’m using SCS instead of Mosek, and on a custom branch to avoid https://github.com/jump-dev/SCS.jl/pull/244. But things should work if you swap in Mosek.Optimizer.

using Distributed

@everywhere begin
    import Pkg
    Pkg.activate("/tmp/discourse")
end

@everywhere begin
    using JuMP
    import Distributions
    import SCS
    import Random
end

@everywhere struct Data
    I::Int
    J::Int
    S::Int
    σ::Int
    α::Float64
    z::Matrix{Float64}
    β::Vector{Float64}
    τ::Matrix{Float64}
end

@everywhere function f2(data::Data, Cap::Vector{Float64}, s::Int)
    opt = Model(SCS.Optimizer)
    set_silent(opt)
    σ′ = ((data.σ - 1) / data.σ)^(data.σ - 1) / data.σ
    @variable(opt, ν[1:data.I] >= 0)
    @variable(opt, μ[1:data.J] >= 0)
    @variable(opt, t[1:data.J] >= 0)
    @objective(
        opt,
        Min,
        sum(Cap[i] * ν[i] for i in 1:data.I) +
        sum(σ′ * data.β[j]^data.σ * t[j] for j in 1:data.J),
    )
    @constraint(
        opt,
        [i = 1:data.I, j = 1:data.J],
        data.τ[j, i] * ν[i] - μ[j] >= -data.τ[j, i] / data.z[i, s],
    )
    @constraint(
        opt,
        [j = 1:data.J],
        [t[j], μ[j], 1.0] in MOI.PowerCone(1 / data.σ)
    )
    optimize!(opt)
    return objective_value(opt), value.(ν)
end

function generate_data(; I = 250, J = 50, S = 100)
    σ = 5
    α = 0.15
    Random.seed!(12345)
    z = exp.(rand(Distributions.Normal(-2.5 / 2, sqrt(2.5)), I, S)) .+ rand(I)
    β = ones(J)
    function distmat(J, I)
        coordinates_market = hcat(1:J, zeros(J))
        coordinates_plant = hcat((1:I) .* J / I, zeros(I))
        return [
            1 + sqrt(
                (coordinates_market[j, 1] - coordinates_plant[l, 1])^2 +
                (coordinates_market[j, 2] - coordinates_plant[l, 2])^2
            )
            for j = 1:J, l = 1:I
        ]
    end
    τ = distmat(J, I)
    return Data(I, J, S, 5, 0.15, z, β, τ)
end

function f1(data::Data, x::Vector{Float64})
    Eπ, Eν = 0.0, zeros(data.I)
    for s in 1:data.S
        temp = f2(data, x, s)
        Eπ += temp[1]
        Eν .+= temp[2]
    end
    return Eπ / data.S - data.α * sum(x), Eν ./ data.S .- data.α
end

function parallel_f1(data::Data, x::Vector{Float64})
    pool = CachingPool(workers())
    let data = data
        ret = pmap(s -> f2(data, x, s), pool, 1:data.S)
        Eπ = sum(r[1] for r in ret) / data.S - data.α * sum(x)
        Eν = sum(r[2] for r in ret) ./ data.S .- data.α
        return Eπ , Eν ./ data.S .- data.α    
    end
end

data = generate_data(I = 25, J = 5, S = 10)

@time f1(data, ones(data.I))
@time parallel_f1(data, ones(data.I))

jmcastro2109 · February 1, 2022, 2:33am

Thanks. This looks great!

I don’t end up grasping conceptually where is the difference between your code and what I am doing. Can you elaborate a little bit more? Just to learn…

odow · February 1, 2022, 3:34am

I don’t think there is anything conceptually different. I just used fewer external packages. (For example, I don’t know how @floops parallelizes or takes care of data management, so I just used pmap. Not adding a caching pool could cause you to copy all the data to every proc every time, which might be one cause of trouble.)

Things different are:

Wrap all of the data in a single struct to avoid global variables
Avoid @inbounds and @view (I only add these if it has a demonstrable impact on performance)
Use pmap instead of @floops

jmcastro2109 · February 1, 2022, 3:37pm

This is interesting! Another thing that I find interesting is that by some reason the parallel version requires less iterations in the NLopt that the other one to get at the same solution.

jmcastro2109 · February 1, 2022, 8:56pm

@odow, one question. Is there any way to wra[ your excellent solution in another parallel loop. I am trying to use the MWE below, but I think this is not a proper solution for a nested parallelized loop.

using Distributed, BenchmarkTools, SharedArrays

addprocs(8)


@everywhere begin
    using JuMP, MosekTools, NLopt
    import Distributions
    import Random
end

@everywhere struct Data
    N::Int
    I::Int
    J::Int
    S::Int
    σ::Int
    α::Float64
    z::Array{Float64}
    β::Vector{Float64}
    τ::Matrix{Float64}
end

@everywhere function f2(data::Data, Cap::Vector{Float64}, s::Int, n::Int)
    opt = JuMP.Model(optimizer_with_attributes(Mosek.Optimizer, "QUIET" => false, "INTPNT_CO_TOL_DFEAS" => 1e-7))
    set_silent(opt)
    σ′ = ((data.σ - 1) / data.σ)^(data.σ - 1) / data.σ
    @variable(opt, ν[1:data.I] >= 0)
    @variable(opt, μ[1:data.J] >= 0)
    @variable(opt, t[1:data.J] >= 0)
    @objective(
        opt,
        Min,
        sum(Cap[i] * ν[i] for i in 1:data.I) +
        sum(σ′ * data.β[j]^data.σ * t[j] for j in 1:data.J),
    )
    @constraint(
        opt,
        [i = 1:data.I, j = 1:data.J],
        data.τ[j, i] * ν[i] - μ[j] >= -data.τ[j, i] / data.z[i, s, n],
    )
    @constraint(
        opt,
        [j = 1:data.J],
        [t[j], μ[j], 1.0] in MOI.PowerCone(1 / data.σ)
    )
    JuMP.optimize!(opt)
    return objective_value(opt), value.(ν)
end

function generate_data(;N=3, I = 250, J = 50, S = 100)
    σ = 5
    α = 0.15
    Random.seed!(12345)
    z = exp.(rand(Distributions.Normal(-2.5 / 2, sqrt(2.5)), I, S, N)) .+ rand(I)
    #z = ones(I,S)
    β = ones(J)
    function distmat(J, I)
        coordinates_market = hcat(1:J, zeros(J))
        coordinates_plant = hcat((1:I) .* J / I, zeros(I))
        return [
            1 + sqrt(
                (coordinates_market[j, 1] - coordinates_plant[l, 1])^2 +
                (coordinates_market[j, 2] - coordinates_plant[l, 2])^2
            )
            for j = 1:J, l = 1:I
        ]
    end
    τ = distmat(J, I)
    return Data(N, I, J, S, 5, 0.15, z, β, τ)
end

function f1(data::Data, x::Vector{Float64}, n::Int)
    Eπ, Eν = 0.0, zeros(data.I)
    for s in 1:data.S
        temp = f2(data, x, s, n)
        Eπ += temp[1]
        Eν .+= temp[2]
    end
    return Eπ / data.S - data.α * sum(x), Eν ./ data.S .- data.α
end

@everywhere function parallel_f1(data::Data, x::Vector{Float64}, n::Int)
    pool = CachingPool(workers())
    let data = data
        ret = pmap(s -> f2(data, x, s, n), pool, 1:data.S)
        Eπ = sum(r[1] for r in ret) / data.S - data.α * sum(x)
        Eν = sum(r[2] for r in ret) ./ data.S .- data.α
        return Eπ , Eν 
    end
end

data = generate_data(N = 3, I = 25, J = 5, S = 10)


@everywhere function tempfunc(x::Vector, grad::Vector, n::Int)
    temp = parallel_f1(data,x,n)
    if length(grad) > 0
        grad[:]=-temp[2]
    end
    return -temp[1]
end

@everywhere function opt_n(data::Data, guess::Vector{Float64}, n::Int)
    opt = Opt(:LD_MMA, data.I)
    opt.lower_bounds = zeros(data.I)
    opt.xtol_rel = 1e-3
    opt.ftol_rel = 1e-8

    opt.min_objective = (x,grad)->tempfunc(x,grad,n)

    (minf,minx,ret) = NLopt.optimize(opt, guess)
    numevals = opt.numevals # the number of function evaluations
    #println("got $minf at $minx after $numevals iterations in (returned $ret)")

    return minx
end    


function big_loop(data)
       solution = Array{Float64,2}(undef,data.I,data.N)
    for n = 1 : data.N
        solution[:,n] = opt_n(data, ones(data.I), n)
    end    
    return solution
end    

function parallel_big_loop(data)
    solution = SharedArray{Float64}(data.I,data.N)
    @sync @distributed for n = 1 : data.N
        solution[:,n] = opt_n(data, ones(data.I), n)
    end    
    return solution
end  

sol = @btime opt_n(data,ones(data.I),1)

TEST1 = @btime big_loop(data)


TEST2 = @btime parallel_big_loop(data)


TEST1 == TEST2

I had already asked a related question in this post, and I got a solution in terms of @floops, but of course did not have the pmap on the inner loop that is creating such great results.

odow · February 1, 2022, 10:02pm

Ah! You left this part out of the design…

It depends on whether N or S is bigger, and how many cores you have. You might be best running N in parallel, and having each core run over S in serial. Or you might be better running N in serial and parallelizing over S.

You could also partition pool = CachingPool(workers()) to only provide a subset of the workers based on n. (workers() just returns a Vector{Int}, so you could pass a subset of the vector.)

Assuming N is small, I would just run big_loop. Is the time sufficient? If so, move on. You can spend more time trying to optimize than it would have taken just to run in the first place.

jmcastro2109 · February 1, 2022, 10:17pm

Yes, my bad . I am really sorry.

N is larger than S. Assume N should be around 1000 and S should be around 100. I can use a lot of cores, say 256, and a lot of memory. Computational power should not be a first order constraint I think.

I guess that the best solution would be partitioning pool = CachingPool(workers()). Is there any way of doing this eficiently and smartly? I understand the core of your idea. However, since I have a larger N than cores I do not see how to implement it.

I guess that ideally, one would like to do something like the following:

Run N0, n’s concurrently and assign to each n a number of cores equal to Total Cores/N0. Right?

Thanks so much and sorry again.

P.S: is there any reason to use clear!(pool)?

jmcastro2109 · February 1, 2022, 10:19pm

@tkf

odow · February 1, 2022, 10:31pm

You’re probably best to run N in parallel over the 256 cores, and have each core use f1.

I’d do something like:

using Distributed

@everywhere begin
    using JuMP
    import Distributions
    import MosekTools
    import NLopt
    import Random
end

@everywhere struct Data
    N::Int
    I::Int
    J::Int
    S::Int
    σ::Int
    α::Float64
    z::Array{Float64}
    β::Vector{Float64}
    τ::Matrix{Float64}
end

@everywhere function f2(data::Data, Cap::Vector{Float64}, s::Int, n::Int)
    opt = Model(Mosek.Optimizer)
    set_optimizer_attribute(opt, "QUIET", false)
    set_optimizer_attribute(opt, "INTPNT_CO_TOL_DFEAS", 1e-7)
    set_silent(opt)
    σ′ = ((data.σ - 1) / data.σ)^(data.σ - 1) / data.σ
    @variable(opt, ν[1:data.I] >= 0)
    @variable(opt, μ[1:data.J] >= 0)
    @variable(opt, t[1:data.J] >= 0)
    @objective(
        opt,
        Min,
        sum(Cap[i] * ν[i] for i in 1:data.I) +
        sum(σ′ * data.β[j]^data.σ * t[j] for j in 1:data.J),
    )
    @constraint(
        opt,
        [i = 1:data.I, j = 1:data.J],
        data.τ[j, i] * ν[i] - μ[j] >= -data.τ[j, i] / data.z[i, s, n],
    )
    @constraint(
        opt,
        [j = 1:data.J],
        [t[j], μ[j], 1.0] in MOI.PowerCone(1 / data.σ)
    )
    optimize!(opt)
    return objective_value(opt), value.(ν)
end

function generate_data(; N=3, I = 250, J = 50, S = 100)
    σ = 5
    α = 0.15
    Random.seed!(12345)
    z = exp.(rand(Distributions.Normal(-2.5 / 2, sqrt(2.5)), I, S, N)) .+ rand(I)
    β = ones(J)
    function distmat(J, I)
        coordinates_market = hcat(1:J, zeros(J))
        coordinates_plant = hcat((1:I) .* J / I, zeros(I))
        return [
            1 + sqrt(
                (coordinates_market[j, 1] - coordinates_plant[l, 1])^2 +
                (coordinates_market[j, 2] - coordinates_plant[l, 2])^2
            )
            for j = 1:J, l = 1:I
        ]
    end
    τ = distmat(J, I)
    return Data(N, I, J, S, 5, 0.15, z, β, τ)
end

@everywhere function f1(data::Data, x::Vector{Float64}, n::Int)
    Eπ, Eν = 0.0, zeros(data.I)
    for s in 1:data.S
        temp = f2(data, x, s, n)
        Eπ += temp[1]
        Eν .+= temp[2]
    end
    return Eπ / data.S - data.α * sum(x), Eν ./ data.S .- data.α
end

@everywhere function opt_n(data::Data, n::Int)
    opt = NLopt.Opt(:LD_MMA, data.I)
    opt.lower_bounds = zeros(data.I)
    opt.xtol_rel = 1e-3
    opt.ftol_rel = 1e-8
    opt.min_objective = (x, grad) -> begin
        f, ∇f = f1(data, x, n)
        if length(grad) > 0
            grad .= -∇f
        end
        return -f
    end
    _, min_x, _ = NLopt.optimize(opt, ones(data.I))
    return min_x
end  

@everywhere function parallel_big_loop(data::Data)
    pool = CachingPool(workers())
    let data = data
        ret = pmap(n -> opt_n(data, n), pool, 1:data.N)
        return hcat(ret...)
    end
end

data = generate_data(N = 3, I = 25, J = 5, S = 10)
parallel_big_loop(data)

Topic		Replies	Views
Parallelizing for loop in the computation of a gradient Performance question	19	2599	February 26, 2018
Poor performance while multithreading (Julia 1.0) Performance multithreading	28	4013	February 11, 2019
JuMP, excessive memory usage? Optimization (Mathematical) question	16	3287	February 14, 2018
Speeding up repeated optimizations within a large loop Performance economics , optim , optimization	7	876	June 25, 2020
Performance of Computation of Expectation of slow function Optimization (Mathematical) jump , performance , optimization	13	1035	January 30, 2022

Improve performance of a code with very large % of Garbage Collection

Related topics