Help reduce large gc time


#1

Hi,

In PDMP.jl, I have some large gc time reported. I looked at the performance tips and it seems my code is not type stable. For example, considering the example tcp.jl,
I find that pdmp! is not type stable even if the return type is always pdmpResult. I could anotate the return type like pdmp!(,,,)::pdmpResult but I dont understand why Julia is not able to do so.

Can anyone help me understand this please?

If you have further suggestions on my function pdmp!, I am also interested.

Thank you for your help,

Best,

result3=  @code_warntype PDMP.pdmp!(xc0,xd0,F_tcp!,R_tcp!,nu_tcp,parms,0.0,tf,ode=:lsoda,n_jumps = 100)
Body::Any
 1 ── %1  = PDMP.pdmp!::typeof(pdmp!)                                       β”‚β•»       #pdmp!#37
 β”‚    %2  = PDMP.Delta_dummy::typeof(PDMP.Delta_dummy)                      β”‚β”‚
 β”‚    %3  = Ο€ (false, Bool)                                                 β”‚β”‚β•»       #pdmp!
 β”‚    %4  = (Base.getfield)(#temp#, :ode)::Symbol                           β”‚β”‚β”‚β•»       getindex
 β”‚    %5  = Ο€ (:chv, Symbol)                                                β”‚β”‚β”‚
 β”‚    %6  = (Base.getfield)(#temp#, :n_jumps)::Int64                        β”‚β”‚β”‚β•»       getindex
 β”‚          (Base.ifelse)(true, 1, -2)                                      β”‚β”‚β”‚β”‚β•»β•·      Type
 β”‚    %8  = %new(UnitRange{Int64}, -1, 1)::UnitRange{Int64}                 β”‚β”‚β”‚β”‚β”‚
 β”‚          (Base.ifelse)(true, 1, -2)                                      β”‚β”‚β”‚β”‚β•»β•·      Type
 β”‚    %10 = %new(UnitRange{Int64}, -1, 1)::UnitRange{Int64}                 β”‚β”‚β”‚β”‚β”‚
 β”‚    %11 = Ο€ (1.0, Float64)                                                β”‚β”‚β”‚
 β”‚    %12 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Any,1}, svec(Any, Int64), :(:ccall), 2, Array{Any,1}, 0, 0))::Array{Any,1}
 β”‚    %13 = (Base.sle_int)(1, 1)::Bool                                      β”‚β”‚β”‚β”‚β•»β•·β•·β•·    iterate
 └───       goto #3 if not %13                                              │││││┃│      iterate
 2 ── %15 = (Base.sle_int)(1, 0)::Bool                                      ││││││┃│      iterate
 └───       goto #4                                                         β”‚β”‚β”‚β”‚β”‚β”‚β”‚
 3 ──       nothing                                                         β”‚
 4 ┄─ %18 = Ο† (#2 => %15, #3 => false)::Bool                                β”‚β”‚β”‚β”‚β”‚β”‚β”‚
 └───       goto #6 if not %18                                              β”‚β”‚β”‚β”‚β”‚β”‚β”‚
 5 ──       invoke Base.getindex(()::Tuple{}, 1::Int64)                     β”‚β”‚β”‚β”‚β”‚β”‚β”‚
 └───       $(Expr(:unreachable))                                           β”‚β”‚β”‚β”‚β”‚β”‚β”‚
 6 ──       goto #8                                                         β”‚β”‚β”‚β”‚β”‚β”‚β”‚
 7 ──       $(Expr(:unreachable))                                           β”‚β”‚β”‚β”‚β”‚β”‚β”‚
 8 ┄─       goto #9                                                         β”‚β”‚β”‚β”‚β”‚β”‚
 9 ──       goto #10                                                        β”‚β”‚β”‚β”‚β•»       iterate
 10 ─       goto #11                                                        β”‚β”‚β”‚β”‚
 11 ─ %27 = Ο€ (%3, Bool)                                                    β”‚β”‚β”‚
 β”‚    %28 = Ο€ (%5, Symbol)                                                  β”‚β”‚β”‚
 β”‚    %29 = Ο€ (%11, Float64)                                                β”‚β”‚β”‚
 β”‚    %30 = invoke PDMP.:(#pdmp!#35)(%27::Bool, %4::Symbol, %28::Symbol, %6::Int64, %8::UnitRange{Int64}, %10::UnitRange{Int64}, %29::Float64, %12::Array{Any,1}, %1::typeof(pdmp!), _4::Array{Float64,1}, _5::Array{Int64,1},_6::typeof(F_tcp!), _7::typeof(R_tcp!), %2::typeof(PDMP.Delta_dummy), _8::Array{Int64,2}, _9::Array{Float64,1}, _10::Float64, _11::Float64)::Any
 └───       goto #12                                                        β”‚β”‚β”‚
 12 ─       goto #13                                                        β”‚β”‚
 13 ─       return %30                 

#2

Please try to provide a minimal example that exhibits the problem. What you have shown there is just the setup to call functions with keyword arguments, which happens at

 β”‚    %30 = invoke PDMP.:(#pdmp!#35)(%27::Bool, %4::Symbol, %28::Symbol, %6::Int64, %8::UnitRange{Int64}, %10::UnitRange{Int64}, %29::Float64, %12::Array{Any,1}, %1::typeof(pdmp!), _4::Array{Float64,1}, _5::Array{Int64,1},_6::typeof(F_tcp!), _7::typeof(R_tcp!), %2::typeof(PDMP.Delta_dummy), _8::Array{Int64,2}, _9::Array{Float64,1}, _10::Float64, _11::Float64)::Any

#3
using PDMP, LinearAlgebra, Random

function F_eva!(xcdot, xc, xd, t::Float64, parms::Vector{Float64})
  # vector field used for the continuous variable
  xcdot[1] = -xc[1]+1
  nothing
end

function R_eva(rate,xc, xd, t::Float64, parms, sum_rate::Bool)
  # rate function
  rate_print = 1.
  if sum_rate == false
    if xd[1] == 0
        rate[1] = 1.0
        rate[2] = 0.0
        rate[3] = rate_print
      return 0. #transition 0->1
    else
        rate[1] = 0.0
        rate[2] = 1.0
        rate[3] = rate_print
      return 0.0 #transition 1->0
    end
  else
    if xd[1] == 0
      return 1.0 + rate_print #transition 0->1
    else
      return 1.0 + rate_print #transition 1->0
    end
  end
end

function Delta_xc_eva(xc, xd, t::Float64, parms::Vector{Float64}, ind_reaction::Int64)
  # this function return the jump in the continuous component
  if ind_reaction==2
    xc[1] = 0.0
  end
  return true
end

xc0 = vec([0.0])
xd0 = vec([0, 1])

nu_eva = [[1 0];[-1 0];[0 1]]
parms = [0.1,0.01]
tf = 100.

dummy_t =  @time PDMP.chv!(200000,xc0,xd0,F_eva!,R_eva,Delta_xc_eva,nu_eva,parms,0.0,tf,false,ode=:lsoda)

The last line have relatively often large gc time, like >50%, when run repeateadly. For example:

julia> @benchmark PDMP.chv!(200000,xc0,xd0,F_eva!,R_eva,Delta_xc_eva,nu_eva,parms,0.0,tf,false,ode=:lsoda)
BenchmarkTools.Trial: 
  memory estimate:  6.55 MiB
  allocs estimate:  9862
  --------------
  minimum time:     919.075 ΞΌs (0.00% GC)
  median time:      2.062 ms (0.00% GC)
  mean time:        2.635 ms (34.69% GC)
  maximum time:     7.660 ms (62.66% GC)
  --------------
  samples:          1894
  evals/sample:     1

#4

So if any line in the code posted is removed it stops allocating?


#5

Not exactly (I modified the code above to hopefully simplify it).
The function call I am interested in is:

PDMP.chv!(200000,xc0,xd0,F_eva!,R_eva,Delta_xc_eva,nu_eva,parms,0.0,tf,false,ode=:lsoda)

This function is allocating. In short, it calls 200_000 times an ODE solver, namely LSODA.


#6

The 3 functions F_eva!,R_eva,Delta_xc_eva are inplace. The only function allocating is PDMP.chv!. That’s the one I would like to reduce gc time


#7

So it’s actually not the code above that you want to optimize but the chv! method (which isn’t given above). I recommend you benchmark this method internally and try to extract a MWE for the allocating piece of it. Just quickly looking at the definition in PDMP.jl it is certainly allocating (you do have, for example, array slices at the rhs of assignments.) Maybe you could use views at a couple of places.


#8

How do I do this? There is β€œbetter” than @profiler and @benchmark ?


#9

There is also julia --track-allocation, see here.


#10

Salut,

1 profile (using https://github.com/timholy/ProfileView.jl)
2 look at where the bottleneck is
3 simplify to the maximum to get a minimal working example (MWE) that displays the performance problem you are concerned about

Here you’re calling an ODE solver, which necessarily allocates something (unless the solver is able to pre-allocate). This is not necessarily bad; only profiling will tell. You’re apparently passing a lot of functions around, which might be tricky for the compiler (to be checked).


#11

Using ProfileView, the longest part is the call to lsoda in LSODA.jl.
However I am not sure ProfileView is able to deal with gc time, is it? Why in some cases, I have 35% gc time when calling the same function?


#12

gc time is usually a symptom of slow code (type unstable, lots of small allocations, etc.) rather than a cause by itself. If you streamline the code, avoid β€œinference towers” (the very large recursive calls that appear when profiling type unstable code) and minimize allocations, that should get rid of the gc time.


#13

Hi,

I have been trying another idea here. In fact, to sample a PDMP, one call solve(ode) many times in a row and this generate a lot of allocations. A better solution, thanks to a suggestion from JuliaDiffEq gitter is to use an iterator to solve the ODE.

I have put this in PDMP.jl but the allocations are still high whereas it should not allocate this much.

A MWE is the following. My new code is after println(" ----> DiffEq")

using PDMP, LinearAlgebra, Random, DifferentialEquations


function F_tcp!(αΊ‹, xc, xd, t, parms)
    # vector field used for the continuous variable
    if mod(xd[1],2)==0
         αΊ‹[1] = 1.#xc[1]
    else
         αΊ‹[1] = -1.#xc[1]
    end
    nothing
end

function R_tcp!(rate, xc, xd, t, parms, sum_rate::Bool)
    # rate fonction
    if sum_rate==false
        rate[1] = 5.0/(1.0 + exp(-xc[1]/1.0 + 5.0)) + 0.1
        rate[2] = parms[1]
        return 0.
    else
        return 5.0/(1.0 + exp(-xc[1]/1.0 + 5.0)) + 0.1 + parms[1]
    end
end

xc0 = vec([0.05])
xd0 = vec([0, 1])

nu_tcp = [[1 0];[0 -1]]
parms = vec([0.1]) # sampling rate
tf = 100000.

println("--> inplace implementation,\n ----> cvode")
# more efficient way, inplace modification
Random.seed!(1234)
result2 =        PDMP.pdmp!(xc0, xd0, F_tcp!, R_tcp!, nu_tcp, parms, 0.0, tf, n_jumps = 2,   ode = :cvode)
println(result2.time)
Random.seed!(1234)
result2 =  @time PDMP.pdmp!(xc0, xd0, F_tcp!, R_tcp!, nu_tcp, parms, 0.0, tf, n_jumps = 1000, ode = :cvode)

Random.seed!(1234)
println(" ----> lsoda")
result3 =        PDMP.pdmp!(xc0, xd0, F_tcp!, R_tcp!, nu_tcp, parms, 0.0, tf, ode=:lsoda, n_jumps = 2)
println(result3.time)
Random.seed!(1234)
result3 =  @time PDMP.pdmp!(xc0, xd0, F_tcp!, R_tcp!, nu_tcp, parms, 0.0, tf, ode=:lsoda, n_jumps = 1000)

Random.seed!(1234)
println(" ----> DiffEq")

result4 =        PDMP.chv_diffeq!(xc0,xd0,
                F_tcp!,R_tcp!,PDMP.Delta_dummy,
                nu_tcp,parms,0.0,tf,false, n_jumps = 2, save_positions = (false,true))
println(result4.time)
Random.seed!(1234)
result4 =  @time PDMP.chv_diffeq!(xc0,xd0,
                F_tcp!,R_tcp!,PDMP.Delta_dummy,
                nu_tcp,parms,0.0,tf,false, n_jumps = 1000,ode = Tsit5(), save_positions = (false,false))

The result is the following:

--> inplace implementation,
 ----> cvode
[0.0, 1.77942]
  0.018689 seconds (149.66 k allocations: 5.006 MiB, 49.82% gc time)
 ----> lsoda
[0.0, 1.77942]
  0.003533 seconds (78.45 k allocations: 3.329 MiB)
 ----> DiffEq
[0.0, 1.77942]
  0.005424 seconds (137.60 k allocations: 2.231 MiB)
--> stopping time == tf? (not more) false

These call compute the same thing using:

  • CVODE for the first case, directly called from Sundials.jl
  • LSODA for the second case, directly called from LSODA.jl
  • DifferentialEquations with a new code based on the iterator idea.

The last call result4 = ... should be the one with minimal allocations but it is clearly not. You might need to have a look at here for the implementation.

If anyone has an idea, I’d be delighted :smiley:


#14

Your allocations are tiny (16 byte mean). Views are larger.

This suggests that they mostly are pointer+tag. Have you looked at @code_warntype ?

Are you using closures? Try replacing closures by callable structs. Are you running into the function specialization issue?

If one of the arguments to a function is itself <:Function, and it is not directly called (not in head position in the AST), then the compiler can refuse to specialize on the type, and this causes type instability. Then @code_warntype and even @code_native will lie be optimistic about this. I cannot resist punning about fully realized code versus really existing code (you debug the β€œKlassenlose Gesellschaft” while executing β€œRealsozialismus”).

You fix that by writing function some_functor(some_fun::T, args...) where {T<:Function} instead of function some_functor(some_fun<:Function, args...). Apart from the function specialization heuristic, both ways of writing the definition are equivalent. Or function some_functor(some_fun::T, args...) where {T} instead of function some_functor(some_fun, args...), if you don’t want to restrict to <:Function.


#15

β€œTry replacing closures by callable structs.” => this is what I am doing. The struct is there


#16

This is interesting!! Note that my struct have these functions defined as

mutable struct A
F::Function
end

and then I define

function (a::A)(x)
  return a.F(x)
end

This last callable struct is called repeatdly. Would this be problematic as well?

I did not see changes by doing:

function chv_diffeq!(xc0::AbstractVector{Float64},xd0::AbstractVector{Int64},
				F::Tf,R::Tr,DX::Td,
				nu::AbstractArray{Int64},parms::Tp,
				ti::Float64, tf::Float64,
				verbose::Bool = false;
				ode = Tsit5(),ind_save_d=-1:1,ind_save_c=-1:1,save_positions=(false,true),n_jumps::Int64 = Inf64) where {Tp, Tf <: Function,Tr <: Function,Td <: Function}

				# custom type to collect all parameters in one structure
				problem  = PDMPProblem{Float64,Int64,Tp}(xc0,xd0,F,R,DX,nu,parms,ti,tf)
				problem.save_pre_jump = save_positions[1]
				problem.verbose = verbose

				chv_diffeq!(problem,ti,tf;ode=ode,ind_save_c = ind_save_c, ind_save_d = ind_save_d, save_positions = save_positions,n_jumps = n_jumps)
end

#17

There we have the problem. Function is an abstract type, and therefore a.F(x) is dynamically dispatched and its return type cannot be inferred (sometimes unreliable constant propagation might manages to infer this even though it is impossible in general, but don’t count on it).

Write this as

mutable struct A{Fun_type}
F::Fun_type
end

or, even better, as

struct A{Fun_type}
F::Fun_type
end

You could restrict Fun_type<:Function and could also make A<:Function. But unless you have extra parameters, this is not helpful.

If you really, really must change the function in struct A at runtime (you shouldn’t! make a callable mutable struct that carries all additional parameters you need to mutate as fields), then check out https://github.com/yuyichao/FunctionWrappers.jl. This prevents all inlining and and afaik also prevents use of the julia native ABI (everything goes through the C abi).


#18

Thank you for your suggestions!

No the functions in the struct are immutable.

With the new types as pushed here, I extracted the functions into an immutable type, it removes 2/3 of the allocations. There are still 47k allocations of 17 byte. Maybe I should try to extract as much immutables as I can in another struct?

julia> include("/Users/rveltz/work/prog_gd/julia/dev/PDMP.jl/examples/tcp.jl")
--> inplace implementation,
 ----> cvode
[0.0, 1.77942]
  0.017421 seconds (149.65 k allocations: 5.006 MiB)
 ----> lsoda
[0.0, 1.77942]
  0.003582 seconds (78.45 k allocations: 3.329 MiB)
 ----> DiffEq
[0.0, 1.77942]
  0.002935 seconds (46.89 k allocations: 820.219 KiB)
--> stopping time == tf? (not more) false
#jumps = 1000

#19

Then try

struct A{Fun_type}
F::Fun_type
params::param_type
end

edit: Oh, more problems.

mutable struct PDMPProblem{Tc,Td,Tp,TF,TR,TD}
	xc::AbstractVector{Tc}		# continuous variable
	xd::AbstractVector{Td}		# discrete variable
...

This is wrong.

mutable struct PDMPProblem{vectype_xc<:AbstractVector{Tc},vectype_xd<:AbstractVector{Td},Tp,TF,TR,TD}
	xc::vectype_xc		# continuous variable
	xd::vectype_xd		# discrete variable
...

Julia does support enforced interfaces. Therefore, you must be concrete in order to permit inference: Nobody guarantees that xc[i] is of type Tc just because xc::AbstractVector{Tc}.

The general guideline is that every type should always be concrete. The only reason for abstract types to exist is that you sometimes want to write a single function that does different things, depending on the input types (one function, multiple methods), and then abstract types are nice for organizing this mess. (ok, this is a little bit of an exaggeration: knowing the abstract type speeds up dynamic dispatch compared to ::Any, but numerical code should only have dynamic dispatch at very few carefully chosen points with carefully chosen function boundaries and @noinline for easier debugging)


#20

To give an idea, this is the kind of allocations I am expecting.

julia> using DifferentialEquations

julia> function lorenz(du,u,p,t)
        du[1] = 10.0*(u[2]-u[1])
        du[2] = u[1]*(28.0-u[3]) - u[2]
        du[3] = u[1]*u[2] - (8/3)*u[3]
       end
lorenz (generic function with 1 method)

julia> u0 = [1.0;0.0;0.0]
3-element Array{Float64,1}:
 1.0
 0.0
 0.0

julia> tspan = (0.0,100.0)
(0.0, 100.0)

julia> prob = ODEProblem(lorenz,u0,(0.0,1000.0))

julia> sol = @time solve(prob,save_everystep = true)
  7.329554 seconds (23.71 M allocations: 1.185 GiB, 6.90% gc time)

julia> sol = @time solve(prob,save_everystep = false)
 0.011872 seconds (183 allocations: 12.922 KiB, 54.96% gc time)