GC time issues when parallelizing tyalorinteg?

Hi all,

I was messing around with Threads.@thread as I am trying to parallelize a for loop : I want to compute taylorinteg from TaylorIntegration.jl several times, independently, for different initial conditions. To do this, I wrote a MWE reproducing the Kepler example in Jupyter Notebook Viewer

Here is the piece of code :

using TaylorIntegration

const μ = 1.0
const q0 = [0.19999999999999996, 0.0, 0.0, 3.0] # a initial condition for elliptical motion
const order = 28
const t0 = 0.0
const t_max = 10*(2π) # we are just taking a wild guess about the period ;)
const abs_tol = 1.0E-20
const steps = 500000

const r_p3d2 = TaylorSeries.Taylor1{Float64}

#the equations of motion for the Kepler problem:
function kepler!(dq, q, params, t)
    r_p3d2 = (q[1]^2+q[2]^2)^(3/2)
    
    dq[1] = q[3]
    dq[2] = q[4]
    dq[3] = -μ*q[1]/r_p3d2
    dq[4] = -μ*q[2]/r_p3d2
    
    nothing
end

function task()
    t, _ = taylorinteg(kepler!, q0, t0, t_max, order, abs_tol, maxsteps=steps)
    return t[end]
end

function f_par(x)
    xn = zeros(x)
    Threads.@threads for i in 1:x
        xn[i] = task()
    end
    return nothing
end


function f(x)
    xn = zeros(x)
    for i in 1:x
        xn[i] = task()
    end
    return nothing
end

However, when I try it with @time, the parallelized version (I use 72 CPU hearts on the server) isn’t much faster, mainly because of GC time :

julia> @time f(1000)
 27.500834 seconds (198.44 M allocations: 55.672 GiB, 3.21% gc time)

julia> @time f_par(1000)
 26.797330 seconds (198.44 M allocations: 55.672 GiB, 76.07% gc time)

One more problem : the GC time varies greatly, seemingly randomly (3% to 67% for the non-parallelized version for instance). Can you tell if it is a consequence of server activity, my code, or intern to taylorinteg ?

(This is on Julia 1.6.5)

Thanks !

The TaylorIntegration library seem to do a lot of allocations:

julia> @time task()
  0.018372 seconds (198.46 k allocations: 52.447 MiB, 14.05% gc time)
62.83185307179586

With more threads running concurrently, there allocations / time will increase to the point where the GC basically cannot keep up. I think the library needs to be optimized a bit to reduce the number of allocations.

This makes sense, thanks !

Do you have an idea about the instability of the GC time percentage ?

julia> @time f(10)
  0.431646 seconds (1.98 M allocations: 570.078 MiB)

julia> @time f(10)
  0.903906 seconds (1.98 M allocations: 570.078 MiB, 52.85% gc time)

julia> @time f(10)
  0.480541 seconds (1.98 M allocations: 570.078 MiB, 11.01% gc time)

The GC has various heuristics which relate to the age of objects and total memory allocated etc. So it isn’t too surprising that it varies between runs.

OK, got it. Thank you very much!

Indeed, TaylorIntegration allocates a lot.

Yet, if you use the macro @taylorize to parse your ODEs, things become better:

using TaylorIntegration

const μ = 1.0
const q0 = [0.19999999999999996, 0.0, 0.0, 3.0] # a initial condition for elliptical motion
const order = 28
const t0 = 0.0
const t_max = 10*(2π) # we are just taking a wild guess about the period ;)
const abs_tol = 1.0E-20
const steps = 500000

@taylorize function kepler!(dq, q, params, t)
    r_p3d2 = (q[1]^2+q[2]^2)^(3/2)
    
    dq[1] = q[3]
    dq[2] = q[4]
    dq[3] = -(μ*q[1])/r_p3d2  # parenthesis needed to help `@taylorize`
    dq[4] = -(μ*q[2])/r_p3d2
    
    nothing
end

function task()
    t, _ = taylorinteg(kepler!, q0, t0, t_max, order, abs_tol, maxsteps=steps)
    return t[end]
end

function task_noparse()
    t, _ = taylorinteg(kepler!, q0, t0, t_max, order, abs_tol, maxsteps=steps, parse_eqs=false)
    return t[end]
end

Then I get

@time task()  # second run of task()
  0.003598 seconds (2.31 k allocations: 19.578 MiB)

@time task_noparse() # second run of task()
  0.144778 seconds (198.42 k allocations: 52.445 MiB, 64.56% gc time)

Allocations are improved almost by a factor 2.6, and the time elapsed is reduced by a factor 40. I am using Julia 1.8 and TaylorIntegration v0.9.1.

Quite the late reply, but it got much better this way indeed! Using this in the f and f_par functions, with 8 threads, I now have about a 4-5 better speed with the parallelized version, as expected. GC also takes about 7% of time for both cases. Case solved, thank you!

Happy it was helpful! Notice that in the last released version, @taylorize has been improved, specially in managing allocations.