GC time issues when parallelizing taylorinteg?

Hi all,

I was messing around with Threads.@threads as I am trying to parallelize a for loop: I want to run taylorinteg from TaylorIntegration.jl several times, independently, for different initial conditions. To do this, I wrote an MWE reproducing the Kepler example on Jupyter Notebook Viewer.

Here is the code:

using TaylorIntegration

const μ = 1.0
const q0 = [0.19999999999999996, 0.0, 0.0, 3.0] # an initial condition for elliptical motion
const order = 28
const t0 = 0.0
const t_max = 10*(2π) # we are just taking a wild guess about the period ;)
const abs_tol = 1.0E-20
const steps = 500000

#the equations of motion for the Kepler problem:
function kepler!(dq, q, params, t)
    r_p3d2 = (q[1]^2+q[2]^2)^(3/2)
    
    dq[1] = q[3]
    dq[2] = q[4]
    dq[3] = -μ*q[1]/r_p3d2
    dq[4] = -μ*q[2]/r_p3d2
    
    nothing
end

function task()
    t, _ = taylorinteg(kepler!, q0, t0, t_max, order, abs_tol, maxsteps=steps)
    return t[end]
end

function f_par(x)
    xn = zeros(x)
    Threads.@threads for i in 1:x
        xn[i] = task()
    end
    return nothing
end


function f(x)
    xn = zeros(x)
    for i in 1:x
        xn[i] = task()
    end
    return nothing
end
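(For context: the number of threads is fixed when Julia starts, e.g. via `julia --threads 72` or the `JULIA_NUM_THREADS` environment variable, and can be checked with

julia> Threads.nthreads()
72

so `Threads.@threads` in f_par actually has workers to spread the loop over.)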

However, when I time it with @time, the parallelized version (I use 72 CPU cores on the server) isn’t much faster, mainly because of GC time:

julia> @time f(1000)
 27.500834 seconds (198.44 M allocations: 55.672 GiB, 3.21% gc time)

julia> @time f_par(1000)
 26.797330 seconds (198.44 M allocations: 55.672 GiB, 76.07% gc time)

One more problem: the GC time varies greatly, seemingly at random (from 3% to 67% for the non-parallelized version, for instance). Can you tell whether this is a consequence of server activity, of my code, or internal to taylorinteg?

(This is on Julia 1.6.5)

Thanks !

The TaylorIntegration library seems to do a lot of allocations:

julia> @time task()
  0.018372 seconds (198.46 k allocations: 52.447 MiB, 14.05% gc time)
62.83185307179586

With more threads running concurrently, the allocation rate will increase to the point where the GC basically cannot keep up. I think the library needs to be optimized a bit to reduce the number of allocations.
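As a quick diagnostic (just a sketch), `@allocated` reports the bytes allocated by a single call, which shows how the GC pressure scales with the number of loop iterations:

julia> @allocated task()  # bytes allocated by one integration (~52 MiB, matching the timing above)

Since every thread allocates at this rate simultaneously, and Julia's GC stops all threads when it collects, the threads end up spending much of their time waiting on the collector.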

This makes sense, thanks !

Do you have any idea why the GC time percentage is so unstable?

julia> @time f(10)
  0.431646 seconds (1.98 M allocations: 570.078 MiB)

julia> @time f(10)
  0.903906 seconds (1.98 M allocations: 570.078 MiB, 52.85% gc time)

julia> @time f(10)
  0.480541 seconds (1.98 M allocations: 570.078 MiB, 11.01% gc time)

The GC has various heuristics relating to the age of objects, the total memory allocated, etc., so it isn’t too surprising that its share of the time varies between runs.
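If you want more stable numbers, one option (just a sketch, not a cure) is to force a full collection right before timing, so each run starts from a similar GC state; BenchmarkTools.jl also helps by taking several samples:

julia> GC.gc()       # force a full collection before timing
julia> @time f(10)

julia> using BenchmarkTools
julia> @btime f(10)  # reports the minimum over many samples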

OK, got it. Thank you very much!

Indeed, TaylorIntegration allocates a lot.

Yet, if you use the macro @taylorize to parse your ODEs, things get much better:

using TaylorIntegration

const μ = 1.0
const q0 = [0.19999999999999996, 0.0, 0.0, 3.0] # an initial condition for elliptical motion
const order = 28
const t0 = 0.0
const t_max = 10*(2π) # we are just taking a wild guess about the period ;)
const abs_tol = 1.0E-20
const steps = 500000

@taylorize function kepler!(dq, q, params, t)
    r_p3d2 = (q[1]^2+q[2]^2)^(3/2)
    
    dq[1] = q[3]
    dq[2] = q[4]
    dq[3] = -(μ*q[1])/r_p3d2  # parenthesis needed to help `@taylorize`
    dq[4] = -(μ*q[2])/r_p3d2
    
    nothing
end

function task()
    t, _ = taylorinteg(kepler!, q0, t0, t_max, order, abs_tol, maxsteps=steps)
    return t[end]
end

function task_noparse()
    t, _ = taylorinteg(kepler!, q0, t0, t_max, order, abs_tol, maxsteps=steps, parse_eqs=false)
    return t[end]
end
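Each function is called once first, so that the timings below are second runs and exclude compilation:

task();          # warm-up run: triggers compilation
task_noparse();  # warm-up run: triggers compilation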

Then I get

@time task()  # second run of task()
  0.003598 seconds (2.31 k allocations: 19.578 MiB)

@time task_noparse() # second run of task_noparse()
  0.144778 seconds (198.42 k allocations: 52.445 MiB, 64.56% gc time)

The allocated memory is reduced by almost a factor of 2.7 (and the number of allocations by nearly two orders of magnitude), and the elapsed time is reduced by a factor of 40. I am using Julia 1.8 and TaylorIntegration v0.9.1.

Quite the late reply, but things did get much better this way! Using this in the f and f_par functions, with 8 threads, I now get about a 4-5x speedup with the parallelized version, as expected. GC also takes about 7% of the time in both cases. Case solved, thank you!

Happy it was helpful! Note that in the latest released version, @taylorize has been improved, especially in managing allocations.
