Thread setup time eating up all parallel improvement?

Hi,
Just experimenting a bit with parallel execution, using this example here: a straightforward parallel-and-collect of results.
The issue is that it’s not really improving wallclock time, although it’s spending less time in the measured part. Here are the measurements:

 jmerelo@camerelo  ~/Code/julia/BraveNewAlgorithm.jl   main  julia --threads 2 examples/multi_only_crossover.jl
  Activating project at `~/Code/julia/BraveNewAlgorithm.jl`
WARNING: using Distances.pairwise in module BraveNewAlgorithm conflicts with an existing identifier.
[ Info: Number of threads -> 2
[ Info: Reading parameters file
  0.671818 seconds (15.27 M allocations: 718.511 MiB, 60.20% gc time, 16.01% compilation time)
[ Info: All offspring -> 1000000
 jmerelo@camerelo  ~/Code/julia/BraveNewAlgorithm.jl   main  time julia --threads 2 examples/multi_only_crossover.jl
  Activating project at `~/Code/julia/BraveNewAlgorithm.jl`
WARNING: using Distances.pairwise in module BraveNewAlgorithm conflicts with an existing identifier.
[ Info: Number of threads -> 2
[ Info: Reading parameters file
  0.672166 seconds (15.27 M allocations: 718.561 MiB, 60.24% gc time, 15.95% compilation time)
[ Info: All offspring -> 1000000
julia --threads 2 examples/multi_only_crossover.jl  7,36s user 0,35s system 117% cpu 6,559 total
 jmerelo@camerelo  ~/Code/julia/BraveNewAlgorithm.jl   main  time julia --threads 4 examples/multi_only_crossover.jl
  Activating project at `~/Code/julia/BraveNewAlgorithm.jl`
WARNING: using Distances.pairwise in module BraveNewAlgorithm conflicts with an existing identifier.
[ Info: Number of threads -> 4
[ Info: Reading parameters file
  0.545097 seconds (15.29 M allocations: 718.505 MiB, 63.35% gc time, 51.88% compilation time)
[ Info: All offspring -> 1000000
julia --threads 4 examples/multi_only_crossover.jl  7,69s user 0,35s system 130% cpu 6,178 total
 jmerelo@camerelo  ~/Code/julia/BraveNewAlgorithm.jl   main  time julia --threads 8 examples/multi_only_crossover.jl
  Activating project at `~/Code/julia/BraveNewAlgorithm.jl`
WARNING: using Distances.pairwise in module BraveNewAlgorithm conflicts with an existing identifier.
[ Info: Number of threads -> 8
[ Info: Reading parameters file
  0.448228 seconds (15.29 M allocations: 717.147 MiB, 61.91% gc time, 73.29% compilation time)
[ Info: All offspring -> 1000000
julia --threads 8 examples/multi_only_crossover.jl  8,51s user 0,38s system 148% cpu 5,974 total

So it looks like the small improvement is being eaten up by thread setup, which seems to need circa 0.2 seconds per thread. Is that a ballpark estimate that checks out, or am I getting something wrong here?

Writing code that executes in parallel is always a challenge… Before doing that, always try to optimize single-threaded execution:

  • reduce allocations (AI can help you rewrite small functions so that they are allocation-free)
  • use StaticArrays for small arrays
  • access arrays in memory order
  • write SIMD (single instruction, multiple data) friendly code; for example, avoid if-then statements and use ifelse instead of if for element-wise conditions (see the sketch after this list)
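
To illustrate the last point, here is a minimal sketch (the clamp kernels are just illustrative, not from your package). In a case this simple LLVM may vectorize both versions anyway, but ifelse makes the branch-free intent explicit:

function clamp_branchy!(y, x)
    @inbounds for i in eachindex(y, x)
        if x[i] < 0.0
            y[i] = 0.0
        else
            y[i] = x[i]
        end
    end
    return y
end

function clamp_simd!(y, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = ifelse(x[i] < 0.0, 0.0, x[i])   # no branch: both sides are evaluated
    end
    return y
end

x = randn(10^6)
clamp_simd!(similar(x), x)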

You will always have some one-time overhead. Whether parallel execution is worth the one-time cost always depends on the problem. 200 ms to set up one thread sounds like a lot, though.

1 Like

Thanks a lot for your answer; however, what you mention would seem to help improve the speed-up with the number of threads (which I apparently need anyway), but the issue here seems to be that wallclock time is barely improving with the number of threads, while CPU time goes up.
It could be that garbage collection is taking place outside the loop, and it needs to take care of all the task results that are being fetched; the larger they are, the more time is devoted to GC… So at the end of the day, the problem could be that @spawn is not designed for tasks that produce large outputs…
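
For reference, this is roughly the pattern in question, as a runnable sketch (fake_crossover is just a stand-in for the real per-task work):

using Base.Threads

# Spawn tasks that each build a large result, then fetch and merge.
# Everything fetched here is freshly allocated and becomes work for the GC.
fake_crossover(n) = [rand(64) for _ in 1:n]

function spawn_and_collect(chunk_sizes)
    tasks = [Threads.@spawn fake_crossover(n) for n in chunk_sizes]
    return reduce(vcat, fetch.(tasks))
end

spawn_and_collect(fill(100_000, Threads.nthreads()))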

You could also try Polyester.@batch. See: GitHub - JuliaSIMD/Polyester.jl: The cheapest threads you can find!

No idea if that is applicable to your problem, but if it is, it has much lower overhead.
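
A minimal sketch of the @batch form (the square! kernel is just illustrative):

using Polyester

# @batch reuses a fixed pool of lightweight tasks instead of spawning new ones.
function square!(out, x)
    @batch for i in eachindex(out, x)
        out[i] = x[i]^2
    end
    return out
end

x = rand(10^6)
square!(similar(x), x)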

Not really; I’ve made it return only the size of the generated result and there’s barely any variation. Still, going from 4 to 8 threads improves the loop part by 0.04 seconds but worsens wallclock time by 0.4 seconds. There’s no variation in allocations either, in size or percentage.

So what percentage of GC time do you get when solving your problem single-threaded?

I would recommend using Threads.@threads, and putting your code into a function. Benchmarking in global scope is rarely a good indicator of true performance.
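
A minimal sketch of what I mean, with a hypothetical one_crossover standing in for the real kernel: the loop lives inside a function and every iteration writes to its own pre-allocated slot, so nothing gets push!ed from several threads at once.

using Base.Threads

one_crossover(len) = rand(len)   # placeholder for the real work

function all_offspring_threaded(n, len)
    out = Vector{Vector{Float64}}(undef, n)   # pre-allocated result container
    @threads for i in 1:n
        out[i] = one_crossover(len)
    end
    return out
end

all_offspring_threaded(10, 64)              # warm-up run (compilation)
@time all_offspring_threaded(100_000, 64)   # second call measures runtime only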

2 Likes

Pretty much the same, around 50%; it ranges from 50% to 60% depending on the number of threads, without clearly going up or down.

Well, if you have 50% GC time, don’t expect any advantage of using multiple threads. Bring it down to 15% and try again.

You could also play with this Julia parameter:

 --gcthreads=N[,M]         Use N threads for the mark phase of GC and M (0 or 1) threads for the concurrent sweeping phase of GC.
                           N is set to half of the number of compute threads and M is set to 0 if unspecified.

2 Likes

That’s not earning me much…

 jmerelo@camerelo  ~/Code/julia/BraveNewAlgorithm.jl   main  time julia -t 4 --gcthreads=4 examples/multi_only_crossover.jl
  Activating project at `~/Code/julia/BraveNewAlgorithm.jl`
WARNING: using Distances.pairwise in module BraveNewAlgorithm conflicts with an existing identifier.
[ Info: Number of threads -> 4
[ Info: Reading parameters file
  0.252001 seconds (6.29 M allocations: 296.930 MiB, 43.53% gc time, 78.28% compilation time)
[ Info: All offspring -> 400000
julia -t 4 --gcthreads=4 examples/multi_only_crossover.jl  7,47s user 0,29s system 134% cpu 5,767 total
 jmerelo@camerelo  ~/Code/julia/BraveNewAlgorithm.jl   main  time julia -t 8 --gcthreads=8 examples/multi_only_crossover.jl
  Activating project at `~/Code/julia/BraveNewAlgorithm.jl`
WARNING: using Distances.pairwise in module BraveNewAlgorithm conflicts with an existing identifier.
[ Info: Number of threads -> 8
[ Info: Reading parameters file
  0.231256 seconds (6.29 M allocations: 295.976 MiB, 44.08% gc time, 174.02% compilation time)
[ Info: All offspring -> 400000
julia -t 8 --gcthreads=8 examples/multi_only_crossover.jl  8,24s user 0,29s system 149% cpu 5,712 total

Maybe reducing GC time below 50% is all.

Maybe you could try Julia 1.10, which seems to do better on GC times; see GC consumes more time in 1.11 than in 1.10 · Issue #59116 · JuliaLang/julia · GitHub.

Also, you have some compilation time in there. Ideally, you should keep the same session for multiple runs so that you don’t measure compilation time (which can also impact GC time). If you want to run it from the terminal each time, either create a system image for your code or at least precompile the workload.
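
For the precompiled-workload option, here is a sketch with PrecompileTools.jl (this goes inside the package module; crossover_demo is a stand-in for your real entry point):

module MyPackage

using PrecompileTools

crossover_demo(n) = sum(abs2, rand(n))   # placeholder for the real workload

@setup_workload begin
    @compile_workload begin
        crossover_demo(10)   # compiled at precompile time, not at first call
    end
end

end # module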

2 Likes

Sorry, how can I do that?
I reduced the compile time through a small trick: calling the function outside the threads on a dummy workload; before that, it went up to several hundred percent.

For proper measurement you should use a benchmarking package like BenchmarkTools.jl or Chairmarks.jl. These will ensure that you don’t measure compilation times.
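
For example, a quick sketch with BenchmarkTools.jl (Chairmarks.jl offers a similar one-liner):

using BenchmarkTools

# `$` interpolates the argument so its construction is not part of the timing;
# @btime runs the expression many times and excludes compilation.
xs = rand(10^6)
@btime sum(abs2, $xs)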

In general, thread setup time in Julia is not very significant, but garbage collection with parallel tasks can be a challenge. Sometimes allocations are made inadvertently (which leads to GC) and can be avoided by simply rewriting the code. There are various other ways to avoid allocations, e.g. using pre-allocated arrays, StaticArrays.jl, Bumper.jl, or task-local storage. If you can provide an example of your code, we might suggest improvements.
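
As an illustration of the StaticArrays.jl point, a sketch with a hypothetical single-point recombination on an 8-gene chromosome:

using StaticArrays

# Small fixed-size SVectors are stack-friendly immutables, so this helper allocates nothing.
recombine(a::SVector{8,Float64}, b::SVector{8,Float64}, k::Int) =
    SVector{8,Float64}(ntuple(i -> i <= k ? a[i] : b[i], 8))

a = @SVector rand(8)
b = @SVector rand(8)
recombine(a, b, 3)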

1 Like

Thanks! It’s here. I’m already acting on some of the issues.

I had a quick look at your code; here are the changes I think you should make.
The biggest issue is:

struct Embryo{T<:AbstractArray,N<:Number}
    chromosome::T
    f_value::N
end

struct Individual{T<:AbstractArray, N<:Number, C<:Caste}
    chromosome::T
    f_value::N
    caste::C
end

Reduce allocations here and fuse operations:


@views function crossover_operator(parents, cache1, cache2)
    chromosome_length = length(parents[1].chromosome)
    if length(parents[1].chromosome) != length(parents[2].chromosome)
        throw(ArgumentError("The length of the parents[1] chromosome ($(length(parents[1].chromosome))) is different from the length of the parents[2] chromosome ($(length(parents[2].chromosome)))"))
    end

    start_of_the_segment = rand(1:chromosome_length-1)
    segment_length = rand(1:chromosome_length-start_of_the_segment)
    end_of_segment = start_of_the_segment + segment_length

    segment = start_of_the_segment:end_of_segment
    offspring1 = cache1
    offspring2 = cache2
    # offspring = Matrix{Float64}(undef,chromosome_length,2)
    @inbounds for i in eachindex(offspring1, offspring2)
        if i in segment
            offspring1[i] = parents[2].chromosome[i]
            offspring2[i] = parents[1].chromosome[i]
        else
            offspring1[i] = parents[1].chromosome[i]
            offspring2[i] = parents[2].chromosome[i]
        end
    end
    return (offspring1, offspring2)
end

Better to give the vector a concrete element type, since we can:

function get_offspring(embryos)
    r = 1:2:length(embryos)-1
    all_offspring = Vector{Vector{Float64}}(undef,2*length(r))
    c = 1
    cache1 = zeros(length(embryos[1].chromosome))
    cache2 = zeros(length(embryos[1].chromosome))
    for i in r
        j = i + 1
        parents = (
            Individual(embryos[i].chromosome, embryos[i].f_value, ALPHA()),
            Individual(embryos[j].chromosome, embryos[j].f_value, ALPHA())
        )
        offspring = crossover_operator(parents, cache1, cache2)
        all_offspring[c] = offspring[1]
        all_offspring[c+1] = offspring[2]
        c += 2
    end
    return all_offspring
end

All this should give you roughly 10X better time and 7 allocations single-threaded. It seems like it’s better single-threaded for now, but for a large enough population multithreading may still be worthwhile.

1 Like

I find it difficult to understand exactly what code you are running, but the second line below looks like a smoking gun:

function get_offspring(embryos)
    all_offspring = []  # <- very bad untyped container

Incrementally push!ing to an untyped container seems like it would have bad consequences in a multithreaded context (and also in single-threaded ones).
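
A minimal illustration (the names are just for this sketch): with a Vector{Any} the compiler cannot infer what comes back out, so downstream code pays for dynamic dispatch; a concretely typed vector avoids that.

function total_untyped(n)
    v = []                           # Vector{Any}
    for _ in 1:n
        push!(v, rand(8))
    end
    return sum(sum, v)               # element type unknown here
end

function total_typed(n)
    v = Vector{Vector{Float64}}()    # concrete element type
    for _ in 1:n
        push!(v, rand(8))
    end
    return sum(sum, v)               # fully inferred
end

total_untyped(10); total_typed(10)   # warm-up
@time total_untyped(10^5)
@time total_typed(10^5)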

3 Likes

I agree with @yolhan_mannes. Abstract fields in structs are very bad for performance.

The reason is that functions are compiled when the types of the actual arguments are known. But with abstract fields, the full field types are not known at compile time; i.e. a Number can be a Float64 or an Int or even a Complex. Because of this, the compiler is forced to check the actual type at runtime, sometimes at every iteration of a loop. This check takes time, and allocates memory.

The solution is either to make the field types into parameters of the struct, or in some cases, like f_value, to stick to e.g. Float64 if that is feasible.

You may still run into problems if you have collections of parametric structs and the parameters vary:

struct A{T <: Real}
    value::T
end

arr = A[]
push!(arr, A(23))
push!(arr, A(42.3))
....

If you now have a loop:

for a in arr
    <dosomethingwith a>
end

you run into problems, because the type of a is not fully known at compile time; it can be an A{Int} or an A{Float64}. Sometimes such constructions can’t be avoided, but there is a trick to alleviate the problem, known as a “function barrier”. It works well if each iteration takes some time:

for a in arr
    dosomethingwith(a)
end

I.e. put the entire content of the loop into a function, to which you pass the loop variable a. There is still a lookup to find the full type of a, but an instance of dosomethingwith is called with full knowledge of the type, so it will be optimally compiled.

By the way, you can check such type instabilities with

julia> myfun(x) = (x == 0) ? 0 : 1/x
myfun (generic function with 1 method)

julia> @code_warntype myfun(2.3)
MethodInstance for myfun(::Float64)
  from myfun(x) @ Main REPL[1]:1
Arguments
  #self#::Core.Const(Main.myfun)
  x::Float64
Body::Union{Float64, Int64}
1 ─ %1 = Main.:(==)::Core.Const(==)
│   %2 = (%1)(x, 0)::Bool
└──      goto #3 if not %2
2 ─      return 0
3 ─ %5 = Main.:/::Core.Const(/)
│   %6 = (%5)(1, x)::Float64
└──      return %6

Yellow is sometimes OK, red is bad (the colors don’t come through here on Discourse).

1 Like

There is another thing which may be a problem. I don’t know if it applies here, because I don’t know how things are used.

Your Caste abstract type with the five subtypes ALPHA .. EPSILON is nice, but may be inefficient. (I’ve had a similar structure in an entirely different context.) The castes differ only by name, yet they are different types. This can cause dispatch performance problems if you have a collection of different castes. If it does cause problems, it might be better to collect them into a common concrete type, with either an enum or an Int to distinguish them:

@enum Castes Alpha Beta Gamma Delta Epsilon 
struct Caste
    id::Castes
end
ALPHA() = Caste(Alpha)
BETA() = Caste(Beta)
...

Now, this is awkward, and more complicated than the elegant subtype solution. And you’ll have to write a function to get the name as an uppercase string:

castename(c::Caste) = uppercase(string(c.id))
# or
@inline Base.getproperty(c::Caste, sym::Symbol) = (sym === :name) ? uppercase(string(getfield(c, :id))) : getfield(c, sym)

The latter allows the syntax c.name, if the interface must be preserved. Either way, this avoids type instabilities if you have collections of different castes that you need to iterate over.

Manually writing if ... elseif ... elseif to get different behaviour for different castes is more work than using the dispatch mechanism, but it’s typically faster, because dynamic dispatch is too general (it has to account for someone creating more subtypes of Caste with complicated parameters, even if that isn’t supposed to happen).
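
For instance, building on the enum-backed Caste above (selection_pressure is just an illustrative function, not part of your package):

selection_pressure(c::Caste) =
    if c.id === Alpha
        1.0
    elseif c.id === Beta
        0.8
    else
        0.5
    end

castes = [Caste(Alpha), Caste(Beta), Caste(Gamma)]   # concrete element type
sum(selection_pressure, castes)                      # no dynamic dispatch in the loop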

1 Like

I actually don’t need the names, just the enums. I’m refactoring pretty extensively right now and will take this into account. Thanks!

Posted a PR with the performance changes; for some reason the tests didn’t run, so I fixed that and you can now just run ]test.

1 Like