Trouble Speeding up Vectorized Functions with Parallelism


#1

I am running a 16-core machine and and trying to parallelize some simple vectorized code. As the simplest possible example, below is a piece of test-code in which I just try to parallelize the vectorized exp function. I have noticed that I get much less than 16x speedup (about 5.5x speedup in the best-case). Is there something I am doing wrong? Or is parallelism that involves multiple cores/threads trying to edit memory at once just going to scale poorly in general? Thanks in advance for your help.

addprocs(16)

in=rand(100_000_000)
out=zeros(100_000_000)
sh_in = SharedVector{Float64}(in)
sh_out = SharedVector{Float64}(out)


function simple_update!(out,in)
    out.=in
    out .= exp.(out)
    return nothing
end

function inbounds_update!(out,in)
    out.=in
    @inbounds for ii in 1:length(out)
        out[ii] = exp(out[ii])
    end
    return nothing
end

function para_update!(out,in)
    out.=in
    @inbounds( @sync @parallel for ii in 1:length(out)
        out[ii] = exp(out[ii])
    end )
    return nothing
end


function threads_update!(out,in)
    out.=in
    @inbounds( Threads.@threads for ii in 1:length(out)
        out[ii] = exp(out[ii])
    end)
    return nothing
end


@btime simple_update!($out,$in)
@btime inbounds_update!($out,$in)
@btime para_update!($sh_out,$sh_in)
@btime threads_update!($out,$in)

Results:

simple_update: 1.395 s (0 allocations: 0 bytes)
inbounds_update: 1.458 s (0 allocations: 0 bytes)
para_update: 341.980 ms (4227 allocations: 0 bytes)
threads_update: 267.522 ms (1 allocation: 32 bytes)```

#2

For what it’s worth, with a beefy 1.7 Ghz i3:

julia> input=rand(100_000_000);

julia> output=similar(input);

julia> using Yeppp, BenchmarkTools

julia> @benchmark Yeppp.exp!($output, $input)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     249.144 ms (0.00% GC)
  median time:      250.246 ms (0.00% GC)
  mean time:        252.471 ms (0.00% GC)
  maximum time:     267.331 ms (0.00% GC)
  --------------
  samples:          20
  evals/sample:     1

If your actual application really is dominated by these sorts of functions, I’d recommend Yeppp.

A threaded function following my usual approach, and not using Yeppp because it doesn’t support array views:

julia> function threaded_exp!(y, x)
           nt = Threads.nthreads()
           n = length(y)
           Threads.@threads for i ∈ 1:nt
               r = 1+(i-1)*n÷nt:i*n÷nt
               @views y[r] .= exp.(x[r])
           end
       end
threaded_exp! (generic function with 1 method)

julia> @benchmark threaded_exp!($output, $input)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  1
  --------------
  minimum time:     845.012 ms (0.00% GC)
  median time:      846.430 ms (0.00% GC)
  mean time:        853.568 ms (0.00% GC)
  maximum time:     889.610 ms (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1

This was with 4 threads.

Now running your code (with addprocs(4) and 4 threads) for my machine’s baseline:

julia> @btime simple_update!($out,$in)
  2.163 s (0 allocations: 0 bytes)

julia> @btime inbounds_update!($out,$in)
  2.167 s (0 allocations: 0 bytes)

julia> @btime para_update!($sh_out,$sh_in)
  1.211 s (745 allocations: 29.16 KiB)

julia> @btime threads_update!($out,$in)
  1.345 s (1 allocation: 32 bytes)

Btw, why copy in to out outside the loop, instead of out[i] = exp(in[i]) directly?

Also, is it 16 physical cores, or 8 hyper-threaded to 16?


#3

Thank you for your response. Please find my responses to your questions below.

If your actual application really is dominated by these sorts of functions, I’d recommend Yeppp.

Unfortunately, I can’t use Yeppp! because my application will involve automatic differentiation, and all the AD packages I know of don’t allow calls outside of Julia. If you know of a way to use Yeppp! with AD, please do let me know.

Btw, why copy in to out outside the loop, instead of out[i] = exp(in[i]) directly?

The reason I copy in to out is that in my actual application, I would like to in-place modifications. I could have just passed a single input and just modified it in place, but this might have created problems for benchmarking (because the input would quickly go to infinity). That being said, I realize now that this is the primary source of the scaling issue: a large fraction of the time in each function is devoted to the out.=in, which is done in serial in all versions of the function. When I drop this line and put everything of the form out=exp.(in), I get the following:

Serial Vectorized: 1.467 s (0 allocations: 0 bytes)
Serial Not Vectorized: 1.475 s (0 allocations: 0 bytes)
Parallel: 179.036 ms (4685 allocations allocations: 209.66 KiB)
Threaded (mine): 127.189 ms (1 allocation: 48 bytes)
Threaded (yours): 117.513 ms (1 allocations: 64 bytes)

While not a full 16x speed-up, the scaling is much more reasonable at 12.5x.

Also, is it 16 physical cores, or 8 hyper-threaded to 16?

I am not 100%, but the /proc/cpuinfo has 16 unique physical_ids. I think this suggests that it’s 16 physical cores.

On a related note, your threaded assignment code works quite well. I’m actually a bit surprised that Julia doesn’t automatically do this or that there isn’t a built-in macro that basically implements such a function for assignment (basically a multi-threaded @dot).


#4

It’s also worth mentioning that your threaded code performs basically performs exactly as well as Yeppp.exp!() on my machine.


#5

It does. (I mean there’s no difference between the two threaded versions other than that you are using @inbounds wrong)


#6

Thanks for your comment. I take this to mean that I should have written the function as:

function threads_exp!(y,x)
   Threads.@threads for ii in 1:length(y)
        @inbounds y[ii] = exp(x[ii])
    end
    return nothing
end

Is that correct? It gets similar performance to @Elrod’s code above.

Also, to clarify, what I meant is that i would have expected either Julia to automatically call something like threads_exp!(y,x) any time you did y.=exp.(x) or for this to be common enough that there was a macro like @threads_dot y=exp(x). That being said, I understand very little about macros (evidenced by my inability to use @inbounds correctly), so this may be very silly or difficult for reasons that I don’t understand.


#7

No, that’s a totally reasonable thing to want. For some discussion of essentially what your @threads_dot would mean see: https://github.com/JuliaLang/julia/issues/19777

The issue with using threads automatically is that it assumes your function is thread-safe, which isn’t currently a requirement for regular Julia code.

I think that what @yuyichao means is that @Elrod’s implementation that manually splits up the data into nthreads() pieces is not necessary, as the @threads macro already does that for you. From the docs;

help?> Threads.@threads
  Threads.@threads

  A macro to parallelize a for-loop to run with multiple threads. This spawns nthreads() number of threads, splits the iteration space amongst them, and iterates in parallel. A barrier is placed at the end of the loop which waits for
  all the threads to finish execution, and the loop returns.

so splitting the iteration space across the correct number of threads is already done.


#8

Thanks @rdeits.

I totally understand that if code is not required to be thread-safe that it may cause a lot of problems to give us enough rope to hang ourselves. Still, it seems like it shouldn’t be too difficult to let the user declare whether their functions or code is thread-safe. In fact, if I remember correctly, functions are indexed in some array or similar structure. It seems like one could store an identically shaped array of bits indicating for each function whether it’s thread-safe or not, leaving the responsibility with the function creator (or the user) to modify that safe/not-safe Bool. Then even @dot or broadcast could simply quickly check whether the functions are all thread-safe, and then decide whether to compile a multi-threaded or single-threaded version of the function.

Based on the thread you linked, it sounds like it may be on the to-do list for future Julia versions. If so, I really look forward to it. Easy multi-threading seems like it could lead to first-order gains in terms of adding speed to most peoples’ Julia code.


#9

This does sound like something that could be opted into with a trait in the future. For example:

julia> struct NotThreadSafe; end

julia> struct ThreadSafe; end

julia> threadsafe(::Any) = NotThreadSafe() # conservative default value
threadsafe (generic function with 1 method)

julia> threadsafe(::typeof(sin)) = ThreadSafe() # this function is safe
threadsafe (generic function with 2 methods)

julia> threadsafe(sin)
ThreadSafe()

julia> function my_broadcast(f, x)
         my_broadcast(threadsafe(f), f, x)
       end
my_broadcast (generic function with 1 method)

julia> my_broadcast(::ThreadSafe, f, x) = println("using threads")
my_broadcast (generic function with 2 methods)

julia> my_broadcast(::NotThreadSafe, f, x) = println("no threads here")
my_broadcast (generic function with 3 methods)

This is the kind of trait-based dispatch that’s common in other parts of Julia and should generally not impose any runtime cost.


#10

Implicit multithreading on a problem like this takes a little bit more since you only when to multithread if it’s “sufficiently large”, which can be hard to guess and you’d have to choose a route depending on non-type information. If you wanted to hardcode a vectorized exp!function this is something that’s simple to do, but in an arbitrary dispatch mechanism that is a more difficult problem to handle. I think allowing @threads to explicitly turn on threading in a broadcast call would be a good first step though.


#11

It’s worth pointing out that Yeppp is run serially – so you should be able to get a similar factor of speed up over what Yeppp is giving you now as you did with regular exp, assuming you can break up the computations well (and run Yeppp vectorized – it is slower when run serially.)

As for AD, which library are you using?
Disclaimer: I haven’t done this before. But…

DiffRules defines derivatives and intrinsics here:


The macro used for adding diffulres updates a dictionary called DEFINED_DIFFRULES:

So, perhaps this would work: use DiffRules, define them, and then use ForwardDiff. And rerun this if you add more diffrules after using ForwardDiff:

for (M, f, arity) in DiffRules.diffrules()
    in((M, f), ((:Base, :^), (:NaNMath, :pow), (:Base, :/), (:Base, :+), (:Base, :-))) && continue
    if arity == 1
        eval(unary_dual_definition(M, f))
    elseif arity == 2
        eval(binary_dual_definition(M, f))
    else
        error("ForwardDiff currently only knows how to autogenerate Dual definitions for unary and binary functions.")
    end
end

#12

Just for completeness, here are my numbers on a 4 core 4 hyper threads machine on windows:


julia> @btime Yeppp.exp!($x, $y);
  103.867 ms (0 allocations: 0 bytes)

julia> jlexp!(x, y) = x .= exp.(y)
julia> @btime jlexp!($x, $y);
  91.063 ms (0 allocations: 0 bytes)

julia> function threads_exp!(x, y)
          Threads.@threads for ii in 1:length(y)
               @inbounds y[ii] = exp(x[ii])
           end
           return nothing
       end
julia> @btime threads_exp!($x, $y);
  24.452 ms (1 allocation: 48 bytes)

So Julia threadded works pretty well :slight_smile:


#13

Interesting that your Yeppp was the slowest. Now on a 6-core FX-6300, I still have that serial Yeppp is faster than the threaded version (and definitely not slower than the broadcast exp):

julia> @btime Yeppp.exp!($x, $y);
  309.897 ms (0 allocations: 0 bytes)

julia> @btime jlexp!($x, $y);
  1.614 s (0 allocations: 0 bytes)

julia> @btime threads_exp!($x, $y);
  370.956 ms (1 allocation: 48 bytes)

edit:
Interestingly, on a 16 core (32 total thread) Threadripper:

julia> @btime Yeppp.exp!($x, $y);
  878.023 ms (0 allocations: 0 bytes)

julia> @btime jlexp!($x, $y);
  843.170 ms (0 allocations: 0 bytes)

julia> @btime threads_exp!($x, $y);
  46.282 ms (1 allocation: 48 bytes)

Which more or less mirrors you results.
The threadripper normally has much better single core performance than my other two computers (as we can see with 0.8 vs 1.6 seconds on the broadcast version), but Yeppp is slow.
Yeppp seems really hit-or-miss when it comes to processor architecture?

For reference, I let:

y = rand(100_000_000); x = similar(y);

and swapped x and y in threads_exp! (so it says @inbounds x[ii] = exp(y[ii])).