I am running a 16-core machine and trying to parallelize some simple vectorized code. As the simplest possible example, below is a piece of test code in which I just try to parallelize the vectorized exp function. I have noticed that I get much less than a 16x speedup (about 5.5x in the best case). Is there something I am doing wrong? Or is parallelism that involves multiple cores/threads trying to edit memory at once just going to scale poorly in general? Thanks in advance for your help.
using BenchmarkTools

addprocs(16)
in = rand(100_000_000)
out = zeros(100_000_000)
sh_in = SharedVector{Float64}(in)
sh_out = SharedVector{Float64}(out)
function simple_update!(out, in)
    out .= in
    out .= exp.(out)
    return nothing
end

function inbounds_update!(out, in)
    out .= in
    @inbounds for ii in 1:length(out)
        out[ii] = exp(out[ii])
    end
    return nothing
end

function para_update!(out, in)
    out .= in
    @inbounds( @sync @parallel for ii in 1:length(out)
        out[ii] = exp(out[ii])
    end )
    return nothing
end

function threads_update!(out, in)
    out .= in
    @inbounds( Threads.@threads for ii in 1:length(out)
        out[ii] = exp(out[ii])
    end )
    return nothing
end
@btime simple_update!($out,$in)
@btime inbounds_update!($out,$in)
@btime para_update!($sh_out,$sh_in)
@btime threads_update!($out,$in)
Results:
simple_update: 1.395 s (0 allocations: 0 bytes)
inbounds_update: 1.458 s (0 allocations: 0 bytes)
para_update: 341.980 ms (4227 allocations: 0 bytes)
threads_update: 267.522 ms (1 allocation: 32 bytes)
julia> input=rand(100_000_000);
julia> output=similar(input);
julia> using Yeppp, BenchmarkTools
julia> @benchmark Yeppp.exp!($output, $input)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 249.144 ms (0.00% GC)
median time: 250.246 ms (0.00% GC)
mean time: 252.471 ms (0.00% GC)
maximum time: 267.331 ms (0.00% GC)
--------------
samples: 20
evals/sample: 1
If your actual application really is dominated by these sorts of functions, I'd recommend Yeppp.
A threaded function following my usual approach, and not using Yeppp because it doesn't support array views:
julia> function threaded_exp!(y, x)
           nt = Threads.nthreads()
           n = length(y)
           Threads.@threads for i ∈ 1:nt
               r = 1+(i-1)*n÷nt:i*n÷nt
               @views y[r] .= exp.(x[r])
           end
       end
threaded_exp! (generic function with 1 method)
julia> @benchmark threaded_exp!($output, $input)
BenchmarkTools.Trial:
memory estimate: 64 bytes
allocs estimate: 1
--------------
minimum time: 845.012 ms (0.00% GC)
median time: 846.430 ms (0.00% GC)
mean time: 853.568 ms (0.00% GC)
maximum time: 889.610 ms (0.00% GC)
--------------
samples: 6
evals/sample: 1
This was with 4 threads.
Now running your code (with addprocs(4) and 4 threads) for my machine's baseline:
julia> @btime simple_update!($out,$in)
2.163 s (0 allocations: 0 bytes)
julia> @btime inbounds_update!($out,$in)
2.167 s (0 allocations: 0 bytes)
julia> @btime para_update!($sh_out,$sh_in)
1.211 s (745 allocations: 29.16 KiB)
julia> @btime threads_update!($out,$in)
1.345 s (1 allocation: 32 bytes)
Btw, why copy in to out outside the loop, instead of out[i] = exp(in[i]) directly?
Also, is it 16 physical cores, or 8 hyper-threaded to 16?
Thank you for your response. Please find my responses to your questions below.
If your actual application really is dominated by these sorts of functions, I'd recommend Yeppp.
Unfortunately, I can't use Yeppp! because my application will involve automatic differentiation, and all the AD packages I know of don't allow calls outside of Julia. If you know of a way to use Yeppp! with AD, please do let me know.
Btw, why copy in to out outside the loop, instead of out[i] = exp(in[i]) directly?
The reason I copy in to out is that in my actual application I would like to do in-place modifications. I could have just passed a single input and modified it in place, but that might have created problems for benchmarking (because the input would quickly go to infinity). That being said, I realize now that this is the primary source of the scaling issue: a large fraction of the time in each function is devoted to the out .= in, which is done in serial in all versions of the function. When I drop this line and write everything in the form out .= exp.(in), I get the following:
Serial Vectorized: 1.467 s (0 allocations: 0 bytes)
Serial Not Vectorized: 1.475 s (0 allocations: 0 bytes)
Parallel: 179.036 ms (4685 allocations: 209.66 KiB)
Threaded (mine): 127.189 ms (1 allocation: 48 bytes)
Threaded (yours): 117.513 ms (1 allocation: 64 bytes)
While not a full 16x speed-up, the scaling is much more reasonable at 12.5x.
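For reference, here is a rough sketch of what the revised @parallel version looks like once the copy is dropped (still operating on the SharedVectors as before); the threaded version I ended up with is further down:
function para_exp!(out, in)
    # read from in and write to out directly; no serial out .= in copy
    @sync @parallel for ii in 1:length(out)
        @inbounds out[ii] = exp(in[ii])
    end
    return nothing
end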
Also, is it 16 physical cores, or 8 hyper-threaded to 16?
I am not 100% sure, but /proc/cpuinfo has 16 unique physical ids. I think this suggests that it's 16 physical cores.
On a related note, your threaded assignment code works quite well. I'm actually a bit surprised that Julia doesn't do this automatically, or that there isn't a built-in macro that implements such a function for assignment (basically a multi-threaded @dot).
Thanks for your comment. I take this to mean that I should have written the function as:
function threads_exp!(y, x)
    Threads.@threads for ii in 1:length(y)
        @inbounds y[ii] = exp(x[ii])
    end
    return nothing
end
Is that correct? It gets similar performance to @Elrod's code above.
Also, to clarify, what I meant is that I would have expected either Julia to automatically call something like threads_exp!(y,x) any time you did y .= exp.(x), or for this to be common enough that there was a macro like @threads_dot y=exp(x). That being said, I understand very little about macros (evidenced by my inability to use @inbounds correctly), so this may be very silly or difficult for reasons that I don't understand. A rough sketch of what I have in mind is below.
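To make the idea concrete, here is a purely illustrative sketch of the kind of macro I am imagining. @threaded_assign is a made-up name (no such macro exists), and for simplicity it takes the function and the two arrays as separate arguments rather than a whole y=exp(x) expression:
# Illustrative only: a minimal threaded elementwise-assignment macro.
macro threaded_assign(f, y, x)
    quote
        Threads.@threads for i in 1:length($(esc(y)))
            @inbounds $(esc(y))[i] = $(esc(f))($(esc(x))[i])
        end
        $(esc(y))
    end
end

# usage (roughly equivalent to y .= exp.(x), but threaded):
# @threaded_assign exp y x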
The issue with using threads automatically is that it assumes your function is thread-safe, which isn't currently a requirement for regular Julia code.
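As a purely illustrative example (the names here are made up), a function can be perfectly valid serial Julia code and still break if a broadcast over it were silently threaded:
const call_log = Int[]          # shared global state

function logged_exp(x)
    push!(call_log, 1)          # push! on a shared Vector races across threads
    return exp(x)
end

# y .= logged_exp.(x) is fine serially; an automatically threaded broadcast over
# logged_exp could corrupt call_log or crash, and nothing marks the function as unsafe.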
I think that what @yuyichao means is that @Elrod's implementation that manually splits up the data into nthreads() pieces is not necessary, as the @threads macro already does that for you. From the docs:
help?> Threads.@threads
Threads.@threads
A macro to parallelize a for-loop to run with multiple threads. This spawns nthreads() number of threads, splits the iteration space amongst them, and iterates in parallel. A barrier is placed at the end of the loop which waits for all the threads to finish execution, and the loop returns.
so splitting the iteration space across the correct number of threads is already done.
I totally understand that, since code is not required to be thread-safe, giving us enough rope to hang ourselves could cause a lot of problems. Still, it seems like it shouldn't be too difficult to let the user declare whether their functions or code are thread-safe. In fact, if I remember correctly, functions are indexed in some array or similar structure. It seems like one could store an identically shaped array of bits indicating for each function whether it's thread-safe or not, leaving the responsibility with the function creator (or the user) to modify that safe/not-safe Bool. Then even @dot or broadcast could simply check whether the functions are all thread-safe, and then decide whether to compile a multi-threaded or single-threaded version of the function.
Based on the thread you linked, it sounds like it may be on the to-do list for future Julia versions. If so, I really look forward to it. Easy multi-threading seems like it could lead to first-order gains in terms of adding speed to most people's Julia code.
This does sound like something that could be opted into with a trait in the future. For example:
julia> struct NotThreadSafe; end
julia> struct ThreadSafe; end
julia> threadsafe(::Any) = NotThreadSafe() # conservative default value
threadsafe (generic function with 1 method)
julia> threadsafe(::typeof(sin)) = ThreadSafe() # this function is safe
threadsafe (generic function with 2 methods)
julia> threadsafe(sin)
ThreadSafe()
julia> function my_broadcast(f, x)
           my_broadcast(threadsafe(f), f, x)
       end
my_broadcast (generic function with 1 method)
julia> my_broadcast(::ThreadSafe, f, x) = println("using threads")
my_broadcast (generic function with 2 methods)
julia> my_broadcast(::NotThreadSafe, f, x) = println("no threads here")
my_broadcast (generic function with 3 methods)
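For illustration, calling the sketch above with a marked and an unmarked function picks the branch at dispatch time (exp has not been marked here, so it falls back to the conservative default):
julia> my_broadcast(sin, rand(10))
using threads

julia> my_broadcast(exp, rand(10))
no threads here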
This is the kind of trait-based dispatch that's common in other parts of Julia and should generally not impose any runtime cost.
Implicit multithreading on a problem like this takes a little bit more, since you only want to multithread if the input is "sufficiently large", which can be hard to guess, and you'd have to choose a route depending on non-type information. If you wanted to hardcode a vectorized exp! function this is something that's simple to do, but in an arbitrary dispatch mechanism it is a more difficult problem to handle. I think allowing @threads to explicitly turn on threading in a broadcast call would be a good first step though.
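A rough sketch of that run-time size check (the 100_000 cutoff is an arbitrary guess for illustration, not a measured crossover point):
function maybe_threaded_exp!(y, x, cutoff = 100_000)
    if length(y) < cutoff
        # small input: threading overhead is probably not worth it
        @inbounds for i in 1:length(y)
            y[i] = exp(x[i])
        end
    else
        # large input: split the loop across threads
        Threads.@threads for i in 1:length(y)
            @inbounds y[i] = exp(x[i])
        end
    end
    return y
end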
It's worth pointing out that Yeppp is run serially, so you should be able to get a similar factor of speedup over what Yeppp is giving you now as you did with regular exp, assuming you can break up the computations well (and keep the Yeppp calls vectorized; it is slower when applied element by element).
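One way to break up the work might look like the untested sketch below. It assumes Yeppp's kernels can safely be called from several Julia threads at once (which I have not verified), and it uses unsafe_wrap to hand Yeppp plain Vectors, since it does not accept views:
using Yeppp

function threaded_yeppp_exp!(y, x)
    nt = Threads.nthreads()
    n  = length(y)
    Threads.@threads for t in 1:nt
        lo  = 1 + (t - 1) * (n ÷ nt)
        hi  = t == nt ? n : t * (n ÷ nt)   # last chunk picks up any remainder
        len = hi - lo + 1
        # wrap each chunk as a plain Vector (no copy) so Yeppp.exp! will accept it
        yc = unsafe_wrap(Array, pointer(y, lo), len)
        xc = unsafe_wrap(Array, pointer(x, lo), len)
        Yeppp.exp!(yc, xc)
    end
    return y
end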
As for AD, which library are you using?
Disclaimer: I haven't done this before. But…
So, perhaps this would work: use DiffRules to define the rules, and then use ForwardDiff. And rerun this if you add more diffrules after loading ForwardDiff:
using DiffRules

# unary_dual_definition and binary_dual_definition are the (unexported) helpers inside
# ForwardDiff that generate Dual-number methods from DiffRules entries.
for (M, f, arity) in DiffRules.diffrules()
    in((M, f), ((:Base, :^), (:NaNMath, :pow), (:Base, :/), (:Base, :+), (:Base, :-))) && continue
    if arity == 1
        eval(unary_dual_definition(M, f))
    elseif arity == 2
        eval(binary_dual_definition(M, f))
    else
        error("ForwardDiff currently only knows how to autogenerate Dual definitions for unary and binary functions.")
    end
end
Interesting that your Yeppp was the slowest. On a 6-core FX-6300, serial Yeppp is still faster than the threaded version (and definitely not slower than the broadcast exp):
julia> @btime Yeppp.exp!($x, $y);
309.897 ms (0 allocations: 0 bytes)
julia> @btime jlexp!($x, $y);
1.614 s (0 allocations: 0 bytes)
julia> @btime threads_exp!($x, $y);
370.956 ms (1 allocation: 48 bytes)
edit:
Interestingly, on a 16 core (32 total thread) Threadripper:
julia> @btime Yeppp.exp!($x, $y);
878.023 ms (0 allocations: 0 bytes)
julia> @btime jlexp!($x, $y);
843.170 ms (0 allocations: 0 bytes)
julia> @btime threads_exp!($x, $y);
46.282 ms (1 allocation: 48 bytes)
Which more or less mirrors your results.
The Threadripper normally has much better single-core performance than my other two computers (as we can see with 0.8 vs 1.6 seconds on the broadcast version), but Yeppp is slow.
Yeppp seems really hit-or-miss when it comes to processor architecture?
For reference, I let:
y = rand(100_000_000); x = similar(y);
and swapped x and y in threads_exp! (so it says @inbounds x[ii] = exp(y[ii])).