I just set up a new machine with a CUDA GPU and started playing with the getting-started examples from CuArrays.jl when I ran into a strange problem, not with CuArrays itself, but with BenchmarkTools and Threads.
The following code is a slightly modified version of the example from CuArrays:
using BenchmarkTools

N = 2^20
x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)  # a vector filled with 2.0

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

@btime sequential_add!(y, x)

function parallel_add!(y, x)
    Threads.@spawn for i in eachindex(y, x)  # the original Threads.@threads for i in eachindex(y, x) works
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2.0f0)
@btime parallel_add!(y, x)
On my machine (Windows 10, Julia 1.3, official binaries), this quite reproducibly hangs Julia on the last line. Well, at least the CPU jumps to ~60% for 30-40 seconds, then drops to ~30% and stays there for a minute or two until my patience runs out and I kill the process.
Using @time instead of @btime or @benchmark works fine, as does returning the modified line in parallel_add! to its original form of Threads.@threads instead of Threads.@spawn. It seems to be only the combination of @spawn and @btime or @benchmark that causes the “hang”.
Could anyone please either explain what is happening, or point me to somewhere I can read up on this?
Thanks!
You are spawning one task for every element in the array, and each task then does a single addition. That will cause a tremendous overhead! Maybe you meant to use the Threads.@threads macro, which divides the loop into equal chunks and runs the chunks in parallel. Otherwise, you need to have more work per task for it to be beneficial.
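For illustration, a minimal sketch of what chunking the work manually with @spawn might look like (the helper name chunked_add! and the one-chunk-per-thread split are my own, not from this thread):

using Base.Threads

# Sketch: split the index range into one chunk per thread and
# spawn a task per chunk; @sync waits for all spawned tasks.
function chunked_add!(y, x)
    chunk = cld(length(y), nthreads())
    @sync for r in Iterators.partition(eachindex(y, x), chunk)
        @spawn for i in r
            @inbounds y[i] += x[i]
        end
    end
    return nothing
end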
Well, that certainly wasn’t the intention… A little reading on my part seems to be required so I don’t tie myself into knots with this again. Thanks!
Actually, rereading a bit more carefully, I might have been wrong. I wonder if this shouldn’t just have spawned one task that did the whole loop?
Just to clarify though: why does this apparently not affect @time, but only @btime? Is it the multiple executions triggered by @btime?
Ok, giving it another try. I think what happens here is that nothing waits on the task you spawn, so the function returns immediately, and BenchmarkTools creates a huge number of tasks in its benchmark loop because it cannot know how long a task actually takes to run.
Something like
function parallel_add!(y, x)
    t = Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    wait(t)  # block until the spawned task has finished
    return nothing
end
seems to work.
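One way to see the difference is to compare wall times directly (a sketch; spawn_only! is a hypothetical name for the original unwaited version, defined here only for illustration):

# Sketch: the unwaited version only pays the cost of creating the
# task, so it returns long before the additions have finished.
function spawn_only!(y, x)  # hypothetical name, for illustration
    Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

@time spawn_only!(y, x)    # returns after merely scheduling the task
@time parallel_add!(y, x)  # blocks in wait(t) until the work completes

That is why the benchmark loop piles up tasks: each timed call appears nearly free, so BenchmarkTools keeps running more of them.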
Thanks for the effort in clarifying this.
Just to close this out, on the vague chance anyone was interested:
using BenchmarkTools
using CUDAdrv, CUDAnative, CuArrays

N = 2^20
x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)  # a vector filled with 2.0

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

@info "Sequential add: "
@btime sequential_add!(y, x)
# 149.800 μs (0 allocations: 0 bytes)

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2.0f0)
@info "Parallel add 1 (@threads): "
@btime parallel_add!(y, x)
# 51.800 μs (29 allocations: 3.44 KiB)

function parallel_add2!(y, x)
    t = Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    wait(t)
    return nothing
end

fill!(y, 2.0f0)
@info "Parallel add 2 (@spawn): "
@btime parallel_add2!(y, x)
# 157.400 μs (9 allocations: 912 bytes)

x_d = CuArrays.fill(1.0f0, N)
y_d = CuArrays.fill(2.0f0, N)

function add_broadcast!(y, x)
    CuArrays.@sync y .+= x
    return
end

@info "Broadcast GPU array add (@sync): "
@btime add_broadcast!(y_d, x_d)
# 68.001 μs (61 allocations: 2.34 KiB)

function gpu_add1!(y, x)
    for i = 1:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y_d, 2.0f0)
@info "GPU kernel add (@cuda): "
@btime @cuda gpu_add1!(y_d, x_d)
# 5.217 μs (48 allocations: 1.59 KiB)
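For completeness, the CuArrays/CUDAnative tutorial this started from goes on to parallelize the kernel across GPU threads. A sketch along those lines, in the style of the tutorial's gpu_add2! (the thread count of 256 is the tutorial's illustrative choice, and as above @btime here mostly measures the launch, not the kernel itself):

# Sketch: each GPU thread handles a strided subset of the elements.
function gpu_add2!(y, x)
    index = threadIdx().x
    stride = blockDim().x
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y_d, 2.0f0)
@btime @cuda threads=256 gpu_add2!(y_d, x_d)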
So, today’s lesson is: whether on the CPU or GPU, there is probably a faster and a slower way of doing something. I expect that which is which in each case depends entirely on the application. Also, I need to put in more effort to understand the subtle differences.
Thanks again to @kristoffer.carlsson for helping explain the puzzle with benchmarking @spawn above. The explanation was quite educational.