I just set up a new machine with a CUDA GPU and started playing with the getting-started examples from CuArrays.jl when I ran into a strange problem, not with CuArrays itself, but with BenchmarkTools and Threads.
The following code is a slightly modified version of the example from CuArrays:
using BenchmarkTools
N = 2^20
x = fill(1.0f0, N) # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N) # a vector filled with 2.0
function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end
@btime sequential_add!(y, x)
function parallel_add!(y, x)
    Threads.@spawn for i in eachindex(y, x) # The original Threads.@threads for i in eachindex(y, x) works
        @inbounds y[i] += x[i]
    end
    return nothing
end
fill!(y, 2.0f0)
@btime parallel_add!(y, x)
On my machine (Windows 10, Julia 1.3, official binaries), this quite repeatably hangs Julia on the last line. Well, at least the CPU jumps to ~60% for 30-40 seconds, then drops to ~30% and stays there for a minute or two, until my patience runs out and I kill the process.
Using @time instead of @btime or @benchmark works fine, as does reverting the modified line in parallel_add! to its original Threads.@threads form. It seems to be only the combination of @spawn with @btime or @benchmark that causes the “hang”.
Could anyone please either explain what is happening, or point me to somewhere I can read up on this?
Thanks!
You are spawning one task for every element in the array, and each task then does a single addition. That will cause tremendous overhead! Maybe you meant to use the Threads.@threads macro, which divides the loop into equal chunks and runs the chunks in parallel. Otherwise, you need more work per task for it to be beneficial.
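For illustration, here is a minimal sketch of the "more work per task" idea (the name chunked_parallel_add! and the chunking scheme are hypothetical, not from this thread): split the index range into a few chunks and spawn one task per chunk:
function chunked_parallel_add!(y, x; ntasks = Threads.nthreads())
    chunksize = cld(length(eachindex(y, x)), ntasks)
    tasks = map(Iterators.partition(eachindex(y, x), chunksize)) do idxs
        Threads.@spawn for i in idxs  # one task per chunk, not per element
            @inbounds y[i] += x[i]
        end
    end
    foreach(wait, tasks)  # block until every chunk has finished
    return nothing
end
Each task now performs roughly N/ntasks additions, so the scheduling cost of a task is amortized over a meaningful amount of work.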
Well that certainly wasn’t the intention… Thanks. A little reading on my part seems to be required so I don’t tie myself into knots with this again.
Thanks!
Actually, rereading a bit more carefully, I might have been wrong. I wonder if this shouldn’t just have spawned one task that did the whole loop?
Just to clarify though: why does this apparently not affect @time, but only @btime? Is it the multiple executions triggered by @btime?
OK, giving it another try. I think what happens here is that nothing waits on the task you spawn, so the function returns immediately, and BenchmarkTools creates a huge number of tasks in its benchmark loop because it cannot know how long a task actually takes to run. (With @time the expression runs only once, so only a single orphaned task is created.)
Something like
function parallel_add!(y, x)
    t = Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    wait(t)  # block until the spawned task has finished
    return nothing
end
seems to work.
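As a side note, the same thing can be written with Base’s @sync block macro, which waits for all tasks spawned lexically inside it (just a sketch of the equivalent form; parallel_add_sync! is a made-up name):
function parallel_add_sync!(y, x)
    @sync Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end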
Thanks for the effort in clarifying this.
Just to close this out, on the vague chance anyone was interested:
using BenchmarkTools
using CUDAdrv, CUDAnative, CuArrays
N = 2^20
x = fill(1.0f0, N) # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N) # a vector filled with 2.0
function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end
@info "Sequential add: "
@btime sequential_add!(y, x)
# 149.800 μs (0 allocations: 0 bytes)
function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end
fill!(y, 2.0f0)
@info "Parallel add 1 (@threads): "
@btime parallel_add!(y, x)
# 51.800 μs (29 allocations: 3.44 KiB)
function parallel_add2!(y, x)
    t = Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    wait(t)
    return nothing
end
fill!(y, 2.0f0)
@info "Parallel add 2 (@spawn): "
@btime parallel_add2!(y, x)
# 157.400 μs (9 allocations: 912 bytes)
x_d = CuArrays.fill(1.0f0, N)
y_d = CuArrays.fill(2.0f0, N)
function add_broadcast!(y, x)
    CuArrays.@sync y .+= x
    return
end
@info "Broadcast GPU array add (@sync): "
@btime add_broadcast!(y_d, x_d)
# 68.001 μs (61 allocations: 2.34 KiB)
function gpu_add1!(y, x)
    for i = 1:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end
fill!(y_d, 2.0f0)
@info "GPU kernel add (@cuda): "
@btime @cuda gpu_add1!(y_d, x_d)
# 5.217 μs (48 allocations: 1.59 KiB)
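One caveat on that last number (the helper gpu_add1_sync! below is my own sketch, not part of the original script): @cuda launches the kernel asynchronously, so @btime there is mostly timing the kernel launch rather than the kernel itself. To time the kernel to completion, the launch can be wrapped in CuArrays.@sync, mirroring the broadcast version above:
function gpu_add1_sync!(y, x)
    CuArrays.@sync @cuda gpu_add1!(y, x)  # synchronize after the launch
    return nothing
end
fill!(y_d, 2.0f0)
@btime gpu_add1_sync!(y_d, x_d)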
So, today’s lesson is: whether on the CPU or GPU, there is probably a faster and a slower way of doing something. I expect that which is which in each case depends entirely on the application. Also, I need to put in more effort to understand the subtle differences.
Thanks again to @kristoffer.carlsson for helping explain the puzzle with benchmarking @spawn above. The explanation was quite educational.