@spawn and @btime/@benchmark cause Julia to "hang"...?

I just set up a new machine with a CUDA GPU and started playing with the getting started examples from CuArrays.jl when I ran into a strange problem, not linked to CuArrays, but with BenchmarkTools and Threads.

The following code is a slightly modified version of the example from CuArrays:

using BenchmarkTools

N = 2^20
x = fill(1.0f0, N) # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)  # a vector filled with 2.0

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

@btime sequential_add!(y, x)

function parallel_add!(y, x)
    Threads.@spawn for i in eachindex(y, x) #The original Threads.@threads for i in eachindex(y, x) works
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2.0f0)
@btime parallel_add!(y, x)

On my machine (Windows 10, Julia 1.3, official binaries), this quite repeatably hangs Julia on the last line. Well, at least the CPU jumps to ~60% for 30-40 seconds, then drops to ~30% and stays there for a minute or two until my patience runs out and I kill the process.

Using @time instead of @btime or @benchmark works fine, as does returning the modified line in parallel_add! to its original form of using Threads.@threads instead of Threads.@spawn. It seems to only be the combination of @spawn and @btime or @benchmark that causes the “hang”.

Could anyone please either explain what is happening, or point me to somewhere I can read up on this?

Thanks!

You are spawning one task for every element in the array, and each task then does one addition. That will cause a tremendous overhead! Maybe you meant to use the Threads.@threads macro, which divides the loop into equal chunks and runs the chunks in parallel. Otherwise, you need to have more work per task for it to be beneficial.
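For reference, a minimal sketch (not from the thread) of the "more work per task" idea: spawn one task per chunk rather than per element, then wait on all of them. The function name chunked_add! and the chunk-per-thread sizing are my own choices, not anything from the original posts.

```julia
# Hypothetical sketch: one task per chunk of the index range,
# with roughly one chunk per available thread.
function chunked_add!(y, x)
    n = length(y)
    nchunks = max(Threads.nthreads(), 1)
    chunks = Iterators.partition(eachindex(y, x), cld(n, nchunks))
    tasks = [Threads.@spawn begin
                 for i in chunk
                     @inbounds y[i] += x[i]
                 end
             end for chunk in chunks]
    foreach(wait, tasks)  # don't return until every chunk is done
    return nothing
end
```

This amortizes the task-creation overhead over many additions instead of paying it per element.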


Well that certainly wasn’t the intention… Thanks. A little reading on my part seems to be required so I don’t tie myself into knots with this again.

Thanks!

Actually, rereading a bit more carefully, I might have been wrong. Shouldn't this just have spawned one task that ran the whole loop?

Just to clarify though: why does this apparently not affect @time, but only @btime? Is it the multiple executions triggered by @btime?

Ok, giving it another try. I think what happens here is that nothing waits on the task you spawn, so the function returns immediately, and BenchmarkTools creates a huge number of tasks in its benchmark loop because it cannot know how long a task actually takes to run.
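A minimal sketch (my own, not from the thread) of that failure mode, using sleep as a stand-in for the loop body:

```julia
# The caller returns before the spawned task finishes,
# so any timing of the call measures only the spawn cost.
function fire_and_forget()
    Threads.@spawn sleep(1)   # no wait(): nothing blocks on this task
    return nothing
end

@time fire_and_forget()  # returns almost immediately; the 1 s sleep is not measured
```

Run many times in a tight benchmark loop, calls like this pile up unfinished tasks, which matches the observed "hang".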

Something like

function parallel_add!(y, x)
    t = Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    wait(t)
    return nothing
end

seems to work.


Thanks for the effort in clarifying this.

Just to close this out, on the vague chance anyone was interested:

using BenchmarkTools
using CUDAdrv, CUDAnative, CuArrays

N = 2^20
x = fill(1.0f0, N) # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)  # a vector filled with 2.0

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

@info "Sequential add: "
@btime sequential_add!(y, x)
#  149.800 μs (0 allocations: 0 bytes)

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2.0f0)
@info "Parallel add 1 (@threads): "
@btime parallel_add!(y, x)
#  51.800 μs (29 allocations: 3.44 KiB)

function parallel_add2!(y, x)
    t = Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    wait(t)
    return nothing
end

fill!(y, 2.0f0)
@info "Parallel add 2 (@spawn): "
@btime parallel_add2!(y, x)
# 157.400 μs (9 allocations: 912 bytes)

x_d = CuArrays.fill(1.0f0, N)
y_d = CuArrays.fill(2.0f0, N)

function add_broadcast!(y, x)
    CuArrays.@sync y .+= x
    return
end

@info "Broadcast GPU array add (@sync): "
@btime add_broadcast!(y_d, x_d)
# 68.001 μs (61 allocations: 2.34 KiB)

function gpu_add1!(y, x)
    for i = 1:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y_d, 2.0f0)
@info "GPU kernel add (@cuda): "
@btime @cuda gpu_add1!(y_d, x_d)
# 5.217 μs (48 allocations: 1.59 KiB)

So, today’s lesson is: whether on the CPU or the GPU, there is probably a faster and a slower way of doing something, and which is which in each case depends entirely on the application. Also, I need to put in more effort to understand the subtle differences.

Thanks again to @kristoffer.carlsson for helping explain the puzzle with benchmarking @spawn above. The explanation was quite educational.
