I just set up a new machine with a CUDA GPU and started playing with the getting-started examples from CuArrays.jl when I ran into a strange problem, not with CuArrays itself, but with BenchmarkTools and Threads.
The following code is a slightly modified version of the example from CuArrays:
using BenchmarkTools
N = 2^20
x = fill(1.0f0, N) # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N) # a vector filled with 2.0
function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end
@btime sequential_add!(y, x)
function parallel_add!(y, x)
    Threads.@spawn for i in eachindex(y, x) # The original Threads.@threads for i in eachindex(y, x) works
        @inbounds y[i] += x[i]
    end
    return nothing
end
fill!(y, 2.0f0)
@btime parallel_add!(y, x)
On my machine (Windows 10, Julia 1.3, official binaries), this quite repeatably hangs Julia on the last line. Well, at least the CPU jumps to ~60% for 30-40 seconds, then drops to ~30% and stays there for a minute or two, until my patience runs out and I kill the process.
Using @time instead of @btime or @benchmark works fine, as does reverting the modified line in parallel_add! to its original Threads.@threads form. It seems to be only the combination of @spawn with @btime or @benchmark that causes the “hang”.
Could anyone please either explain what is happening, or point me to somewhere I can read up on this?
Thanks!
You are spawning one task for every element in the array, and each task then does a single addition. That will cause tremendous overhead! Maybe you meant to use the Threads.@threads macro, which divides the loop into equal chunks and runs the chunks in parallel. Otherwise, you need more work per task for it to be beneficial.
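For illustration, here is a minimal sketch of the "more work per task" idea (the name chunked_parallel_add! and the chunking scheme are hypothetical, not from this thread): split the index range into a few chunks and spawn one task per chunk:
function chunked_parallel_add!(y, x; ntasks = Threads.nthreads())
    chunksize = cld(length(eachindex(y, x)), ntasks)
    tasks = map(Iterators.partition(eachindex(y, x), chunksize)) do idxs
        Threads.@spawn for i in idxs  # one task per chunk, not per element
            @inbounds y[i] += x[i]
        end
    end
    foreach(wait, tasks)  # block until every chunk has finished
    return nothing
end
Each task now performs roughly N/ntasks additions, so the scheduling cost of a task is amortized over a meaningful amount of work.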
Well that certainly wasn’t the intention… Thanks. A little reading on my part seems to be required so I don’t tie myself into knots with this again.
Thanks!
Actually, rereading a bit more carefully, I might have been wrong. I wonder if this shouldn’t just have spawned one task that did the whole loop?
Just to clarify though: why does this apparently not affect @time, but only @btime? Is it the multiple executions triggered by @btime?
OK, giving it another try. I think what happens here is that nothing waits on the task you spawn, so the function returns immediately, and BenchmarkTools creates a huge number of tasks in its benchmark loop because it cannot know how long a task actually takes to run. (With @time the expression runs only once, so only a single orphaned task is created.)
Something like
function parallel_add!(y, x)
    t = Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    wait(t)  # block until the spawned task has finished
    return nothing
end
seems to work.
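As a side note, the same thing can be written with Base’s @sync block macro, which waits for all tasks spawned lexically inside it (just a sketch of the equivalent form; parallel_add_sync! is a made-up name):
function parallel_add_sync!(y, x)
    @sync Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end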
Thanks for the effort in clarifying this.
Just to close this out, on the vague chance anyone was interested:
using BenchmarkTools
using CUDAdrv, CUDAnative, CuArrays
N = 2^20
x = fill(1.0f0, N) # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N) # a vector filled with 2.0
function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end
@info "Sequential add: "
@btime sequential_add!(y, x)
# 149.800 μs (0 allocations: 0 bytes)
function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end
fill!(y, 2.0f0)
@info "Parallel add 1 (@threads): "
@btime parallel_add!(y, x)
# 51.800 μs (29 allocations: 3.44 KiB)
function parallel_add2!(y, x)
    t = Threads.@spawn for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    wait(t)
    return nothing
end
fill!(y, 2.0f0)
@info "Parallel add 2 (@spawn): "
@btime parallel_add2!(y, x)
# 157.400 μs (9 allocations: 912 bytes)
x_d = CuArrays.fill(1.0f0, N)
y_d = CuArrays.fill(2.0f0, N)
function add_broadcast!(y, x)
    CuArrays.@sync y .+= x
    return
end
@info "Broadcast GPU array add (@sync): "
@btime add_broadcast!(y_d, x_d)
# 68.001 μs (61 allocations: 2.34 KiB)
function gpu_add1!(y, x)
    for i = 1:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end
fill!(y_d, 2.0f0)
@info "GPU kernel add (@cuda): "
@btime @cuda gpu_add1!(y_d, x_d)
# 5.217 μs (48 allocations: 1.59 KiB)
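One caveat on that last number (the helper gpu_add1_sync! below is my own sketch, not part of the original script): @cuda launches the kernel asynchronously, so @btime there is mostly timing the kernel launch rather than the kernel itself. To time the kernel to completion, the launch can be wrapped in CuArrays.@sync, mirroring the broadcast version above:
function gpu_add1_sync!(y, x)
    CuArrays.@sync @cuda gpu_add1!(y, x)  # synchronize after the launch
    return nothing
end
fill!(y_d, 2.0f0)
@btime gpu_add1_sync!(y_d, x_d)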
So, today’s lesson is: whether on the CPU or GPU, there is probably a faster and a slower way of doing something. I expect that which is which in each case depends entirely on the application. Also, I need to put in more effort to understand the subtle differences.
Thanks again to @kristoffer.carlsson for helping explain the puzzle with benchmarking @spawn above. The explanation was quite educational.