Atomic much slower than @spawn-ed tasks

The Julia manual offers two solutions to avoid a race condition when using @threads, namely @spawn-ed tasks and atomic operations. The manual states that the atomic approach “may be more performant depending on the characteristics of the operations”. However, for this example, the atomic approach is significantly slower. What is the reason for this huge discrepancy? When should the atomic approach be used? The results below are based on using 4 threads:

function sum_single(a)
    s = 0
    for i in a
        s += i
    end
    s
end
function sum_multi_good(a)
    chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
    tasks = map(chunks) do chunk
        Threads.@spawn sum_single(chunk)
    end
    chunk_sums = fetch.(tasks)
    return sum_single(chunk_sums)
end
@btime sum_multi_good(1:100_000_000)
2.217 μs (35 allocations: 2.56 KiB)
5000000050000000

function sum_atomic!(a)
    atomic = Threads.Atomic{Int}(0)
    Threads.@threads for id in a
        Threads.atomic_add!(atomic, id)
    end
    return atomic[]
end

@btime sum_atomic!(1:100_000_000)
760.684 ms (26 allocations: 2.33 KiB)
5000000050000000

There isn’t much more to it than:
Atomic operations are slow.
(just like locks are slow; indeed, a lot of locks are built on atomics).
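
For comparison, here is a minimal sketch (my own, not from the thread; the name sum_lock! is made up) of the same reduction guarded by a ReentrantLock. Since the lock is acquired once per element, it pays the same kind of per-element synchronization cost as the atomic version:

function sum_lock!(a)
    s = 0
    lk = ReentrantLock()
    Threads.@threads for i in a
        # every iteration pays to acquire and release the lock
        lock(lk) do
            s += i
        end
    end
    return s
end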

So if you can write your solution in a way that minimizes the need for them, you are almost always better off doing so.

But sometimes you can’t write your problem in a way that avoids them.

You could imagine a variant of your code which uses atomic_add! only to combine the per-chunk sums.
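
A minimal sketch of that variant (the name sum_chunked_atomic is mine, not from the thread): each task reduces its chunk into a local variable and touches the shared atomic only once, so the synchronization cost becomes negligible:

function sum_chunked_atomic(a)
    atomic = Threads.Atomic{Int}(0)
    chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            # plain, fast local accumulation within the chunk
            s = 0
            for i in chunk
                s += i
            end
            # a single atomic operation per chunk instead of one per element
            Threads.atomic_add!(atomic, s)
        end
    end
    foreach(wait, tasks)
    return atomic[]
end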

Your scenario here (the one from the manual) is in the category where not multithreading at all is better anyway. Because there is no real computation taking place, you end up paying the overhead without any real multithreading benefit.

There are scenarios where no flavor of multithreading is going to produce a benefit (or where multithreading is doing more harm than good).

When adding some real work, we can see the two approaches become indistinguishable (at least in this scenario):

using BenchmarkTools

function costly(x)
    # Burn some CPU time to simulate real per-element work; returns x unchanged.
    z = 0.0
    for _ in 1:2_000_000
        z += rand()
    end
    x
end

function sum_single(a)
    s = 0
    for i in a
        s += costly(i)
    end
    s
end

function sum_multi_good(a)
    chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
    tasks = map(chunks) do chunk
        Threads.@spawn sum_single(chunk)
    end
    chunk_sums = fetch.(tasks)
    return sum_single(chunk_sums)
end
@info "sum_multi_good: "
@btime sum_multi_good(1:10000)

function sum_atomic!(a)
    atomic = Threads.Atomic{Int}(0)
    Threads.@threads for id in a
        Threads.atomic_add!(atomic, costly(id))
    end
    return atomic[]
end

@info "sum_atomic!: "
@btime sum_atomic!(1:10000)

@info "sum_single: "
@time "regular: " sum_single(1:10000)

Running with julia -t 4 atomicplay.jl will produce (on my machine):

[ Info: sum_multi_good: 
  4.382 s (37 allocations: 2.61 KiB)
[ Info: sum_atomic!: 
  4.370 s (27 allocations: 2.36 KiB)
[ Info: sum_single: 
regular: 16.596185 seconds

Note that sleeping is a bad example because sleeping doesn’t take compute; when your program sleeps, the scheduler will immediately run another task on that core.
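
A quick way to see this (my own illustration, assuming a session started with julia -t 1): spawning four sleeping tasks takes about one second in total, not four, because a sleeping task yields its core back to the scheduler.

@time begin
    tasks = [Threads.@spawn sleep(1) for _ in 1:4]
    foreach(wait, tasks)
end   # ≈ 1 second, not ≈ 4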


I don’t think there is a real difference in this scenario, but I changed the example to avoid sleep altogether.
