Manual reduction is much faster than atomic_add! with @threads

In these examples, manually splitting and reducing the sum scales well, while using atomic operations is prohibitively slow:

using Base.Threads, BenchmarkTools, Test

function f_manual()
    total_Sum = zeros(nthreads())
    @threads for i in 1:10000000
        total_Sum[threadid()] += sin(i)
    end
    sum(total_Sum)
end

function f_atomic()
    total_Sum = Atomic{Float64}(0.)
    @threads for i in 1:10000000
        atomic_add!(total_Sum,sin(i))
    end
    total_Sum.value
end

@test f_manual() ≈ f_atomic()

print("manual: "); @btime f_manual();
print("atomic: "); @btime f_atomic();

Result:

leandro@pitico:~/Drive/Work/JuliaPlay% julia -t8 ompt.jl 
manual:   99.064 ms (44 allocations: 3.84 KiB)
atomic:   993.152 ms (44 allocations: 3.72 KiB)

Is it that I don’t know how to use atomic operations, or is there another explanation?

For comparison, this is the result with a single thread:

leandro@pitico:~/Drive/Work/JuliaPlay% julia ompt.jl 
manual:   295.531 ms (7 allocations: 640 bytes)
atomic:   415.734 ms (7 allocations: 560 bytes)

Think of Atomic as locking without having to manage the lock. It’s a much faster, hardware-implemented lock. You still can’t magically simultaneously update a single memory location in parallel.
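The analogy can be made concrete: `atomic_add!` has the same semantics as wrapping the update in a lock, just implemented in hardware. A minimal sketch (function names are my own, not from the thread) showing that both serialize every single update:

```julia
using Base.Threads

# Explicit lock we have to manage ourselves: one thread at a time
# may execute the critical section.
function sum_with_lock(n)
    total = Ref(0.0)
    lk = ReentrantLock()
    @threads for i in 1:n
        lock(lk) do
            total[] += sin(i)   # critical section
        end
    end
    total[]
end

# Same serialization, but the "lock" is a hardware atomic instruction.
function sum_with_atomic(n)
    total = Atomic{Float64}(0.0)
    @threads for i in 1:n
        atomic_add!(total, sin(i))  # still one update at a time
    end
    total[]
end
```

Either way, the memory location is updated one thread at a time; the atomic version only removes the lock-management overhead, not the contention.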


Yeah, I think I don’t understand what Atomic is for, because the example given in the manual is:


julia> acc = Atomic{Int64}(0)
Atomic{Int64}(0)

julia> @threads for i in 1:1000
          atomic_add!(acc, 1)
       end

julia> acc[]
1000

I was imagining that its purpose was not only to avoid a race condition (by serializing access to the variable), but also to automatically do the splitting in memory and the reduction.

Clearly that is not the case.

Is there a syntax (available or in development) to implement a fast splitting/reduction, for example like the @reduce macro from FLoops?
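For reference, the split/reduce pattern can also be written by hand in base Julia with tasks, without any atomics: each task reduces its own chunk locally and the partial sums are combined once at the end. A sketch (the chunking scheme is my own choice):

```julia
using Base.Threads

# Manual split/reduce: one task per chunk, each sums its chunk
# locally, then the partial results are combined once.
function f_spawn(N)
    nchunks = nthreads()
    chunks = Iterators.partition(1:N, cld(N, nchunks))
    tasks = map(chunks) do chunk
        @spawn sum(sin, chunk)   # local reduction, no shared state
    end
    sum(fetch, tasks)            # combine the partial sums
end
```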

There are awesome videos on OpenMP; in particular, this lecture: Introduction to OpenMP: 07 Module 4 - YouTube explains the idea of Barrier/Critical/Atomic. Of course, it’s all in relation to OpenMP, but I have a feeling that the same ideas hold for Julia’s atomic_add! and the other atomic functions.

By the way, I think that a vanilla Julia translation of the OpenMP version of your program would be something like this (with round-robin load balancing):

function f(n = 1000)
    res = Threads.Atomic{Float64}(0.0)
    @sync for _ in 1:Threads.nthreads()
        Threads.@spawn begin
            id = Threads.threadid()
            nth = Threads.nthreads()
            i = id
            s = 0.0
            while i <= n
                s += sin(i)
                i += nth
            end
            Threads.atomic_add!(res, s)
        end
    end
    return res[]
end

julia> @time f(1_000_000_000)
  9.510698 seconds (41 allocations: 2.453 KiB)
0.42129448674568837

julia> Threads.nthreads()
4

I wonder how Julia avoids false sharing?
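It doesn’t automatically: in f_manual the per-thread accumulators sit in adjacent array slots, so several of them can share one cache line and threads keep invalidating each other’s copies. A common workaround (a sketch; the 64-byte cache-line size is an assumption, and the threadid()-indexing pattern mirrors f_manual) is to pad the accumulators apart:

```julia
using Base.Threads

# Pad each thread's accumulator onto its own cache line
# (assumed 64 bytes = 8 Float64 slots) to avoid false sharing.
function f_padded(N)
    pad = 8                       # 64-byte line / 8-byte Float64
    acc = zeros(nthreads() * pad)
    @threads for i in 1:N
        acc[(threadid() - 1) * pad + 1] += sin(i)
    end
    sum(acc)
end
```

Accumulating into a task-local variable, as in the @spawn version above each thread’s `s`, sidesteps the problem entirely, since the hot accumulator lives in a register.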


Just for playing around, this is how your version (f_omp) compares with the other alternatives:

leandro@pitico:~/Drive/Work/JuliaPlay% julia -t8 ompt.jl 
manual:   73.762 ms (44 allocations: 3.84 KiB)
atomic:   675.404 ms (43 allocations: 3.69 KiB)
floop:   59.971 ms (63 allocations: 3.67 KiB)
f_omp:   54.586 ms (70 allocations: 4.47 KiB)

Code
using Base.Threads, BenchmarkTools, Test, FLoops

function f_manual(N)
    total_Sum = zeros(nthreads())
    @threads for i in 1:N
        total_Sum[threadid()] += sin(i)
    end
    sum(total_Sum)
end

function f_atomic(N)
    total_Sum = Atomic{Float64}(0.)
    @threads for i in 1:N
        atomic_add!(total_Sum,sin(i))
    end
    total_Sum.value
end

function f_floop(N)
    @floop for i in 1:N
        @reduce(total_Sum += sin(i))
    end
    total_Sum
end

function f_omp(N)
    res = Threads.Atomic{Float64}(0.0)
    @sync for _ in 1:Threads.nthreads()
        Threads.@spawn begin
            id = Threads.threadid()
            nth = Threads.nthreads()
            i = id
            s = 0.0
            while i <= N
                s += sin(i)
                i += nth
            end
            Threads.atomic_add!(res, s)
        end
    end
    return res[]
end

N = 10_000_000

@test f_manual(N) ≈ f_atomic(N) ≈ f_floop(N) ≈ f_omp(N)

print("manual: "); @btime f_manual($N);
print("atomic: "); @btime f_atomic($N);
print("floop: "); @btime f_floop($N);
print("f_omp: "); @btime f_omp($N);


Sorry for asking, but what is the result of the OpenMP Fortran version, for large enough N?

For N=1000000000

The Fortran version takes 0m15,044s (measured with the system’s time).

Your version takes 0m7,201s :open_mouth:

Also measured from outside, with time julia -t8 ...

For the record, this is the Fortran version:

Code
PROGRAM Parallel_Hello_World
USE OMP_LIB

REAL*8 :: partial_Sum, total_Sum

! Initialize the shared sum once, before the parallel region: doing it
! inside the region is a race (a late-starting thread could zero it
! after another thread has already added its partial sum in CRITICAL).
total_Sum = 0.d0

!$OMP PARALLEL PRIVATE(partial_Sum) SHARED(total_Sum)
    partial_Sum = 0.d0

    !$OMP DO
    DO i=1,1000000000
        partial_Sum = partial_Sum + dsin(dble(i))
    END DO
    !$OMP END DO

    !$OMP CRITICAL
        total_Sum = total_Sum + partial_Sum
    !$OMP END CRITICAL

!$OMP END PARALLEL
PRINT *, "Total Sum: ", total_Sum

END


Reduction using atomics (or locks) is unlikely to perform well unless they are used very infrequently, as in Skoffer’s code. However, this approach has another downside: the result is not deterministic for non-associative operations like floating-point + (atomic_add!), since the order of the updates depends on thread scheduling. This does not occur with the approach used in packages like FLoops.jl and Folds.jl, where the result is deterministic by default.
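To make the last point concrete: a reduction is deterministic when the chunking and combination order depend only on the problem size, never on the number of threads or on scheduling. A sketch of such a divide-and-conquer sum (the `basesize` cutoff is an arbitrary choice of mine, not the FLoops.jl implementation):

```julia
using Base.Threads

# Deterministic parallel sum: the recursive split depends only on the
# range and basesize, so the floating-point additions always happen in
# the same tree order, whatever the thread count or schedule.
function dsum(f, range; basesize = 10_000)
    if length(range) <= basesize
        return sum(f, range)               # sequential leaf
    end
    mid = (first(range) + last(range)) ÷ 2
    left = @spawn dsum(f, first(range):mid; basesize)
    right = dsum(f, (mid + 1):last(range); basesize)
    return fetch(left) + right             # fixed combination order
end
```

Running it twice gives bit-identical results, which an atomic_add!-based reduction cannot guarantee.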
