Manual reduction is much faster than atomic_add! with @threads

In these examples, manually splitting and reducing the sum scales well, while using atomic operations is prohibitively slow:

using Base.Threads, BenchmarkTools, Test

function f_manual()
    total_Sum = zeros(nthreads())
    @threads for i in 1:10000000
        total_Sum[threadid()] += sin(i)
    end
    sum(total_Sum)
end

function f_atomic()
    total_Sum = Atomic{Float64}(0.)
    @threads for i in 1:10000000
        atomic_add!(total_Sum,sin(i))
    end
    total_Sum.value
end

@test f_manual() ≈ f_atomic()

print("manual: "); @btime f_manual();
print("atomic: "); @btime f_atomic();

Result:

leandro@pitico:~/Drive/Work/JuliaPlay% julia -t8 ompt.jl 
manual:   99.064 ms (44 allocations: 3.84 KiB)
atomic:   993.152 ms (44 allocations: 3.72 KiB)

Is it that I don’t know how to use atomic operations, or is there another explanation?

For comparison, this is the result with a single thread:

leandro@pitico:~/Drive/Work/JuliaPlay% julia ompt.jl 
manual:   295.531 ms (7 allocations: 640 bytes)
atomic:   415.734 ms (7 allocations: 560 bytes)

Think of Atomic as locking without having to manage the lock. It’s a much faster, hardware-implemented lock. You still can’t magically simultaneously update a single memory location in parallel.
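The analogy can be made concrete: `atomic_add!` has the same semantics as wrapping the update in a lock, just implemented in hardware. A minimal sketch (function names are my own, not from the thread) showing that both serialize every single update:

```julia
using Base.Threads

# Explicit lock we have to manage ourselves: one thread at a time
# may execute the critical section.
function sum_with_lock(n)
    total = Ref(0.0)
    lk = ReentrantLock()
    @threads for i in 1:n
        lock(lk) do
            total[] += sin(i)   # critical section
        end
    end
    total[]
end

# Same serialization, but the "lock" is a hardware atomic instruction.
function sum_with_atomic(n)
    total = Atomic{Float64}(0.0)
    @threads for i in 1:n
        atomic_add!(total, sin(i))  # still one update at a time
    end
    total[]
end
```

Either way, the memory location is updated one thread at a time; the atomic version only removes the lock-management overhead, not the contention.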


Yeah, I think I don’t understand what Atomic is for, because the example given in the manual is:


julia> acc = Atomic{Int64}(0)
Atomic{Int64}(0)

julia> @threads for i in 1:1000
          atomic_add!(acc, 1)
       end

julia> acc[]
1000

I was imagining that its purpose was not only to avoid a race condition (by serializing access to the variable), but also to automatically do the splitting in memory and the reduction.

Clearly that is not the case.

Is there a syntax (available or in development) to implement a fast splitting/reduction, for example like the @reduce macro from FLoops?
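For reference, the split/reduce pattern can also be written by hand in base Julia with tasks, without any atomics: each task reduces its own chunk locally and the partial sums are combined once at the end. A sketch (the chunking scheme is my own choice):

```julia
using Base.Threads

# Manual split/reduce: one task per chunk, each sums its chunk
# locally, then the partial results are combined once.
function f_spawn(N)
    nchunks = nthreads()
    chunks = Iterators.partition(1:N, cld(N, nchunks))
    tasks = map(chunks) do chunk
        @spawn sum(sin, chunk)   # local reduction, no shared state
    end
    sum(fetch, tasks)            # combine the partial sums
end
```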

There are awesome videos on OpenMP; in particular, this lecture: Introduction to OpenMP: 07 Module 4 - YouTube explains the idea of Barrier/Critical/Atomic. Of course, it’s all in relation to OpenMP, but I have a feeling that the same ideas hold for Julia’s atomic_add! and the other atomic functions.

By the way, I think that a vanilla Julia translation of the OpenMP version of your program would be something like this (with round-robin load balancing):

function f(n = 1000)
    res = Threads.Atomic{Float64}(0.0)
    @sync for _ in 1:Threads.nthreads()
        Threads.@spawn begin
            id = Threads.threadid()
            nth = Threads.nthreads()
            i = id
            s = 0.0
            while i <= n
                s += sin(i)
                i += nth
            end
            Threads.atomic_add!(res, s)
        end
    end
    return res[]
end

julia> @time f(1_000_000_000)
  9.510698 seconds (41 allocations: 2.453 KiB)
0.42129448674568837

julia> Threads.nthreads()
4

I wonder how Julia avoids false sharing?
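It doesn’t automatically: in f_manual the per-thread accumulators sit in adjacent array slots, so several of them can share one cache line and threads keep invalidating each other’s copies. A common workaround (a sketch; the 64-byte cache-line size is an assumption, and the threadid()-indexing pattern mirrors f_manual) is to pad the accumulators apart:

```julia
using Base.Threads

# Pad each thread's accumulator onto its own cache line
# (assumed 64 bytes = 8 Float64 slots) to avoid false sharing.
function f_padded(N)
    pad = 8                       # 64-byte line / 8-byte Float64
    acc = zeros(nthreads() * pad)
    @threads for i in 1:N
        acc[(threadid() - 1) * pad + 1] += sin(i)
    end
    sum(acc)
end
```

Accumulating into a task-local variable, as in the @spawn version above each thread’s `s`, sidesteps the problem entirely, since the hot accumulator lives in a register.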


Just for playing around, this is how your version (f_omp) compares with the other alternatives:

leandro@pitico:~/Drive/Work/JuliaPlay% julia -t8 ompt.jl 
manual:   73.762 ms (44 allocations: 3.84 KiB)
atomic:   675.404 ms (43 allocations: 3.69 KiB)
floop:   59.971 ms (63 allocations: 3.67 KiB)
f_omp:   54.586 ms (70 allocations: 4.47 KiB)

Code
using Base.Threads, BenchmarkTools, Test, FLoops

function f_manual(N)
    total_Sum = zeros(nthreads())
    @threads for i in 1:N
        total_Sum[threadid()] += sin(i)
    end
    sum(total_Sum)
end

function f_atomic(N)
    total_Sum = Atomic{Float64}(0.)
    @threads for i in 1:N
        atomic_add!(total_Sum,sin(i))
    end
    total_Sum.value
end

function f_floop(N)
    @floop for i in 1:N
        @reduce(total_Sum += sin(i))
    end
    total_Sum
end

function f_omp(N)
    res = Threads.Atomic{Float64}(0.0)
    @sync for _ in 1:Threads.nthreads()
        Threads.@spawn begin
            id = Threads.threadid()
            nth = Threads.nthreads()
            i = id
            s = 0.0
            while i <= N
                s += sin(i)
                i += nth
            end
            Threads.atomic_add!(res, s)
        end
    end
    return res[]
end

N = 10_000_000

@test f_manual(N) ≈ f_atomic(N) ≈ f_floop(N) ≈ f_omp(N)

print("manual: "); @btime f_manual($N);
print("atomic: "); @btime f_atomic($N);
print("floop: "); @btime f_floop($N);
print("f_omp: "); @btime f_omp($N);


Sorry for asking, but what is the result of the OpenMP Fortran version, for large enough N?

For N=1000000000

The Fortran version takes 0m15,044s (measured with the system’s time).

Your version takes 0m7,201s :open_mouth:

Also measured from outside, with time julia -t8 ...

For the record, this is the Fortran version:

Code
PROGRAM Parallel_Hello_World
USE OMP_LIB

REAL*8 :: partial_Sum, total_Sum

! Initialize the shared sum once, before the parallel region: doing it
! inside the region is a race (a late-starting thread could zero it
! after another thread has already added its partial sum in CRITICAL).
total_Sum = 0.d0

!$OMP PARALLEL PRIVATE(partial_Sum) SHARED(total_Sum)
    partial_Sum = 0.d0

    !$OMP DO
    DO i=1,1000000000
        partial_Sum = partial_Sum + dsin(dble(i))
    END DO
    !$OMP END DO

    !$OMP CRITICAL
        total_Sum = total_Sum + partial_Sum
    !$OMP END CRITICAL

!$OMP END PARALLEL
PRINT *, "Total Sum: ", total_Sum

END


Reduction using atomics (or locks) is unlikely to perform well unless they are used very infrequently, as in Skoffer’s code. However, this approach has another downside: the result is not deterministic for non-associative operations like floating-point + (atomic_add!), since the order of the updates depends on thread scheduling. This does not occur with the approach used in packages like FLoops.jl and Folds.jl, where the result is deterministic by default.
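To make the last point concrete: a reduction is deterministic when the chunking and combination order depend only on the problem size, never on the number of threads or on scheduling. A sketch of such a divide-and-conquer sum (the `basesize` cutoff is an arbitrary choice of mine, not the FLoops.jl implementation):

```julia
using Base.Threads

# Deterministic parallel sum: the recursive split depends only on the
# range and basesize, so the floating-point additions always happen in
# the same tree order, whatever the thread count or schedule.
function dsum(f, range; basesize = 10_000)
    if length(range) <= basesize
        return sum(f, range)               # sequential leaf
    end
    mid = (first(range) + last(range)) ÷ 2
    left = @spawn dsum(f, first(range):mid; basesize)
    right = dsum(f, (mid + 1):last(range); basesize)
    return fetch(left) + right             # fixed combination order
end
```

Running it twice gives bit-identical results, which an atomic_add!-based reduction cannot guarantee.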
