In these examples, manually splitting the sum into per-thread accumulators and reducing at the end scales well, while using an atomic operation on every iteration is prohibitively slow:
using Base.Threads, BenchmarkTools, Test

function f_manual()
    total_Sum = zeros(nthreads())      # one accumulator per thread
    @threads for i in 1:10000000
        total_Sum[threadid()] += sin(i)
    end
    sum(total_Sum)                     # final reduction
end

function f_atomic()
    total_Sum = Atomic{Float64}(0.)
    @threads for i in 1:10000000
        atomic_add!(total_Sum, sin(i)) # one atomic update per iteration
    end
    total_Sum.value
end

@test f_manual() ≈ f_atomic()
print("manual: "); @btime f_manual();
print("atomic: "); @btime f_atomic();
Result:
leandro@pitico:~/Drive/Work/JuliaPlay% julia -t8 ompt.jl
manual: 99.064 ms (44 allocations: 3.84 KiB)
atomic: 993.152 ms (44 allocations: 3.72 KiB)
Is it just me not knowing how to use atomic operations, or is there an explanation for this?
For comparison, this is the result with a single thread:
leandro@pitico:~/Drive/Work/JuliaPlay% julia ompt.jl
manual: 295.531 ms (7 allocations: 640 bytes)
atomic: 415.734 ms (7 allocations: 560 bytes)
Think of Atomic as locking without having to manage the lock: it's a much faster, hardware-implemented lock. You still can't magically update a single memory location from multiple threads in parallel.
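To make the analogy concrete, here is a minimal sketch (a hypothetical `f_lock`, not code from this thread) of the same reduction using an explicit lock; like the atomic version, it serializes every single update, which is what kills the scaling:

using Base.Threads

function f_lock(N)
    total = Ref(0.0)
    lk = ReentrantLock()
    @threads for i in 1:N
        lock(lk) do
            total[] += sin(i)  # only one thread may execute this at a time
        end
    end
    return total[]
end

atomic_add! performs the same serialization in hardware, so each update is cheaper than taking a lock, but the threads still queue on a single memory location.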
Yeah, I think I don't understand what Atomic is for, because the example given in the manual is:
julia> acc = Atomic{Int64}(0)
Atomic{Int64}(0)

julia> @threads for i in 1:1000
           atomic_add!(acc, 1)
       end

julia> acc[]
1000
I was imagining that its purpose was not only to avoid a race condition (by blocking access to the variable), but also to automatically do the splitting in memory and the reduction.
Clearly that is not the case.
Is there a syntax (available or in development) to implement a fast splitting/reduction, such as the @reduce macro of FLoops?
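For reference, this is roughly what that FLoops syntax looks like (a minimal sketch assuming FLoops.jl is installed; the function name is made up): @reduce gives each task its own local accumulator and combines the partial results at the end, which is exactly the split/reduce pattern done by hand above.

using FLoops

function sum_sin_floop(N)
    @floop for i in 1:N
        @reduce(total_Sum += sin(i))  # task-local accumulator, merged at the end
    end
    return total_Sum
end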
There are awesome videos on OpenMP; in particular, this lecture: Introduction to OpenMP: 07 Module 4 - YouTube explains the idea of Barrier/Critical/Atomic. Of course, it's all in relation to OpenMP, but I have a feeling that the same ideas hold for Julia's atomic_add! and the other atomic functions.
By the way, I think a vanilla Julia equivalent of the OpenMP version of your program would be something like this (with round-robin balancing):
function f(n = 1000)
    res = Threads.Atomic{Float64}(0.0)
    @sync for t in 1:Threads.nthreads()
        Threads.@spawn begin
            nth = Threads.nthreads()
            # use the task index t rather than threadid(): tasks are not
            # pinned to threads, so threadid() is not a reliable offset
            i = t
            s = 0.0
            while i <= n
                s += sin(i)   # round-robin: each task takes every nth index
                i += nth
            end
            Threads.atomic_add!(res, s)   # a single atomic update per task
        end
    end
    return res[]
end
julia> @time f(1_000_000_000)
9.510698 seconds (41 allocations: 2.453 KiB)
0.42129448674568837
julia> Threads.nthreads()
4
Just for playing around, this is how your option (f_omp) compares with other alternatives:
leandro@pitico:~/Drive/Work/JuliaPlay% julia -t8 ompt.jl
manual: 73.762 ms (44 allocations: 3.84 KiB)
atomic: 675.404 ms (43 allocations: 3.69 KiB)
floop: 59.971 ms (63 allocations: 3.67 KiB)
f_omp: 54.586 ms (70 allocations: 4.47 KiB)
Code
using Base.Threads, BenchmarkTools, Test, FLoops

function f_manual(N)
    total_Sum = zeros(nthreads())
    @threads for i in 1:N
        total_Sum[threadid()] += sin(i)
    end
    sum(total_Sum)
end

function f_atomic(N)
    total_Sum = Atomic{Float64}(0.)
    @threads for i in 1:N
        atomic_add!(total_Sum, sin(i))
    end
    total_Sum.value
end

function f_floop(N)
    @floop for i in 1:N
        @reduce(total_Sum += sin(i))
    end
    total_Sum
end
function f_omp(N)
    res = Threads.Atomic{Float64}(0.0)
    @sync for t in 1:Threads.nthreads()
        Threads.@spawn begin
            nth = Threads.nthreads()
            i = t   # task index rather than threadid(): tasks are not pinned to threads
            s = 0.0
            while i <= N
                s += sin(i)
                i += nth
            end
            Threads.atomic_add!(res, s)
        end
    end
    return res[]
end
N = 10000000
@test f_manual(N) ≈ f_atomic(N) ≈ f_floop(N) ≈ f_omp(N)
print("manual: "); @btime f_manual($N);
print("atomic: "); @btime f_atomic($N);
print("floop: "); @btime f_floop($N);
print("f_omp: "); @btime f_omp($N);
The Fortran version takes 0m15,044s (measured with the system's time).
Your version takes 0m7,201s, also measured from the outside with time julia -t8 ...
For the record, this is the Fortran version:
Code
PROGRAM Parallel_Hello_World
USE OMP_LIB

REAL*8 :: partial_Sum, total_Sum

!$OMP PARALLEL PRIVATE(partial_Sum) SHARED(total_Sum)
partial_Sum = 0.d0
total_Sum = 0.d0

!$OMP DO
DO i = 1, 1000000000
   partial_Sum = partial_Sum + dsin(dble(i))
END DO
!$OMP END DO

!$OMP CRITICAL
total_Sum = total_Sum + partial_Sum
!$OMP END CRITICAL
!$OMP END PARALLEL

PRINT *, "Total Sum: ", total_Sum
END
Reduction using atomics (or locks) is unlikely to perform well unless they are used very infrequently, as in Skoffer's code. This approach also has another downside: the result is not deterministic for non-associative operations such as floating-point + (atomic_add!), because the order in which the partial sums are combined depends on the scheduler. This does not occur with the approach used in packages like FLoops.jl and Folds.jl, where the result is deterministic by default.
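To illustrate that last point, here is a minimal sketch (a hypothetical `f_deterministic`, not the actual FLoops.jl/Folds.jl implementation) of a deterministic parallel reduction: fixed contiguous chunks, one task per chunk, and partials combined in chunk order, so the floating-point result is reproducible from run to run:

using Base.Threads

function f_deterministic(N; ntasks = nthreads())
    # Split 1:N into fixed contiguous chunks, one per task.
    chunks = Iterators.partition(1:N, cld(N, ntasks))
    # Each task reduces its own chunk independently (no shared state).
    tasks = [Threads.@spawn sum(sin, chunk) for chunk in chunks]
    # Combine the partial sums in chunk order: the combination order is
    # fixed, so the result does not depend on thread scheduling.
    return sum(fetch, tasks)
end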