The issue is that atomic_min! is a write operation that needs to take ownership of the cache line. Many cores hitting the same cache line with atomic_min! will absolutely hammer it.
The easiest way would be to have stopCondition = Threads.Atomic{Bool}(false) and then check it in each loop iteration.
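For example, a minimal sketch of that pattern; my_candidates and is_good_enough are hypothetical placeholders, not from the original:

stopCondition = Threads.Atomic{Bool}(false)

tasks = map(1:Threads.nthreads()) do i
    Threads.@spawn for candidate in my_candidates(i)  # hypothetical per-task work split
        stopCondition[] && break          # plain atomic load: read-only, no cache-line ownership
        if is_good_enough(candidate)      # hypothetical success predicate
            stopCondition[] = true        # a single write; the other tasks see it shortly after
            break
        end
    end
end
foreach(wait, tasks)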
If you need the current known global minimum in order to speed up e.g. branch-and-bound algorithms, then you can do

globalMin::Threads.Atomic{UInt64}   # shared across all tasks
for candidate in candidates_in_this_task
    globalMin[] == 0 && break                        # 0 is the best possible value; stop searching
    result = compute_stuff(candidate, globalMin[])   # pass the current bound so the search can prune
    if result < globalMin[]                          # plain read first: cheap, no ownership needed
        Threads.atomic_min!(globalMin, result)       # take the cache line only on an actual improvement
    end
end
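To make that self-contained, here is a hedged sketch of how the loop might be driven; the initialization to typemax, the chunking via Iterators.partition, and the names all_candidates / chunk_size / compute_stuff are illustrative assumptions:

globalMin = Threads.Atomic{UInt64}(typemax(UInt64))   # shared, non-increasing bound

tasks = map(Iterators.partition(all_candidates, chunk_size)) do chunk
    Threads.@spawn for candidate in chunk
        globalMin[] == 0 && break
        result = compute_stuff(candidate, globalMin[])
        result < globalMin[] && Threads.atomic_min!(globalMin, result)
    end
end
foreach(wait, tasks)
best = globalMin[]

Note the plain read before Threads.atomic_min!: that is exactly the speculative pattern whose generated code is compared below.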
It is illustrative to compare the native code the two variants generate:
julia> atomic_variant(globMin, m) = begin Threads.atomic_min!(globMin, m); 0 end
atomic_variant (generic function with 1 method)
julia> @code_native atomic_variant(Threads.Atomic{Int}(0), 0)
.text
.file "atomic_variant"
.globl julia_atomic_variant_258 # -- Begin function julia_atomic_variant_258
.p2align 4, 0x90
.type julia_atomic_variant_258,@function
julia_atomic_variant_258: # @julia_atomic_variant_258
; ┌ @ REPL[29]:1 within `atomic_variant`
.cfi_startproc
# %bb.0: # %top
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
; │┌ @ atomics.jl:405 within `atomic_min!`
movq (%rdi), %rax
.p2align 4, 0x90
.LBB0_1: # %atomicrmw.start
# =>This Inner Loop Header: Depth=1
cmpq %rsi, %rax
movq %rsi, %rcx
cmovleq %rax, %rcx
lock cmpxchgq %rcx, (%rdi)
jne .LBB0_1
# %bb.2: # %atomicrmw.end
; │└
xorl %eax, %eax
popq %rbp
.cfi_def_cfa %rsp, 8
retq
.Lfunc_end0:
.size julia_atomic_variant_258, .Lfunc_end0-julia_atomic_variant_258
.cfi_endproc
; └
# -- End function
.section ".note.GNU-stack","",@progbits
julia> speculative_variant(globMin, m) = begin m < globMin[] && Threads.atomic_min!(globMin, m); 0 end
speculative_variant (generic function with 1 method)
julia> @code_native speculative_variant(Threads.Atomic{Int}(0), 0)
.text
.file "speculative_variant"
.globl julia_speculative_variant_261 # -- Begin function julia_speculative_variant_261
.p2align 4, 0x90
.type julia_speculative_variant_261,@function
julia_speculative_variant_261: # @julia_speculative_variant_261
; ┌ @ REPL[30]:1 within `speculative_variant`
.cfi_startproc
# %bb.0: # %top
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
; │┌ @ atomics.jl:358 within `getindex`
movq (%rdi), %rax
; │└
; │┌ @ int.jl:83 within `<`
cmpq %rsi, %rax
; │└
jle .LBB0_3
# %bb.1: # %L8
; │┌ @ atomics.jl:405 within `atomic_min!`
movq (%rdi), %rax
.p2align 4, 0x90
.LBB0_2: # %atomicrmw.start
# =>This Inner Loop Header: Depth=1
cmpq %rsi, %rax
movq %rsi, %rcx
cmovleq %rax, %rcx
lock cmpxchgq %rcx, (%rdi)
jne .LBB0_2
.LBB0_3: # %L13
; │└
xorl %eax, %eax
popq %rbp
.cfi_def_cfa %rsp, 8
retq
.Lfunc_end0:
.size julia_speculative_variant_261, .Lfunc_end0-julia_speculative_variant_261
.cfi_endproc
; └
# -- End function
.section ".note.GNU-stack","",@progbits
The speculative variant is only correct because your globalMin is non-increasing: a stale read can only be larger than the true current minimum, so skipping the update when result is not below it is always safe. But it is much faster, because it doesn't cause contention in the typical case where the CPU correctly speculates that you don't have to change it. (Even if you don't have to decrease globMin, you still cause cache contention whenever your CPU branch-predicts a decrease. Spectre FTW!)
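A quick way to see the difference is to hammer both variants from many threads. A rough benchmark sketch, assuming the two functions defined above and the BenchmarkTools package; absolute numbers will depend on core count and topology:

using Base.Threads, BenchmarkTools

function hammer(f, n)
    globMin = Atomic{Int}(typemax(Int))
    @threads for i in 1:n
        f(globMin, i)   # every thread updates the same Atomic
    end
    globMin[]
end

@btime hammer(atomic_variant, 10_000_000)       # every call runs the lock cmpxchg loop
@btime hammer(speculative_variant, 10_000_000)  # after warm-up, almost every call is a plain load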
PS: the above is for x86. I am not sure about the performance implications of the atomic load on ARM / PowerPC; you might need more tricks there. Sorry, I'm really not up to date with atomics performance on ARM.