@threads for loop to find maximum of function

The following code causes a type instability. Is there a better way to write this code? Perhaps using Threads.Atomic?

max_obj = -Inf64
xs = 0.0
Threads.@threads for i = 1:100
    obj, x = f(i)
    if obj >= max_obj
        max_obj = obj
        xs = x
    end
end

How can this work correctly with multi-threading? Also, why does it cause a type instability?

The goal is to find the value of i that maximizes f(i) in parallel with multi-threading without explicitly allocating an array to store all the values of f(i).

I understand the goal, but max_obj and xs are scalar values that are read and modified by multiple threads at the same time, which is a data race.

Yes, I understand that a race condition could occur, so I’m wondering how to do this correctly without explicitly allocating any memory?

For example, I could do

fv = zeros(100)
x = zeros(100)
Threads.@threads for i = 1:100
    fv[i], x[i] = f(i)
end
maximum(fv)

but I want to avoid the overhead of allocating the arrays fv and x.

Ok, so you want to parallelise only the evaluation of the function. Note that you don’t need to allocate an accumulator of 100 elements; one of nthreads() elements is enough:

julia> using Base.Threads, BenchmarkTools

julia> f(i) = sin(i) * log(i)
f (generic function with 1 method)

julia> function find_max_threads_acc()
           max_xs = fill(-Inf, nthreads())
           Threads.@threads for i in 1:1000
               x = f(i)
               if x > max_xs[threadid()]
                   max_xs[threadid()] = x
               end
           end
           return maximum(max_xs)
       end
find_max_threads_acc (generic function with 1 method)

julia> @btime find_max_threads_acc()
  9.303 μs (22 allocations: 1.97 KiB)
6.892393151350101

If you want to use atomics, you can do for example

julia> function find_max_threads_atomic()
           max_x = Threads.Atomic{Float64}(-Inf)
           Threads.@threads for i in 1:1000
               x = f(i)
               if x > max_x[]
                   max_x[] = x
               end
           end
           return max_x[]
       end
find_max_threads_atomic (generic function with 1 method)

julia> @btime find_max_threads_atomic()
  8.513 μs (22 allocations: 1.88 KiB)
6.892393151350101

For comparison, the non-threaded search:

julia> function find_max_no_threads()
           max_x = -Inf
           for i in 1:1000
               x = f(i)
               if x > max_x
                   max_x = x
               end
           end
           return max_x
       end
find_max_no_threads (generic function with 1 method)

julia> @btime find_max_no_threads()
  22.896 μs (0 allocations: 0 bytes)
6.892393151350101

I’m sure someone will come up with other, nicer solutions using higher-level packages.


Thank you! I’m wondering if the following can cause a race condition?

if x > max_x[]
    max_x[] = x
end

Let’s say we have max_x < x1 < x2, and we calculate x1 and x2 on different threads. Then x1 and x2 can both pass the if check before max_x is updated; if max_x := x2 happens first and max_x := x1 second, we end up with the wrong answer.
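(The read max_x[] and the write max_x[] = x are two separate atomic operations, so exactly that interleaving can happen.) A race-free scalar maximum could be sketched with a compare-and-swap retry loop, which is essentially what Threads.atomic_max! does; cas_max! is just an illustrative name:

```julia
using Base.Threads

# Sketch: lock-free maximum update via compare-and-swap.
# `Threads.atomic_cas!(m, old, new)` stores `new` only if `m[] == old`, and
# in all cases returns the value `m` held just before the call.
function cas_max!(m::Threads.Atomic{Float64}, x::Float64)
    old = m[]
    while x > old
        prev = Threads.atomic_cas!(m, old, x)
        prev == old && break  # our swap went through
        old = prev            # another thread updated first; retry against its value
    end
    return m[]
end
```

In the x1/x2 scenario above, whichever thread loses the swap re-checks against the winner’s value and only writes if its candidate is still an improvement.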

I know we could use Threads.atomic_max! instead but I also want to find the argmax that maximizes the function so that doesn’t solve the issue.
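One way to get the argmax as well, without atomics and without the length-100 arrays, is to let each task reduce its own chunk to a (value, index) pair and then reduce the handful of per-task pairs at the end. A sketch (tmax_argmax is a hypothetical name, not from any package):

```julia
using Base.Threads

# Sketch: split the range into one chunk per thread, let each task find its
# chunk's (best value, best index) pair, then reduce the per-task results.
function tmax_argmax(f, r)
    chunks = Iterators.partition(r, cld(length(r), nthreads()))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            best_i = first(chunk)
            best_v = f(best_i)
            for i in chunk
                v = f(i)
                if v > best_v
                    best_v, best_i = v, i
                end
            end
            (best_v, best_i)
        end
    end
    results = fetch.(tasks)                  # one (value, index) pair per task
    return results[argmax(first.(results))]  # overall (max, argmax)
end
```

Calling tmax_argmax(i -> sin(i) * log(i), 1:1000) returns both the maximum and the i attaining it, allocating only one small pair per task.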

My solution is to use Tullio.jl

julia> using Tullio

julia> f(i) = cos(i) * log(i);

julia> threadmax(f, v) = @tullio (max) out := f(v[i]) threads=length(v)÷Threads.nthreads()
threadmax (generic function with 1 method)

julia> @btime threadmax(f, 1:1000)
  10.030 μs (91 allocations: 4.25 KiB)
6.904336399050647

For comparison, here are Mose’s results on my computer:

julia> @btime find_max_threads_acc()
  12.080 μs (31 allocations: 2.84 KiB)
6.892393151350101

julia> @btime find_max_no_threads()
  14.920 μs (0 allocations: 0 bytes)
6.892393151350101

On my system, timings with 4 threads give the opposite result

julia> @btime threadmax(f, 1:1000)
  9.807 μs (92 allocations: 4.28 KiB)
6.904336399050647

julia> @btime find_max_threads_acc()
  9.408 μs (23 allocations: 1.98 KiB)
6.904336399050647

julia> @btime find_max_threads_atomic()
  8.599 μs (21 allocations: 1.84 KiB)
6.904336399050647

but the Tullio-based solution is definitely nicer to write

That example is a little small for threading.

julia> Threads.nthreads(), Sys.CPU_THREADS
(36, 36)

julia> @btime threadmax(f, 1:1000)
  54.724 μs (820 allocations: 38.41 KiB)
6.892393151350101

julia> @btime find_max_threads_acc()
  75.373 μs (182 allocations: 16.47 KiB)
6.892393151350101

julia> @btime find_max_no_threads()
  18.232 μs (0 allocations: 0 bytes)
6.892393151350101

julia> function avxmax(f, x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, f(x[i]))
           end
           m
       end
avxmax (generic function with 1 method)

julia> @btime avxmax(f, 1:1000)
  2.324 μs (0 allocations: 0 bytes)
6.8923931513501

SIMD is much faster here.


Aha, I had tried to get LoopVectorization working for this, but it was unhappy with how I wrote the loop. I meant to open an issue but didn’t get around to it.

An issue would be welcome. I’m curious what went wrong.

Opened an issue and then closed it because I realized what I did wrong.

For what it’s worth, I don’t see much performance advantage over the naive single threaded implementation on my computer (no avx 512 :frowning_face: )

julia> function avxmax(f, x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, f(x[i]))
           end
           m
       end
avxmax (generic function with 2 methods)

julia> @btime avxmax(f, 1:1000)
  11.770 μs (0 allocations: 0 bytes)
6.9043363990506466

I do see a benefit from manually inlining the function though (presumably so that LV can replace cos and log with their fast versions):

julia> function avxmax_f(x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, cos(x[i]) * log(x[i]))
           end
           m
       end;

julia> @btime avxmax_f(1:1000)
  8.036 μs (1 allocation: 16 bytes)
6.9043363990506466

=(

Mind running this:

julia> using VectorizationBase, SLEEFPirates

julia> vx = Vec(ntuple(_ -> 2rand(), VectorizationBase.pick_vector_width_val(Float64))...)
Vec{8,Float64}<1.1108350688792568, 1.340642196880145, 0.8214256639761288, 0.5208345017905756, 0.0878112025990947, 1.151752997769079, 1.1212221754046716, 0.9869372570799633>

julia> @benchmark sin($(Ref(vx))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.295 ns (0.00% GC)
  median time:      5.495 ns (0.00% GC)
  mean time:        5.506 ns (0.00% GC)
  maximum time:     15.802 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark log($(Ref(vx))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.866 ns (0.00% GC)
  median time:      7.895 ns (0.00% GC)
  mean time:        7.905 ns (0.00% GC)
  maximum time:     10.776 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> vxt = ntuple(vx, length(vx));

julia> @benchmark sin.($(Ref(vxt))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     45.142 ns (0.00% GC)
  median time:      48.400 ns (0.00% GC)
  mean time:        48.396 ns (0.00% GC)
  maximum time:     51.450 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> @benchmark log.($(Ref(vxt))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     41.010 ns (0.00% GC)
  median time:      42.421 ns (0.00% GC)
  mean time:        42.233 ns (0.00% GC)
  maximum time:     52.753 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

I’m curious if one or both happen to be much slower on your computer. Maybe there’s something obvious to fix in SLEEFPirates.jl.


It’s not actually replacing any special functions syntactically at the moment (i.e., it’s not cos -> cos_fast), just via dispatch.
But inlining still helps special functions in a loop, because there are a lot of constants used for evaluating them (polynomial coefficients in particular). If they get inlined, you get to hoist loading all of them out of the loop.


Here’s what I get

julia> using VectorizationBase, SLEEFPirates

julia> vx = Vec(ntuple(_ -> 2rand(), VectorizationBase.pick_vector_width_val(Float64))...)
Vec{4,Float64}<0.9671328433704423, 0.3756828607095555, 0.5049952428293949, 1.6386679997110134>

julia> @benchmark sin($(Ref(vx))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.147 ns (0.00% GC)
  median time:      9.208 ns (0.00% GC)
  mean time:        9.606 ns (0.00% GC)
  maximum time:     19.309 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark log($(Ref(vx))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.109 ns (0.00% GC)
  median time:      10.350 ns (0.00% GC)
  mean time:        10.423 ns (0.00% GC)
  maximum time:     18.939 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> vxt = ntuple(vx, length(vx));

julia> @benchmark sin.($(Ref(vxt))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.345 ns (0.00% GC)
  median time:      17.735 ns (0.00% GC)
  mean time:        20.624 ns (0.00% GC)
  maximum time:     53.176 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark log.($(Ref(vxt))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.681 ns (0.00% GC)
  median time:      22.820 ns (0.00% GC)
  mean time:        25.763 ns (0.00% GC)
  maximum time:     56.024 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996


Interesting. The scalar versions are about equally fast per element (4 per 17-22 ns vs 8 per 41-45 ns), but the SIMD version is more than 2x slower (4 per 9-10 ns vs 8 per 5-7 ns).

I guess this is likely mostly a Zen1 problem, like when we had to special-case reduction handling for Zen1 when doing the sum benchmarks.
For anyone reading who is unfamiliar with the problem: Zen1 has half-rate AVX2 (it didn’t actually have 256-bit units, but 128-bit units working together to emulate 256 bits).
Hence, while scalar performance is similar, it takes almost twice as long to evaluate the SIMD code (9-10 vs 5-7 ns). And compared to a computer with AVX512, evaluating that SIMD code also gets only half as much work done (4 instead of 8 elements).

I guess the inlined version is at least approaching 2x faster.

Out of curiosity, does avxmax(f, x) match avxmax_f when you define:

@inline f(i) = cos(i) * log(i);

?

EDIT:
I was benchmarking Mose’s f that used sin earlier. Seems cos is faster:

julia> @inline f_inline(i) = cos(i) * log(i);

julia> @inline f(i) = cos(i) * log(i);

julia> @btime avxmax(f, 1:1000)
  1.940 μs (0 allocations: 0 bytes)
6.9043363990506466

julia> @btime avxmax(f_inline, 1:1000)
  1.939 μs (0 allocations: 0 bytes)
6.9043363990506466

julia> function avxmax_f(x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, cos(x[i]) * log(x[i]))
           end
           m
       end
avxmax_f (generic function with 1 method)

julia> @btime avxmax_f(1:1000)
  1.894 μs (0 allocations: 0 bytes)
6.9043363990506466

Yes

julia> @inline f(i) = cos(i) * log(i);

julia> @btime avxmax(f, 1:1000)
  8.037 μs (0 allocations: 0 bytes)
6.9043363990506466

FYI there’s ThreadsX.maximum: GitHub - tkf/ThreadsX.jl (parallelized Base functions)

I’d also recommend using data-parallel frameworks (e.g., Tullio.jl, JuliaFolds/*.jl, etc.) instead of @threads for, which tightly couples the execution mechanism to the algorithm. That coupling makes switching to GPU and/or distributed computing hard, and it is very easy to introduce concurrency bugs. In the case of the example in the OP, it’d mean using mapreduce(f, max, 1:100) (if you don’t already have a parallelized version of maximum).
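A small note on that mapreduce formulation: since tuples compare lexicographically, mapping to (value, index) pairs and reducing with max yields the maximum and an index attaining it in one pass, which covers the argmax requirement from earlier in the thread. A sketch using Base’s serial mapreduce (ThreadsX.mapreduce should accept the same arguments for a threaded version):

```julia
# Tuples compare lexicographically, so reducing (value, index) pairs with
# `max` returns the maximum together with an index attaining it.
f(i) = sin(i) * log(i)
best_val, best_i = mapreduce(i -> (f(i), i), max, 1:100)
```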
