@threads for loop to find maximum of function

The following code causes a type instability. Is there a better way to write this code? Perhaps using Threads.Atomic?

max_obj = -Inf64
xs = 0.0
Threads.@threads for i = 1:100
    obj, x = f(i)
    if obj >= max_obj
        max_obj = obj
        xs = x
    end
end

How can this work correctly with multi-threading? Also, why does it cause a type instability?

The goal is to find the value of i that maximizes f(i) in parallel with multi-threading without explicitly allocating an array to store all the values of f(i).

I understand the goal, but max_obj and xs are scalar values that are read and modified by multiple threads at the same time, which is a data race.

Yes, I understand that a race condition could occur, so I’m wondering how to do this correctly without explicitly allocating any memory?

For example, I could do

fv = zeros(100)
x = zeros(100)
Threads.@threads for i = 1:100
    fv[i], x[i] = f(i)
end
maximum(fv)

but I want to avoid the overhead of allocating the arrays fv and x.

Ok, so you want to parallelise only the evaluation of the function. Note that you don’t need to allocate an accumulator of 100 elements; one of nthreads() elements is enough:

julia> using Base.Threads, BenchmarkTools

julia> f(i) = sin(i) * log(i)
f (generic function with 1 method)

julia> function find_max_threads_acc()
           max_xs = fill(-Inf, nthreads())
           Threads.@threads for i in 1:1000
               x = f(i)
               if x > max_xs[threadid()]
                   max_xs[threadid()] = x
               end
           end
           return maximum(max_xs)
       end
find_max_threads_acc (generic function with 1 method)

julia> @btime find_max_threads_acc()
  9.303 μs (22 allocations: 1.97 KiB)
6.892393151350101

If you want to use atomics, you can do for example

julia> function find_max_threads_atomic()
           max_x = Threads.Atomic{Float64}(-Inf)
           Threads.@threads for i in 1:1000
               x = f(i)
               if x > max_x[]
                   max_x[] = x
               end
           end
           return max_x[]
       end
find_max_threads_atomic (generic function with 1 method)

julia> @btime find_max_threads_atomic()
  8.513 μs (22 allocations: 1.88 KiB)
6.892393151350101

For comparison, the non-threaded search:

julia> function find_max_no_threads()
           max_x = -Inf
           for i in 1:1000
               x = f(i)
               if x > max_x
                   max_x = x
               end
           end
           return max_x
       end
find_max_no_threads (generic function with 1 method)

julia> @btime find_max_no_threads()
  22.896 μs (0 allocations: 0 bytes)
6.892393151350101

I’m sure someone will come up with other, nicer solutions using higher-level packages.


Thank you! I’m wondering if the following can cause a race condition?

if x > max_x[]
    max_x[] = x
end

Let’s say we have max_x < x1 < x2, and we calculate x1 and x2 on different threads. Then x1 and x2 can both pass the if check before max_x is updated; if max_x := x2 happens first and max_x := x1 second, we end up with the wrong answer.
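(The read max_x[] and the write max_x[] = x are two separate atomic operations, so exactly that interleaving can happen.) A race-free scalar maximum could be sketched with a compare-and-swap retry loop, which is essentially what Threads.atomic_max! does; cas_max! is just an illustrative name:

```julia
using Base.Threads

# Sketch: lock-free maximum update via compare-and-swap.
# `Threads.atomic_cas!(m, old, new)` stores `new` only if `m[] == old`, and
# in all cases returns the value `m` held just before the call.
function cas_max!(m::Threads.Atomic{Float64}, x::Float64)
    old = m[]
    while x > old
        prev = Threads.atomic_cas!(m, old, x)
        prev == old && break  # our swap went through
        old = prev            # another thread updated first; retry against its value
    end
    return m[]
end
```

In the x1/x2 scenario above, whichever thread loses the swap re-checks against the winner’s value and only writes if its candidate is still an improvement.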

I know we could use Threads.atomic_max! instead but I also want to find the argmax that maximizes the function so that doesn’t solve the issue.
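One way to get the argmax as well, without atomics and without the length-100 arrays, is to let each task reduce its own chunk to a (value, index) pair and then reduce the handful of per-task pairs at the end. A sketch (tmax_argmax is a hypothetical name, not from any package):

```julia
using Base.Threads

# Sketch: split the range into one chunk per thread, let each task find its
# chunk's (best value, best index) pair, then reduce the per-task results.
function tmax_argmax(f, r)
    chunks = Iterators.partition(r, cld(length(r), nthreads()))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            best_i = first(chunk)
            best_v = f(best_i)
            for i in chunk
                v = f(i)
                if v > best_v
                    best_v, best_i = v, i
                end
            end
            (best_v, best_i)
        end
    end
    results = fetch.(tasks)                  # one (value, index) pair per task
    return results[argmax(first.(results))]  # overall (max, argmax)
end
```

Calling tmax_argmax(i -> sin(i) * log(i), 1:1000) returns both the maximum and the i attaining it, allocating only one small pair per task.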

My solution is to use Tullio.jl

julia> using Tullio

julia> f(i) = cos(i) * log(i);

julia> threadmax(f, v) = @tullio (max) out := f(v[i]) threads=length(v)÷Threads.nthreads()
threadmax (generic function with 1 method)

julia> @btime threadmax(f, 1:1000)
  10.030 μs (91 allocations: 4.25 KiB)
6.904336399050647

For comparison, here are Mose’s results on my computer:

julia> @btime find_max_threads_acc()
  12.080 μs (31 allocations: 2.84 KiB)
6.892393151350101

julia> @btime find_max_no_threads()
  14.920 μs (0 allocations: 0 bytes)
6.892393151350101

On my system, timings with 4 threads give the opposite result

julia> @btime threadmax(f, 1:1000)
  9.807 μs (92 allocations: 4.28 KiB)
6.904336399050647

julia> @btime find_max_threads_acc()
  9.408 μs (23 allocations: 1.98 KiB)
6.904336399050647

julia> @btime find_max_threads_atomic()
  8.599 μs (21 allocations: 1.84 KiB)
6.904336399050647

but the Tullio-based solution is definitely nicer to write

That example is a little small for threading.

julia> Threads.nthreads(), Sys.CPU_THREADS
(36, 36)

julia> @btime threadmax(f, 1:1000)
  54.724 μs (820 allocations: 38.41 KiB)
6.892393151350101

julia> @btime find_max_threads_acc()
  75.373 μs (182 allocations: 16.47 KiB)
6.892393151350101

julia> @btime find_max_no_threads()
  18.232 μs (0 allocations: 0 bytes)
6.892393151350101

julia> function avxmax(f, x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, f(x[i]))
           end
           m
       end
avxmax (generic function with 1 method)

julia> @btime avxmax(f, 1:1000)
  2.324 μs (0 allocations: 0 bytes)
6.8923931513501

SIMD is much faster here.


Aha, I had tried to get LoopVectorization working for this, but it was unhappy with how I wrote the loop. I meant to open an issue but didn’t get around to it.

An issue would be welcome. I’m curious what went wrong.

Opened an issue and then closed it because I realized what I did wrong.

For what it’s worth, I don’t see much performance advantage over the naive single threaded implementation on my computer (no avx 512 :frowning_face: )

julia> function avxmax(f, x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, f(x[i]))
           end
           m
       end
avxmax (generic function with 2 methods)

julia> @btime avxmax(f, 1:1000)
  11.770 μs (0 allocations: 0 bytes)
6.9043363990506466

I do see a benefit from manually inlining the function though (presumably so that LV can replace cos and log with their fast versions):

julia> function avxmax_f(x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, cos(x[i]) * log(x[i]))
           end
           m
       end;

julia> @btime avxmax_f(1:1000)
  8.036 μs (1 allocation: 16 bytes)
6.9043363990506466

=(

Mind running this:

julia> using VectorizationBase, SLEEFPirates

julia> vx = Vec(ntuple(_ -> 2rand(), VectorizationBase.pick_vector_width_val(Float64))...)
Vec{8,Float64}<1.1108350688792568, 1.340642196880145, 0.8214256639761288, 0.5208345017905756, 0.0878112025990947, 1.151752997769079, 1.1212221754046716, 0.9869372570799633>

julia> @benchmark sin($(Ref(vx))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.295 ns (0.00% GC)
  median time:      5.495 ns (0.00% GC)
  mean time:        5.506 ns (0.00% GC)
  maximum time:     15.802 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark log($(Ref(vx))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.866 ns (0.00% GC)
  median time:      7.895 ns (0.00% GC)
  mean time:        7.905 ns (0.00% GC)
  maximum time:     10.776 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> vxt = ntuple(vx, length(vx));

julia> @benchmark sin.($(Ref(vxt))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     45.142 ns (0.00% GC)
  median time:      48.400 ns (0.00% GC)
  mean time:        48.396 ns (0.00% GC)
  maximum time:     51.450 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> @benchmark log.($(Ref(vxt))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     41.010 ns (0.00% GC)
  median time:      42.421 ns (0.00% GC)
  mean time:        42.233 ns (0.00% GC)
  maximum time:     52.753 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

I’m curious if one or both happen to be much slower on your computer. Maybe there’s something obvious to fix in SLEEFPirates.jl.


It’s not actually replacing any special functions syntactically at the moment (i.e., it’s not cos -> cos_fast), just via dispatch.
But inlining still helps special functions in a loop, because there are a lot of constants used for evaluating them (polynomial coefficients in particular). If they get inlined, you get to hoist loading all of them out of the loop.


Here’s what I get

julia> using VectorizationBase, SLEEFPirates

julia> vx = Vec(ntuple(_ -> 2rand(), VectorizationBase.pick_vector_width_val(Float64))...)
Vec{4,Float64}<0.9671328433704423, 0.3756828607095555, 0.5049952428293949, 1.6386679997110134>

julia> @benchmark sin($(Ref(vx))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.147 ns (0.00% GC)
  median time:      9.208 ns (0.00% GC)
  mean time:        9.606 ns (0.00% GC)
  maximum time:     19.309 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark log($(Ref(vx))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.109 ns (0.00% GC)
  median time:      10.350 ns (0.00% GC)
  mean time:        10.423 ns (0.00% GC)
  maximum time:     18.939 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> vxt = ntuple(vx, length(vx));

julia> @benchmark sin.($(Ref(vxt))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.345 ns (0.00% GC)
  median time:      17.735 ns (0.00% GC)
  mean time:        20.624 ns (0.00% GC)
  maximum time:     53.176 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark log.($(Ref(vxt))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.681 ns (0.00% GC)
  median time:      22.820 ns (0.00% GC)
  mean time:        25.763 ns (0.00% GC)
  maximum time:     56.024 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996


Interesting. The scalar versions are about equally fast per element (4 per 17-22 ns vs 8 per 41-45 ns), but the SIMD version is more than 2x slower (4 per 9-10 ns vs 8 per 5-7 ns).

I guess this is likely mostly a Zen1 problem, like when we had to special-case reduction handling for Zen1 when doing the sum benchmarks.
For anyone reading who is unfamiliar with the problem: Zen1 has half-rate AVX2 (it didn’t actually have 256-bit units, but 128-bit units working together to emulate 256 bits).
Hence, while scalar performance is similar, it takes almost twice as long to evaluate the SIMD code (9-10 vs 5-7 ns). And compared to a computer with AVX512, evaluating that SIMD code also gets only half as much work done (4 instead of 8 elements).

I guess the inlined version is at least approaching 2x faster.

Out of curiosity, does avxmax(f, x) match avxmax_f when you define:

@inline f(i) = cos(i) * log(i);

?

EDIT:
I was benchmarking Mose’s f that used sin earlier. Seems cos is faster:

julia> @inline f_inline(i) = cos(i) * log(i);

julia> @inline f(i) = cos(i) * log(i);

julia> @btime avxmax(f, 1:1000)
  1.940 μs (0 allocations: 0 bytes)
6.9043363990506466

julia> @btime avxmax(f_inline, 1:1000)
  1.939 μs (0 allocations: 0 bytes)
6.9043363990506466

julia> function avxmax_f(x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, cos(x[i]) * log(x[i]))
           end
           m
       end
avxmax_f (generic function with 1 method)

julia> @btime avxmax_f(1:1000)
  1.894 μs (0 allocations: 0 bytes)
6.9043363990506466

Yes

julia> @inline f(i) = cos(i) * log(i);

julia> @btime avxmax(f, 1:1000)
  8.037 μs (0 allocations: 0 bytes)
6.9043363990506466

FYI there’s ThreadsX.maximum: GitHub - tkf/ThreadsX.jl (parallelized Base functions)

I’d also recommend using data-parallel frameworks (e.g., Tullio.jl, JuliaFolds/*.jl, etc.) instead of @threads for, which tightly couples the execution mechanism to the algorithm. That coupling makes switching to GPU and/or distributed computing hard, and it is very easy to introduce concurrency bugs. In the case of the example in the OP, it’d mean using mapreduce(f, max, 1:100) (if you don’t already have a parallelized version of maximum).
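A small note on that mapreduce formulation: since tuples compare lexicographically, mapping to (value, index) pairs and reducing with max yields the maximum and an index attaining it in one pass, which covers the argmax requirement from earlier in the thread. A sketch using Base’s serial mapreduce (ThreadsX.mapreduce should accept the same arguments for a threaded version):

```julia
# Tuples compare lexicographically, so reducing (value, index) pairs with
# `max` returns the maximum together with an index attaining it.
f(i) = sin(i) * log(i)
best_val, best_i = mapreduce(i -> (f(i), i), max, 1:100)
```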
