 # @threads for loop to find maximum of function

The following code causes a type instability. Is there a better way to write this code? Perhaps using Threads.Atomic?

```julia
max_obj = -Inf64
xs = 0.0
Threads.@threads for i in 1:100
    obj, x = f(i)
    if obj >= max_obj
        max_obj = obj
        xs = x
    end
end
```

How can this work with multi-threading? Also, why is there a type instability?

The goal is to find the value of i that maximizes f(i) in parallel with multi-threading without explicitly allocating an array to store all the values of f(i).

I understand the goal, but `max_obj` and `xs` are scalar variables which are read and modified by multiple threads at the same time.
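As for the type-instability part of the question: `@threads` wraps the loop body in a closure, and a variable like `max_obj` that is assigned both inside and outside that closure gets stored in an untyped `Core.Box`, so its uses infer as `Any`. A minimal sketch of the same boxing effect (the function name here is made up for illustration):

```julia
# Hypothetical minimal example of the boxing that @threads triggers:
# assigning to a captured variable inside a closure boxes it.
function boxed_capture()
    m = -Inf
    update!() = (m = 1.0)  # inner closure assigns to the captured `m`
    update!()
    return m               # poorly inferred, because `m` lives in a Core.Box
end

boxed_capture()  # returns 1.0, but @code_warntype shows `m::Core.Box`
```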

Yes, I understand that a race condition could occur, so I’m wondering how to do this correctly without explicitly allocating any memory?

For example, I could do

```julia
fv = zeros(100)
x = zeros(100)
Threads.@threads for i in 1:100
    fv[i], x[i] = f(i)
end
maximum(fv)
```

but I want to avoid the overhead of allocating the arrays `fv` and `x`.

Ok, so you want to parallelise only the evaluation of the function. Note that you don’t need to allocate an accumulator of 100 elements; you only need one of length `nthreads()`:

```julia
julia> using Base.Threads, BenchmarkTools

julia> f(i) = sin(i) * log(i)
f (generic function with 1 method)

julia> function find_max_threads_acc()
           max_xs = fill(-Inf, nthreads())
           @threads for i in 1:1000
               x = f(i)
               if x > max_xs[threadid()]
                   max_xs[threadid()] = x
               end
           end
           return maximum(max_xs)
       end
find_max_threads_acc (generic function with 1 method)

julia> @btime find_max_threads_acc()
  9.303 μs (22 allocations: 1.97 KiB)
6.892393151350101
```

If you want to use atomics, you can do for example

```julia
julia> function find_max_threads_atomic()
           max_x = Atomic{Float64}(-Inf)
           @threads for i in 1:1000
               x = f(i)
               if x > max_x[]
                   max_x[] = x
               end
           end
           return max_x[]
       end
find_max_threads_atomic (generic function with 1 method)

julia> @btime find_max_threads_atomic()
  8.513 μs (22 allocations: 1.88 KiB)
6.892393151350101
```

```julia
julia> function find_max_no_threads()
           max_x = -Inf
           for i in 1:1000
               x = f(i)
               if x > max_x
                   max_x = x
               end
           end
           return max_x
       end
find_max_no_threads (generic function with 1 method)

julia> @btime find_max_no_threads()
  22.896 μs (0 allocations: 0 bytes)
6.892393151350101
```

I’m sure someone will come up with other, nicer solutions using higher-level packages.


Thank you! I’m wondering if the following can cause a race condition:

```julia
if x > max_x[]
    max_x[] = x
end
```

Let’s say we have max_x < x1 < x2, and we compute x1 and x2 on different threads. Both could pass the `if` check at the same time, before max_x is updated; if max_x := x2 happens first and then max_x := x1, we end up with the wrong answer.

I know we could use `Threads.atomic_max!` instead, but I also want to find the argmax that maximizes the function, so that doesn’t solve the issue.
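One workaround that occurs to me is to extend the per-thread accumulator idea to a `(value, index)` pair, so each thread tracks its own running argmax and there is nothing shared to race on (a sketch, with a made-up function name):

```julia
using Base.Threads

f(i) = sin(i) * log(i)

# Sketch: per-thread (value, index) accumulators. Only nthreads() pairs are
# allocated, and no state is shared between threads during the loop.
function find_argmax_threads(f, r)
    best = [(-Inf, first(r)) for _ in 1:nthreads()]
    @threads for i in r
        x = f(i)
        id = threadid()
        if x > best[id][1]
            best[id] = (x, i)
        end
    end
    # Serial reduction over the per-thread results; tuples compare
    # lexicographically, so the value is compared first.
    return maximum(best)
end
```

`find_argmax_threads(f, 1:1000)` then returns both the maximum and the `i` achieving it.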

My solution is to use Tullio.jl

```julia
julia> using Tullio

julia> f(i) = cos(i) * log(i);

julia> threadmax(f, xs) = @tullio (max) m := f(xs[i])
threadmax (generic function with 1 method)

julia> @btime threadmax(f, 1:1000)
  10.030 μs (91 allocations: 4.25 KiB)
6.904336399050647
```

For comparison, here are Mose’s results on my computer:

```julia
julia> @btime find_max_threads_acc()
  12.080 μs (31 allocations: 2.84 KiB)
6.892393151350101

julia> @btime find_max_no_threads()
  14.920 μs (0 allocations: 0 bytes)
6.892393151350101
```

On my system, timings with 4 threads give the opposite result

```julia
julia> @btime threadmax(f, 1:1000)
  9.807 μs (92 allocations: 4.28 KiB)
6.904336399050647

julia> @btime find_max_threads_acc()
  9.408 μs (23 allocations: 1.98 KiB)
6.904336399050647

julia> @btime find_max_threads_atomic()
  8.599 μs (21 allocations: 1.84 KiB)
6.904336399050647
```

but the Tullio-based solution is definitely nicer to write

That example is a little small for threading.

```julia
julia> Threads.nthreads(), Sys.CPU_THREADS
(36, 36)

julia> @btime threadmax(f, 1:1000)
  54.724 μs (820 allocations: 38.41 KiB)
6.892393151350101

julia> @btime find_max_threads_acc()
  75.373 μs (182 allocations: 16.47 KiB)
6.892393151350101

julia> @btime find_max_no_threads()
  18.232 μs (0 allocations: 0 bytes)
6.892393151350101

julia> using LoopVectorization

julia> function avxmax(f, x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, f(x[i]))
           end
           m
       end
avxmax (generic function with 1 method)

julia> @btime avxmax(f, 1:1000)
  2.324 μs (0 allocations: 0 bytes)
6.8923931513501
```

SIMD is much faster here.


Aha, I had tried to get LoopVectorization working for this, but it was unhappy with how I wrote the loop. I meant to open an issue but didn’t get around to it.

An issue would be welcome. I’m curious what went wrong.

Opened an issue and then closed it because I realized what I did wrong.

For what it’s worth, I don’t see much performance advantage over the naive single-threaded implementation on my computer (no AVX-512):

```julia
julia> function avxmax(f, x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, f(x[i]))
           end
           m
       end
avxmax (generic function with 2 methods)

julia> @btime avxmax(f, 1:1000)
  11.770 μs (0 allocations: 0 bytes)
6.9043363990506466
```

I do see a benefit from manually inlining the function, though (presumably so that LV can replace `cos` and `log` with their fast versions):

```julia
julia> function avxmax_f(x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, cos(x[i]) * log(x[i]))
           end
           m
       end;

julia> @btime avxmax_f(1:1000)
  8.036 μs (1 allocation: 16 bytes)
6.9043363990506466
```

=(

Mind running this:

```julia
julia> using VectorizationBase, SLEEFPirates

julia> vx = Vec(ntuple(_ -> 2rand(), VectorizationBase.pick_vector_width_val(Float64))...)
Vec{8,Float64}<1.1108350688792568, 1.340642196880145, 0.8214256639761288, 0.5208345017905756, 0.0878112025990947, 1.151752997769079, 1.1212221754046716, 0.9869372570799633>

julia> @benchmark sin($(Ref(vx))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.295 ns (0.00% GC)
  median time:      5.495 ns (0.00% GC)
  mean time:        5.506 ns (0.00% GC)
  maximum time:     15.802 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark log($(Ref(vx))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.866 ns (0.00% GC)
  median time:      7.895 ns (0.00% GC)
  mean time:        7.905 ns (0.00% GC)
  maximum time:     10.776 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> vxt = ntuple(vx, length(vx));

julia> @benchmark sin.($(Ref(vxt))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     45.142 ns (0.00% GC)
  median time:      48.400 ns (0.00% GC)
  mean time:        48.396 ns (0.00% GC)
  maximum time:     51.450 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> @benchmark log.($(Ref(vxt))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     41.010 ns (0.00% GC)
  median time:      42.421 ns (0.00% GC)
  mean time:        42.233 ns (0.00% GC)
  maximum time:     52.753 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990
```

I’m curious if one or both happen to be much slower on your computer. Maybe there’s something obvious to fix in SLEEFPirates.jl.


It’s not actually replacing any special functions syntactically at the moment (i.e., it’s not `cos` -> `cos_fast`), just via dispatch.
But inlining still helps special functions in a loop, because there are a lot of constants used for evaluating them (polynomial coefficients in particular). If they get inlined, you get to hoist the loading of all of them out of the loop.


Here’s what I get

```julia
julia> using VectorizationBase, SLEEFPirates

julia> vx = Vec(ntuple(_ -> 2rand(), VectorizationBase.pick_vector_width_val(Float64))...)
Vec{4,Float64}<0.9671328433704423, 0.3756828607095555, 0.5049952428293949, 1.6386679997110134>

julia> @benchmark sin($(Ref(vx))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.147 ns (0.00% GC)
  median time:      9.208 ns (0.00% GC)
  mean time:        9.606 ns (0.00% GC)
  maximum time:     19.309 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark log($(Ref(vx))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.109 ns (0.00% GC)
  median time:      10.350 ns (0.00% GC)
  mean time:        10.423 ns (0.00% GC)
  maximum time:     18.939 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> vxt = ntuple(vx, length(vx));

julia> @benchmark sin.($(Ref(vxt))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.345 ns (0.00% GC)
  median time:      17.735 ns (0.00% GC)
  mean time:        20.624 ns (0.00% GC)
  maximum time:     53.176 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark log.($(Ref(vxt))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.681 ns (0.00% GC)
  median time:      22.820 ns (0.00% GC)
  mean time:        25.763 ns (0.00% GC)
  maximum time:     56.024 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996
```

Interesting. The scalar versions are about equally fast per element (4 per 17-22 ns vs 8 per 41-45 ns), but the SIMD version is > 2x slower (4 per 9-10 ns vs 8 per 5-7 ns).

I guess this is likely mostly a Zen 1 problem, like when we had to special-case reduction handling for `Zen1` when doing the `sum` benchmarks.
For anyone reading who is unfamiliar with the problem: Zen 1 has half-rate AVX2 (it didn’t actually have 256-bit units, but 128-bit units working together to emulate 256 bits).
Hence, while scalar performance is similar, it takes almost twice as long to evaluate the SIMD code (9-10 vs 5-7 ns). And compared to a computer with AVX-512, that SIMD code is also getting only half as much work done per vector (4 instead of 8 elements).

I guess the inlined version is at least approaching 2x faster.

Out of curiosity, does `avxmax(f, x)` match `avxmax_f` when you define:

```julia
@inline f(i) = cos(i) * log(i);
```

?

EDIT:
I was benchmarking Mose’s `f` that used `sin` earlier. Seems `cos` is faster:

```julia
julia> @inline f_inline(i) = cos(i) * log(i);

julia> @inline f(i) = cos(i) * log(i);

julia> @btime avxmax(f, 1:1000)
  1.940 μs (0 allocations: 0 bytes)
6.9043363990506466

julia> @btime avxmax(f_inline, 1:1000)
  1.939 μs (0 allocations: 0 bytes)
6.9043363990506466

julia> function avxmax_f(x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, cos(x[i]) * log(x[i]))
           end
           m
       end
avxmax_f (generic function with 1 method)

julia> @btime avxmax_f(1:1000)
  1.894 μs (0 allocations: 0 bytes)
6.9043363990506466
```

Yes

```julia
julia> @inline f(i) = cos(i) * log(i);

julia> @btime avxmax(f, 1:1000)
  8.037 μs (0 allocations: 0 bytes)
6.9043363990506466
```

FYI there’s `ThreadsX.maximum`: GitHub - tkf/ThreadsX.jl: Parallelized Base functions

I’d also recommend using data-parallel frameworks (e.g., Tullio.jl, JuliaFolds/*.jl, etc.) instead of `@threads for`, which tightly couples the execution mechanism to the algorithm: it makes switching to GPU and/or distributed computing hard, and it’s very easy to introduce concurrency bugs. In the case of the example in the OP, this would mean using `mapreduce(f, max, 1:100)` (if you don’t already have a parallelized `maximum`).
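For the OP’s concrete case, the serial form of that `mapreduce` (extended to also return the argmax) looks like this; swapping in a data-parallel framework’s `mapreduce`, e.g. `ThreadsX.mapreduce`, parallelizes it without changing the algorithm (a sketch, using the `sin`-based `f` from earlier in the thread):

```julia
f(i) = sin(i) * log(i)

# Reduce over (value, index) pairs; `max` on tuples compares the value first,
# so this yields both the maximum and its argmax in one pass.
val, idx = mapreduce(i -> (f(i), i), max, 1:1000)

# With ThreadsX.jl installed, the parallel version is the same expression:
# using ThreadsX
# val, idx = ThreadsX.mapreduce(i -> (f(i), i), max, 1:1000)
```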
