There were a few problems with your `@threads` benchmark.

A minor problem is that `@inbounds` doesn’t penetrate the closure, so you need to write it inside.

A much bigger problem is that the closure introduces a type instability, making it slow when single-threaded and wrong when multithreaded.
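To illustrate the type instability, here is a minimal sketch using a plain closure rather than `@threads`. The names `boxed_sum` and `ref_sum` are hypothetical and are not taken from your benchmark; they only show the general pattern, where a local variable that is reassigned inside a closure gets boxed and loses its concrete type. With several threads, that same shared variable is also updated concurrently, which is where the wrong answers come from.

```julia
# Hypothetical stand-in for the problematic pattern: `s` is reassigned inside
# a closure, so Julia stores it in a `Core.Box` and can no longer infer its type.
function boxed_sum(v)
    s = 0.0
    foreach(x -> (s += sin(x)), v)
    return s
end

# One way to avoid the box: never rebind the captured variable, and mutate a
# concretely typed container instead.
function ref_sum(v)
    s = Ref(0.0)
    foreach(x -> (s[] += sin(x)), v)
    return s[]
end

v = randn(1_000)

# Ask the compiler what it infers, without timing anything.
Base.return_types(boxed_sum, (Vector{Float64},))
Base.return_types(ref_sum, (Vector{Float64},))
```

Running `@code_warntype boxed_sum(v)` in the REPL should flag the boxed variable. The per-thread accumulator array used in the fixed functions below follows the same idea as `ref_sum`: the closure mutates an array and never rebinds a captured local.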

So let's fix those and run a few benchmarks (throwing `CheapThreads.jl` and `LoopVectorization.jl` into the mix):

```
julia> using BenchmarkTools, CheapThreads, LoopVectorization
julia> Threads.nthreads()
1
julia> function single_thread(v)
           s = 0.0
           @inbounds for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_thread (generic function with 1 method)
julia> function single_thread_mt(v)
           s = zeros(8,Threads.nthreads())
           Threads.@threads for i = 1:length(v)
               @inbounds s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_thread_mt (generic function with 1 method)
julia> function single_batch(v)
           s = zeros(8,Threads.nthreads())
           @batch for i = 1:length(v)
               s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_batch (generic function with 1 method)
julia> function single_avxt(v)
           s = 0.0
           @avxt for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_avxt (generic function with 1 method)
julia> v = randn(100000);
julia> @btime single_thread($v)
977.804 μs (0 allocations: 0 bytes)
102.03161503504539
julia> @btime single_thread_mt($v)
991.968 μs (7 allocations: 704 bytes)
102.03161503504539
julia> @btime single_batch($v)
1.016 ms (1 allocation: 144 bytes)
102.03161503504539
julia> @btime single_avxt($v)
70.348 μs (0 allocations: 0 bytes)
102.03161503504623
julia> 977.8 / 70.348
13.899471200318416
julia> versioninfo()
Julia Version 1.7.0-DEV.802
Commit 8d998dc8ec* (2021-04-02 13:34 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
```

The `@threads` version’s overhead is now fairly negligible.

`@avxt` should impose no overhead vs `@avx` if you started Julia with a single thread.

On top of that, it’s compatible with the original loop, not requiring the adjustments necessary to make `@threads` or `CheapThreads.@batch` work correctly.

Better still, `@avxt` is over 13x faster than the original loop on my AVX512 system.

It happened to have slightly lower error, too:

```
julia> sum(sin ∘ big, v) |> Float64
102.03161503504595
julia> 102.03161503504539 - ans, 102.03161503504623 - ans
(-5.542233338928781e-13, 2.8421709430404007e-13)
```

This is expected, as LoopVectorization should generally accumulate less rounding error through the reduction than the naive serial sum.
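As a rough illustration of why, here is a plain-Julia sketch that mimics a vectorized reduction by keeping several independent partial sums. It is not LoopVectorization's actual implementation; `serial_sum`, `lanes_sum`, and the `lanes` keyword are hypothetical names, and 8 lanes is only an assumption meant to resemble one AVX512 register of `Float64`s. Each partial sum sees about `1/lanes` of the additions and stays smaller in magnitude, which tends to reduce accumulated rounding error, although this is not guaranteed for every input.

```julia
# Naive serial reduction: a single accumulator absorbs every addition.
function serial_sum(f, v)
    s = 0.0
    for x in v
        s += f(x)
    end
    return s
end

# Sketch of a vectorized-style reduction: `lanes` independent partial sums,
# combined once at the end, plus a scalar loop for the leftover elements.
function lanes_sum(f, v; lanes::Int = 8)
    acc = zeros(lanes)
    i = firstindex(v)
    stop = lastindex(v)
    while i + lanes - 1 <= stop
        for k in 1:lanes
            acc[k] += f(v[i + k - 1])
        end
        i += lanes
    end
    s = sum(acc)
    while i <= stop          # remainder that does not fill a full chunk
        s += f(v[i])
        i += 1
    end
    return s
end

v = randn(100_000)
ref = Float64(sum(sin ∘ big, v))   # high-precision reference, as above

# Signed errors of the two strategies relative to the reference.
(serial_sum(sin, v) - ref, lanes_sum(sin, v) - ref)
```

The exact errors depend on the random input, so no expected values are shown here.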

Multithreaded benchmarks:

```
julia> using BenchmarkTools, CheapThreads, LoopVectorization
julia> Threads.nthreads()
36
julia> function single_thread(v)
           s = 0.0
           @inbounds for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_thread (generic function with 1 method)
julia> function single_thread_mt(v)
           s = zeros(8,Threads.nthreads())
           Threads.@threads for i = 1:length(v)
               @inbounds s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_thread_mt (generic function with 1 method)
julia> function single_batch(v)
           s = zeros(8,Threads.nthreads())
           @batch for i = 1:length(v)
               s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_batch (generic function with 1 method)
julia> function single_avxt(v)
           s = 0.0
           @avxt for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_avxt (generic function with 1 method)
julia> v = randn(100000);
julia> @btime single_thread($v)
986.643 μs (0 allocations: 0 bytes)
152.32081377185875
julia> @btime single_thread_mt($v)
99.950 μs (182 allocations: 18.50 KiB)
152.32081377185966
julia> @btime single_batch($v)
40.756 μs (1 allocation: 2.38 KiB)
152.32081377185966
julia> @btime single_avxt($v)
4.938 μs (0 allocations: 0 bytes)
152.32081377185906
```

`@batch` is well ahead of `@threads`, and of course `@avxt` leaves the others far behind as before – 5 microseconds is pretty good for multithreaded code! – while again being the most accurate:

```
julia> sum(sin ∘ big, v) |> Float64
152.32081377185915
julia> 152.32081377185875 - ans, 152.32081377185966 - ans, 152.32081377185906 - ans
(-3.979039320256561e-13, 5.115907697472721e-13, -8.526512829121202e-14)
```

I’d also be interested in seeing how the Ryzen 5950X compares to the Intel 10980XE here.