There were a few problems with your @threads benchmark.
A minor problem is that @inbounds doesn’t penetrate the closure that @threads creates, so you need to write it inside the loop body.
A much bigger problem is that the closure introduces a type instability, making the code slow when single-threaded and wrong when multithreaded.
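To make the type-instability point concrete, here is a minimal sketch. It assumes the original loop did `s += ...` on a variable defined outside the threaded loop; I'm using `foreach` with an anonymous function as a stand-in for the closure that @threads builds, and `boxed_sum` / `ref_sum` are just illustrative names. Reassigning a captured variable forces Julia to box it, so inference typically gives up on its type. Going through a `Ref` (or an array slot, as in the benchmarks) avoids the reassignment.

```julia
# Stand-in for the problematic pattern: the closure *reassigns* the captured
# variable `s`, so `s` gets boxed and its type is no longer inferred.
function boxed_sum(v)
    s = 0.0
    foreach(x -> (s += sin(x)), v)
    return s
end

# Same computation, but the closure only mutates the contents of a Ref.
# `s` itself is never reassigned, so it stays concretely typed.
function ref_sum(v)
    s = Ref(0.0)
    foreach(x -> (s[] += sin(x)), v)
    return s[]
end

v = collect(range(0.0, 10.0; length = 1_000))
boxed_sum(v), ref_sum(v)  # same value; compare them with @code_warntype
```

Note this only addresses the inference problem. With more than one thread, a single shared accumulator is still a data race, which is why the versions below give each thread its own slot.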
So let’s fix those, and run a few benchmarks (throwing CheapThreads.jl and LoopVectorization.jl into the mix):
julia> using BenchmarkTools, CheapThreads, LoopVectorization
julia> Threads.nthreads()
1
julia> function single_thread(v)
           s = 0.0
           @inbounds for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_thread (generic function with 1 method)
julia> function single_thread_mt(v)
           s = zeros(8,Threads.nthreads())
           Threads.@threads for i = 1:length(v)
               @inbounds s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_thread_mt (generic function with 1 method)
julia> function single_batch(v)
           s = zeros(8,Threads.nthreads())
           @batch for i = 1:length(v)
               s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_batch (generic function with 1 method)
julia> function single_avxt(v)
           s = 0.0
           @avxt for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_avxt (generic function with 1 method)
julia> v = randn(100000);
julia> @btime single_thread($v)
  977.804 μs (0 allocations: 0 bytes)
102.03161503504539
julia> @btime single_thread_mt($v)
  991.968 μs (7 allocations: 704 bytes)
102.03161503504539
julia> @btime single_batch($v)
  1.016 ms (1 allocation: 144 bytes)
102.03161503504539
julia> @btime single_avxt($v)
  70.348 μs (0 allocations: 0 bytes)
102.03161503504623
julia> 977.8 / 70.348
13.899471200318416
julia> versioninfo()
Julia Version 1.7.0-DEV.802
Commit 8d998dc8ec* (2021-04-02 13:34 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36
With a single thread, the @threads version’s overhead is now fairly negligible.
@avxt should impose no overhead vs @avx if you started Julia with a single thread.
On top of that, it’s compatible with the original loop, not requiring the adjustments necessary to make @threads or CheapThreads.@batch work correctly.
It is also over 13x faster than the original loop on my AVX512 system.
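For comparison, here is a different way to make the plain threaded version correct. This is not what the benchmarks above use, just a sketch of an alternative: split the index range into chunks, spawn one task per chunk, and let each task keep its own local accumulator. `chunked_sum` and its `ntasks` keyword are names I made up for illustration. Since nothing is indexed by `Threads.threadid()`, there is no shared accumulator array and no padding to worry about.

```julia
# One task per contiguous chunk; each task sums into a local variable.
function chunked_sum(f, v::Vector{Float64}; ntasks = Threads.nthreads())
    n = length(v)
    n == 0 && return 0.0
    chunk = cld(n, ntasks)                  # chunk length, rounded up
    tasks = map(1:chunk:n) do lo
        hi = min(lo + chunk - 1, n)
        Threads.@spawn begin
            s = 0.0                         # local to this task, never shared
            @inbounds for i in lo:hi        # @inbounds written inside the closure
                s += f(v[i])
            end
            s
        end
    end
    return sum(fetch, tasks)                # combine the per-chunk partial sums
end

v = collect(range(0.0, 10.0; length = 100_000))
chunked_sum(sin, v)
```

I haven't benchmarked this against the versions above, so treat it as a correctness sketch rather than a performance claim.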
@avxt also happened to have slightly lower error:
julia> sum(sin ∘ big, v) |> Float64
102.03161503504595
julia> 102.03161503504539 - ans, 102.03161503504623 - ans
(-5.542233338928781e-13, 2.8421709430404007e-13)
This is expected, as LoopVectorization should generally accumulate less rounding error through the reduction than the naive serial sum does.
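If you want to play with the rounding behaviour yourself, here is a rough sketch. It is not LoopVectorization's actual reduction, only an imitation of the idea: a vectorized sum keeps several independent partial sums (one per SIMD lane, possibly several vectors' worth) and combines them at the end, so each accumulator sees fewer additions. `serial_sum`, `lanes_sum`, and the default of 8 lanes are my own illustrative choices.

```julia
# Naive left-to-right accumulation into a single Float64.
function serial_sum(f, v)
    s = 0.0
    for x in v
        s += f(x)
    end
    return s
end

# Round-robin accumulation into `nlanes` independent partial sums,
# combined at the end (a crude model of a SIMD reduction).
function lanes_sum(f, v, nlanes = 8)
    acc = zeros(nlanes)
    for (i, x) in enumerate(v)
        acc[mod1(i, nlanes)] += f(x)
    end
    return sum(acc)
end

v = randn(100_000)
ref = Float64(sum(sin ∘ big, v))            # high-precision reference
serial_sum(sin, v) - ref, lanes_sum(sin, v) - ref
```

On any particular input either variant can land closer to the reference, so it is worth trying a few different `v`.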
Multithreaded benchmarks:
julia> using BenchmarkTools, CheapThreads, LoopVectorization
julia> Threads.nthreads()
36
julia> function single_thread(v)
           s = 0.0
           @inbounds for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_thread (generic function with 1 method)
julia> function single_thread_mt(v)
           s = zeros(8,Threads.nthreads())
           Threads.@threads for i = 1:length(v)
               @inbounds s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_thread_mt (generic function with 1 method)
julia> function single_batch(v)
           s = zeros(8,Threads.nthreads())
           @batch for i = 1:length(v)
               s[1,Threads.threadid()] += sin(v[i])
           end
           return sum(view(s, 1, :))
       end
single_batch (generic function with 1 method)
julia> function single_avxt(v)
           s = 0.0
           @avxt for i = 1:length(v)
               s += sin(v[i])
           end
           return s
       end
single_avxt (generic function with 1 method)
julia> v = randn(100000);
julia> @btime single_thread($v)
  986.643 μs (0 allocations: 0 bytes)
152.32081377185875
julia> @btime single_thread_mt($v)
  99.950 μs (182 allocations: 18.50 KiB)
152.32081377185966
julia> @btime single_batch($v)
  40.756 μs (1 allocation: 2.38 KiB)
152.32081377185966
julia> @btime single_avxt($v)
  4.938 μs (0 allocations: 0 bytes)
152.32081377185906
@batch is well ahead of @threads (about 2.5x faster), and of course @avxt leaves the others far behind as before (roughly 8x faster than @batch and about 200x faster than the serial loop). 5 microseconds is pretty good for multithreaded code! It is again the most accurate, too:
julia> sum(sin ∘ big, v) |> Float64
152.32081377185915
julia> 152.32081377185875 - ans, 152.32081377185966 - ans, 152.32081377185906 - ans
(-3.979039320256561e-13, 5.115907697472721e-13, -8.526512829121202e-14)
I’d also be interested in seeing how the Ryzen 5950X compares to the Intel 10980XE here.