Could you try replacing `nchunks = Threads.nthreads()` with `nchunks = 2Threads.nthreads()` (or maybe just make it a function argument)? Two tasks per thread will balance the load better. This wasn't worth the overhead on my system, but it may be worth it on yours.
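
For reference, the chunking pattern being suggested might look roughly like the sketch below. The actual body of `tproduct_threads` isn't shown in this thread, so the outer-product kernel (guessed from the ≈7.6 MiB allocations of a 1000×1000 `Float64` matrix in the benchmarks) and the chunking structure are assumptions:

```julia
# Sketch only: the real tproduct_threads is not reproduced here. The kernel
# is assumed to be an outer product; `nchunks` is exposed as a keyword
# argument, defaulting to two tasks per thread as suggested above.
function tproduct_threads(x::AbstractVector{T}; nchunks = 2Threads.nthreads()) where {T}
    n = length(x)
    out = Matrix{T}(undef, n, n)
    # Split the columns into `nchunks` ranges and spawn one task per range;
    # extra tasks let the scheduler rebalance when some finish early.
    @sync for cols in Iterators.partition(1:n, cld(n, nchunks))
        Threads.@spawn for j in cols
            @inbounds @simd for i in 1:n
                out[i, j] = x[i] * x[j]
            end
        end
    end
    return out
end
```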
Yeah, that gives it a slight edge over my implementation now:
```julia
julia> @benchmark tproduct_threads(x) setup=(x=rand(1000))
BenchmarkTools.Trial:
  memory estimate:  7.65 MiB
  allocs estimate:  82
  --------------
  minimum time:     407.578 μs (0.00% GC)
  median time:      507.603 μs (0.00% GC)
  mean time:        791.596 μs (27.10% GC)
  maximum time:     2.968 ms (80.05% GC)
  --------------
  samples:          6270
  evals/sample:     1

julia> @benchmark tproduct_avx(x) setup=(x=rand(1000))
BenchmarkTools.Trial:
  memory estimate:  7.66 MiB
  allocs estimate:  88
  --------------
  minimum time:     411.538 μs (0.00% GC)
  median time:      530.083 μs (0.00% GC)
  mean time:        692.385 μs (22.83% GC)
  maximum time:     2.735 ms (42.88% GC)
  --------------
  samples:          7170
  evals/sample:     1
```
(The runtimes for my version improved here relative to the numbers above because I now use `@avx` on the broadcasts and updated LoopVectorization.jl.)
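
For anyone following along, here is a minimal sketch of what "`@avx` on the broadcasts" can look like; the exact `tproduct_avx` body isn't reproduced in this post, so the outer-product broadcast below is an assumption:

```julia
using LinearAlgebra        # for the vector adjoint x'
using LoopVectorization    # provides @avx

# Sketch only: assumes tproduct_avx is essentially an outer product,
# with @avx vectorizing the broadcast that fills `out`.
function tproduct_avx(x::AbstractVector{T}) where {T}
    n = length(x)
    out = Matrix{T}(undef, n, n)
    @avx out .= x .* x'    # @avx applied directly to the broadcast statement
    return out
end
```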