Sum operations between arrays

Could you try replacing nchunks = Threads.nthreads() with nchunks = 2 * Threads.nthreads() (or maybe just make nchunks a function argument)? Two tasks per thread can balance the load better, because a thread that finishes its chunk early can pick up another one instead of sitting idle. The extra task overhead wasn't worth it on my system, but it may be on yours.
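To make the suggestion concrete, here is a minimal sketch of what taking nchunks as an argument could look like. It is not the tproduct_threads from this thread: outer_sum_chunked and its body (filling out[i, j] = x[i] + x[j]) are hypothetical stand-ins, and only the chunking pattern is the point.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-in, NOT the thread's tproduct_threads.
# Fills out[i, j] = x[i] + x[j], splitting the columns into `nchunks` tasks.
function outer_sum_chunked(x::Vector{Float64}; nchunks::Integer = 2 * nthreads())
    n = length(x)
    out = Matrix{Float64}(undef, n, n)
    n == 0 && return out

    # Use at least one chunk and never more chunks than columns.
    k = max(1, min(nchunks, n))
    chunk_len = cld(n, k)
    ranges = [lo:min(lo + chunk_len - 1, n) for lo in 1:chunk_len:n]

    # One task per column range; the scheduler distributes tasks over threads.
    tasks = map(ranges) do cols
        @spawn begin
            for j in cols, i in 1:n
                @inbounds out[i, j] = x[i] + x[j]
            end
        end
    end
    foreach(wait, tasks)
    return out
end
```

With a keyword argument you can compare settings directly, for example @benchmark outer_sum_chunked(x; nchunks = Threads.nthreads()) setup=(x=rand(1000)) against the 2 * Threads.nthreads() default, and keep whichever is faster on your machine.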

Yeah, that gives it a slight edge over my implementation now:

julia> @benchmark tproduct_threads(x)  setup=(x=rand(1000))
BenchmarkTools.Trial: 
  memory estimate:  7.65 MiB
  allocs estimate:  82
  --------------
  minimum time:     407.578 μs (0.00% GC)
  median time:      507.603 μs (0.00% GC)
  mean time:        791.596 μs (27.10% GC)
  maximum time:     2.968 ms (80.05% GC)
  --------------
  samples:          6270
  evals/sample:     1

julia> @benchmark tproduct_avx(x)  setup=(x=rand(1000))
BenchmarkTools.Trial: 
  memory estimate:  7.66 MiB
  allocs estimate:  88
  --------------
  minimum time:     411.538 μs (0.00% GC)
  median time:      530.083 μs (0.00% GC)
  mean time:        692.385 μs (22.83% GC)
  maximum time:     2.735 ms (42.88% GC)
  --------------
  samples:          7170
  evals/sample:     1

(My runtimes here are better than the ones I posted above because I now use @avx on the broadcasts and have updated LoopVectorization.jl.)