Yes, on an old-ish i7-4790K (Haswell), the two have more comparable performance (median 19.7 μs for dotsimd vs. 17.4 μs for dotturbo):
julia> @benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 19.600 μs … 162.000 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 19.700 μs ┊ GC (median): 0.00%
Time (mean ± σ): 19.822 μs ± 2.164 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅ █ ▆ ▂ ▁ ▁
█▁▁█▁▁▁█▁▁█▁▁▁▇▁▁▄▁▁▁▁▁▁▄▁▁▁▇▁▁█▁▁▁█▁▁▇▁▁▁▅▁▁▅▁▁▁▅▁▁▃▁▁▁▇▁▁▆ █
19.6 μs Histogram: log(frequency) by time 21.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 17.300 μs … 32.100 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 17.400 μs ┊ GC (median): 0.00%
Time (mean ± σ): 17.434 μs ± 370.649 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆ █ ▆ ▁ ▁
█▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▆▁▁▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▇▁▁▁▁▁█▁▁▁▁▁▇ █
17.3 μs Histogram: log(frequency) by time 18.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.