[ANN] LoopVectorization

Note that 14 and 12 nm Ryzen chips can only do 1 full width fma per clock cycle (and 2 loads), so they should see similar performance with the dot and selfdot. I haven’t verified this, but would like to hear from anyone who can.

I have a Ryzen Threadripper 2950X, which the AMD site says is 12 nm. Here are my results for dot and selfdot:


julia> a = rand(256); b = rand(256);
julia> @btime mydot($a, $b)
  32.365 ns (0 allocations: 0 bytes)
julia> @btime mydotavx($a, $b)
  32.133 ns (0 allocations: 0 bytes)

julia> a = rand(43); b = rand(43);
julia> @btime mydot($a, $b)
  13.141 ns (0 allocations: 0 bytes)
julia> @btime mydotavx($a, $b)
  13.653 ns (0 allocations: 0 bytes)

julia> a = rand(256);
julia> @btime myselfdotavx($a)
  19.957 ns (0 allocations: 0 bytes)
julia> @btime myselfdot($a)
  21.234 ns (0 allocations: 0 bytes)
julia> @btime myselfdotavx($b)
  11.322 ns (0 allocations: 0 bytes)
julia> @btime myselfdot($b)
  11.674 ns (0 allocations: 0 bytes)
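
The definitions of the benchmarked functions are not shown in this post. As a minimal sketch, assuming they are plain reduction loops, the two non-`avx` versions could look like the code below. The function names come from the benchmark calls above, but the bodies are an assumption, and the `avx` variants are assumed to be the same loops with LoopVectorization's macro in place of `@inbounds @simd`. The sketch is mainly here to show the difference the quoted note relies on: each iteration of the dot product loads two values per multiply-add, while the self dot product loads only one.

```julia
# Hypothetical definitions; only the names are taken from the benchmarks above.
function mydot(a, b)
    s = zero(promote_type(eltype(a), eltype(b)))
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]   # two loads (a[i] and b[i]) per multiply-add
    end
    return s
end

function myselfdot(a)
    s = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        s += a[i] * a[i]   # one load (a[i]) per multiply-add
    end
    return s
end
```

On a core that can issue 2 loads but only 1 full-width fma per cycle, both loops would be limited by the fma rate under this model, which is the reasoning behind the "similar performance" expectation in the quote.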