Note that 14 nm and 12 nm Ryzen chips can only do 1 full-width fma per clock cycle (and 2 loads), so they should see similar performance with dot and selfdot. I haven’t verified this, but would like to hear from anyone who can.
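For anyone who wants to try this, here is a minimal sketch of the two kernels being compared. The names `mydot` and `myselfdot` mirror the ones used in this thread, but the bodies are my assumption of what plain-loop versions look like, not the original definitions, and the `avx` variants (which need LoopVectorization) are left out. The point is only that the dot product issues two loads per fma while the self dot product issues one, so a chip limited to one full-width fma per cycle should not gain much from the saved load.

```julia
# Plain-loop stand-ins for the functions benchmarked in this thread.
# The names follow the thread; the bodies are assumptions, not the original code.

function mydot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    s = zero(T)
    # Two loads (a[i] and b[i]) feed each fused multiply-add.
    @inbounds @simd for i in eachindex(a, b)
        s = muladd(a[i], b[i], s)
    end
    return s
end

function myselfdot(a::AbstractVector{T}) where {T}
    s = zero(T)
    # One load (a[i]) feeds each fused multiply-add.
    @inbounds @simd for i in eachindex(a)
        s = muladd(a[i], a[i], s)
    end
    return s
end
```

With BenchmarkTools loaded, these can be timed the same way as in the rest of the thread, e.g. `@btime mydot($a, $b)` and `@btime myselfdot($a)`.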
I have a Ryzen Threadripper 2950X, which the AMD site says is 12 nm. Here are my results for dot and selfdot:
julia> a = rand(256); b = rand(256);
julia> @btime mydot($a, $b)
32.365 ns (0 allocations: 0 bytes)
julia> @btime mydotavx($a, $b)
32.133 ns (0 allocations: 0 bytes)
julia> a = rand(43); b = rand(43);
julia> @btime mydot($a, $b)
13.141 ns (0 allocations: 0 bytes)
julia> @btime mydotavx($a, $b)
13.653 ns (0 allocations: 0 bytes)
julia> a = rand(256);
julia> @btime myselfdotavx($a)
19.957 ns (0 allocations: 0 bytes)
julia> @btime myselfdot($a)
21.234 ns (0 allocations: 0 bytes)
julia> @btime myselfdotavx($b)
11.322 ns (0 allocations: 0 bytes)
julia> @btime myselfdot($b)
11.674 ns (0 allocations: 0 bytes)