[ANN] LoopVectorization

cscherrer · January 2, 2020, 12:38am

Note that 14 nm and 12 nm Ryzen chips can only do one full-width FMA per clock cycle (and two loads), so they should see similar performance with dot and selfdot. I haven’t verified this, but would like to hear from anyone who can.

I have a Ryzen Threadripper 2950X, which the AMD site says is 12 nm. Here are my results for dot and selfdot:


julia> a = rand(256); b = rand(256);
julia> @btime mydot($a, $b)
  32.365 ns (0 allocations: 0 bytes)
julia> @btime mydotavx($a, $b)
  32.133 ns (0 allocations: 0 bytes)

julia> a = rand(43); b = rand(43);
julia> @btime mydot($a, $b)
  13.141 ns (0 allocations: 0 bytes)
julia> @btime mydotavx($a, $b)
  13.653 ns (0 allocations: 0 bytes)

julia> a = rand(256);
julia> @btime myselfdotavx($a)
  19.957 ns (0 allocations: 0 bytes)
julia> @btime myselfdot($a)
  21.234 ns (0 allocations: 0 bytes)
julia> @btime myselfdotavx($b)
  11.322 ns (0 allocations: 0 bytes)
julia> @btime myselfdot($b)
  11.674 ns (0 allocations: 0 bytes)
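
For readers without the earlier posts at hand: the benchmarked functions are not defined in this post, so the following is only a minimal sketch of what the baseline kernels presumably look like, assuming they are plain `@inbounds @simd` reduction loops under the same names. It shows why the quoted reasoning matters: the dot product needs two loads per multiply-add, while the self-dot needs only one, so a chip limited to one full-width FMA per cycle with two loads per cycle would not be expected to gain from the lighter load traffic.

```julia
# Sketch only: assumed definitions, not copied from this thread.

# Dot product: each iteration loads a[i] and b[i] (two loads per multiply-add).
function mydot(a, b)
    s = zero(promote_type(eltype(a), eltype(b)))
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

# Self dot product: each iteration loads only a[i] (one load per multiply-add).
function myselfdot(a)
    s = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        s += a[i] * a[i]
    end
    return s
end
```

The `avx` variants timed above would, under the same assumption, differ only in using LoopVectorization's `@avx` macro on the loop instead of `@inbounds @simd`. The timings themselves come from BenchmarkTools' `@btime`, with `$` interpolation of the arguments as shown in the session.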
