I’m at work now, where I only have easy access to avx2 (although I could use aws).
I wouldn’t say it makes AVX useless. If you’re really curious, you can take a look through Agner Fog’s instruction tables. Roughly in the middle of page 269, it says VGATHERQPD has a reciprocal throughput (RT) of about 5 clock cycles, meaning the CPU can do about 200 of them every 1000 cycles.
You can think of reciprocal throughput as the average cost per instruction when many of them are in flight: an RT of 0.5 means the core can retire two such instructions per cycle on average. The latency of any single instruction will generally be longer, but a core can normally work on several of them in parallel, so when code can execute out of order, reciprocal throughput is usually the better indicator. (Memory loads can be evaluated out of order here, since we’re not writing to memory.)
For comparison, on page 267, VMOVUPS/D operating on v,m (any vector register and memory) has an RT of 0.5 clock cycles – 10 times faster.
Additionally, all add/multiply/fused multiply add instructions have RTs of 0.5-1 cycle.
Gather is a steep price to pay: for vectorization to come out ahead, the rest of the vectorized loop has to save enough cycles to cover that cost.
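As a rough back-of-envelope using those numbers (assuming the avx2 ymm case, 4 doubles per vector; the exact figures vary by architecture):

# Ballpark per-element costs from the RT numbers above (avx2, 4 doubles per ymm).
# These are estimates, not measurements.
gather_cost = 5 / 4     # ≈ 1.25 cycles per double loaded via vgatherqpd
load_cost   = 0.5 / 4   # ≈ 0.125 cycles per double loaded via vmovupd
gather_cost / load_cost # ≈ 10× more expensive per element loaded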
For example, running on an AWS instance (to get avx512 at work):
julia> using LoopVectorization, BenchmarkTools
julia> A = rand(100,100); B1 = similar(A); B2 = similar(A);
julia> @benchmark @. $B1 = $A + $A'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.882 μs (0.00% GC)
  median time:      6.932 μs (0.00% GC)
  mean time:        6.986 μs (0.00% GC)
  maximum time:     20.675 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5
julia> @benchmark @avx @. $B2 = $A + $A'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.915 μs (0.00% GC)
  median time:      3.947 μs (0.00% GC)
  mean time:        3.967 μs (0.00% GC)
  maximum time:     7.984 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8
@avx uses vgatherqpd here, and does get a speedup.
However, it is slower on the avx2 computer I tried it on (7.5 μs without @avx vs 10 μs with it, instead of 6.9 vs 4 on the avx512 machine).
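Since the result really does depend on the machine, it’s worth rerunning the comparison locally; something like this (with a quick check that both give the same answer):

using LoopVectorization, BenchmarkTools
A = rand(100, 100); B1 = similar(A); B2 = similar(A);
@btime @. $B1 = $A + $A';       # plain broadcast
@btime @avx @. $B2 = $A + $A';  # LoopVectorization
B1 ≈ B2  # sanity check: both should compute the same result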
In your example, it tried to vectorize the inner loop, which had dim = 10. 10 is a little awkward for vectors of length 8.
Setting dim = 30 (again on AWS):
julia> maxdeg = 20; dim = 30;
julia> nbasis = 1_000;
julia> P = rand(dim, maxdeg + 1);
julia> function mvp(P, basis, coeffs::Vector{T}) where {T}
           len_c = length(coeffs)
           len_P = size(P, 1)
           p = zero(T)
           for n = 1:len_c
               pn = coeffs[n]
               for a = 1:len_P
                   pn *= P[a, basis[a, n]]
               end
               p += pn
           end
           return p
       end
mvp (generic function with 1 method)
julia> function mvpinbounds(P, basis, coeffs::Vector{T}) where {T}
           len_c = length(coeffs)
           len_P = size(P, 1)
           p = zero(T)
           @inbounds for n = 1:len_c
               pn = coeffs[n]
               for a = 1:len_P
                   pn *= P[a, basis[a, n]]
               end
               p += pn
           end
           return p
       end
mvpinbounds (generic function with 1 method)
julia> function mvpavx(P, basis, coeffs::Vector{T}) where {T}
           len_c = length(coeffs)
           len_P = size(P, 1)
           p = zero(T)
           @avx for n = 1:len_c
               pn = coeffs[n]
               for a = 1:len_P
                   pn *= P[a, basis[a, n]]
               end
               p += pn
           end
           return p
       end
mvpavx (generic function with 1 method)
julia> basis = rand(1:(maxdeg+1), (dim, nbasis));
julia> coeffs = rand(nbasis);
julia> @btime mvp($P, $basis, $coeffs)
40.369 μs (0 allocations: 0 bytes)
8.855318486319671e-7
julia> @btime mvpinbounds($P, $basis, $coeffs)
28.164 μs (0 allocations: 0 bytes)
8.855318486319671e-7
julia> @btime mvpavx($P, $basis, $coeffs)
19.370 μs (0 allocations: 0 bytes)
8.855318486319674e-7
30 is less awkward than 10: with vectors of length 8, 30 fills 4 vectors with only 2 wasted lanes (30/32 useful), while 10 fills 2 vectors with 6 wasted lanes (10/16 useful).
I think that’s why your second version is faster (len_c is much larger than len_P).
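To spell out the lane arithmetic (utilization here is just an illustrative helper, not anything from LoopVectorization):

# Fraction of vector lanes doing useful work when the vectorized loop runs for
# `dim` iterations and each vector holds W doubles (W = 8 with avx512):
utilization(dim, W = 8) = dim / (cld(dim, W) * W)

utilization(30)  # 30/32 ≈ 0.94
utilization(10)  # 10/16 ≈ 0.62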
I also see about a 2x improvement when dim = 10. (Also, note you meant to use basist instead of basis as an argument.)
The plumbing is there for LoopVectorization to use static sizes / size hints when making optimization decisions, although I haven’t really been using it yet (it doesn’t take them into account intelligently, but that should be simple to add).
What would be a nice API for passing “size hints” that it will use when optimizing (generated loops will remain dynamic)?
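For instance (purely hypothetical syntax, not something LoopVectorization accepts today), maybe a keyword on the macro:

# Hypothetical sketch, not a real API: hint that the inner loop is statically
# 30 iterations long, while the generated loop itself stays dynamic.
@avx size_hint = (len_P = 30,) for n = 1:len_c
    pn = coeffs[n]
    for a = 1:len_P
        pn *= P[a, basis[a, n]]
    end
    p += pn
end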
Knowing that one dimension is much longer, it should be able to figure out on its own that it wants to vectorize along that much longer dimension instead of the short one. It won’t be able to transpose matrices for you, but that shouldn’t be as big a deal, since you’re gathering from it anyway.
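For reference, the transposed-layout variant I have in mind (I believe this is roughly what your second version does; it assumes basist = permutedims(basis), so the long n dimension is contiguous in memory) looks something like:

function mvpavx_t(P, basist, coeffs::Vector{T}) where {T}
    len_c = length(coeffs)   # nbasis, the long dimension
    len_P = size(P, 1)       # dim, the short dimension
    p = zero(T)
    @avx for n = 1:len_c
        pn = coeffs[n]
        for a = 1:len_P
            pn *= P[a, basist[n, a]]  # basist[n, a] is contiguous in n for fixed a
        end
        p += pn
    end
    return p
end

basist = permutedims(basis);  # (nbasis, dim) layout
mvpavx_t(P, basist, coeffs)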
It may depend on architecture, but definitely worth trying in my opinion.
I think the gather would likely be faster. I’m not sure about scatter.
I haven’t tested scatter as much. Please file an issue with a small example. I’ll get it working and add it to the tests.
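For concreteness, the kind of minimal example I have in mind is an indexed store like this (whether @avx handles it correctly today is exactly what the issue would pin down; note a vectorized scatter is only well defined when the indices don’t repeat):

# A minimal scatter kernel: the indices are on the store side instead of the load side.
function scatter!(y, x, idx)
    @avx for i in eachindex(x)
        y[idx[i]] = x[i]  # indexed store -> scatter when vectorized
    end
    return y
end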