Masked loads and stores are also LLVM intrinsics, albeit only defined for pointers; that is what I was using for MMatrices. In my opening post, I noted that hcat somehow receives pointers to the SArrays and uses load / store operations on them.
Meaning, it seems to me, it should be possible for us to also get a pointer and use these masked intrinsics.
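For reference, something along these lines should work for the masked load itself (a hedged sketch: vload7 is a hypothetical name, and it assumes the convention SIMDPirates / SIMD.jl use in llvmcall, where a Ptr argument arrives as an i64):

using Core: VecElement
const Vec8F64 = NTuple{8,VecElement{Float64}}

# Hypothetical helper: masked 7-element column load via llvm.masked.load,
# zero-filling the 8th lane. Alignment is 8 bytes; the mask is seven 1s and a 0.
@inline function vload7(ptr::Ptr{Float64})
    Base.llvmcall((
        "declare <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)",
        """
        %p = inttoptr i64 %0 to <8 x double>*
        %v = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* %p, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 false>, <8 x double> zeroinitializer)
        ret <8 x double> %v
        """),
        Vec8F64, Tuple{Ptr{Float64}}, ptr)
end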
But I like the shufflevector idea.
An immediate problem unfortunately seems to be that Julia lowers some NTuple{N,VecElement{T}}s as LLVM arrays instead of LLVM vectors (depending on N):
julia> A = @SMatrix randn(7,7);
julia> Avec = to_vec(A.data);
julia> typeof(Avec)
NTuple{49,VecElement{Float64}}
julia> SIMDPirates.shufflevector(Avec, Val{(0,1,2,3,4,5,6,6)}())
ERROR: error compiling shufflevector: Failed to parse LLVM Assembly:
julia: llvmcall:3:36: error: '%0' defined with type '[49 x double]'
%res = shufflevector <49 x double> %0, <49 x double> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 6>
^
Stacktrace:
[1] top-level scope at none:0
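(to_vec isn't defined above; presumably it's just a helper along these lines, wrapping each element in a VecElement:)

to_vec(x::NTuple{N,T}) where {N,T} = ntuple(n -> Core.VecElement(x[n]), Val(N))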
I’ll make a shufflevector method that accepts NTuple{N,T}s (always arrays) and then bitcasts to vectors. Hopefully that gets compiled away.
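A sketch of what I have in mind (shufflevector_array is a hypothetical name). LLVM's bitcast doesn't apply to aggregate arrays directly, so one way to implement the "bitcast" is to extractvalue each selected element from the [N x double] and insertelement it into a vector of the mask's length; the optimizer should collapse the chain. This assumes N is a size Julia lowers as an array, like the 49 above:

@generated function shufflevector_array(x::NTuple{N,Core.VecElement{Float64}},
                                        ::Val{mask}) where {N,mask}
    M = length(mask)
    instrs = String[]
    for (i, j) in enumerate(mask)   # j is a 0-based index into the [N x double]
        push!(instrs, "%e$(i-1) = extractvalue [$N x double] %0, $j")
        src = i == 1 ? "undef" : "%v$(i-2)"
        push!(instrs, "%v$(i-1) = insertelement <$M x double> $src, double %e$(i-1), i32 $(i-1)")
    end
    push!(instrs, "ret <$M x double> %v$(M-1)")
    quote
        $(Expr(:meta, :inline))
        Base.llvmcall($(join(instrs, "\n")),
                      NTuple{$M,Core.VecElement{Float64}},
                      Tuple{NTuple{$N,Core.VecElement{Float64}}}, x)
    end
end

so that shufflevector_array(Avec, Val{(0,1,2,3,4,5,6,6)}()) would return an 8-lane vector.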
But that leaves the store, i.e., constructing the output SMatrix.
Of course, I only actually need one masked load / shuffle-on-load: the later 8-wide column loads can simply start one element early (offset 7n-1 below) and stay in bounds, with the stale lane discarded.
In this example, @code_native shows it was not replaced with a masked move:
using StaticArrays, SIMDPirates
using Base.Cartesian: @nexprs
# VE is shorthand for Core.VecElement, and vload2 is the unmasked
# 8-element load-at-offset helper from earlier in the thread.
@generated function mul2(A::SMatrix{7,N,T}, B::SMatrix{N,P,T}) where {N,P,T}
    # Lane 1 of each accumulator is a zero/garbage lane; lanes 2:8 hold the
    # 7-element column, so only those feed the output tuple.
    out_tup = :(C_1[2].value, C_1[3].value)
    for m ∈ 4:8
        push!(out_tup.args, :(C_1[$m].value))
    end
    for p ∈ 2:P, m ∈ 2:8
        push!(out_tup.args, :($(Symbol(:C_,p))[$m].value))
    end
    V = Vec{8,Float64}  # lanes are hard-coded Float64, so T == Float64 here
    quote
        # Column 1: shuffle the plain load so lane 1 is zero and lanes 2:8
        # hold A[1:7,1], instead of doing a masked 7-element load.
        Acol = SIMDPirates.shufflevector(
            vload2($V, A, 0),
            (VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0)),
            Val{(8,0,1,2,3,4,5,6)}())
        @nexprs $P p -> C_p = SIMDPirates.vmul(Acol, vbroadcast($V, B[1,p]))
        @nexprs $(N-1) n -> begin
            # Columns 2:N: an 8-wide load starting one element early stays in
            # bounds; the stale lane only pollutes the discarded lane of C_p.
            Acol = vload2($V, A, 7n-1)
            @nexprs $P p -> begin
                C_p = vfma(Acol, vbroadcast($V, B[n+1,p]), C_p)
            end
        end
        SMatrix{7,$P,$T}($out_tup)
    end
end
Instead, we still have vinsertf128 and vinsertf64x4:
julia> A = @SMatrix randn(7,7);
julia> B = @SMatrix randn(7,7);
julia> @code_native mul2(A, B)
.text
; Function mul2 {
; Location: REPL[113]:2
; Function macro expansion; {
; Location: REPL[113]:12
; Function shufflevector; {
; Location: shufflevector.jl:20
; Function macro expansion; {
; Location: REPL[113]:2
vmovsd (%rsi), %xmm0 # xmm0 = mem[0],zero
vpslldq $8, %xmm0, %xmm0 # xmm0 = zero,zero,zero,zero,zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7]
vinsertf128 $1, 8(%rsi), %ymm0, %ymm0
vinsertf64x4 $1, 24(%rsi), %zmm0, %zmm0
;}}
#... (cut rest of function)
julia> @btime mul2($A, $B);
16.619 ns (0 allocations: 0 bytes)
But (with doing this on only a single load) it is a little faster. Storing still adds about 5 ns, roughly 50% extra runtime, compared to the MMatrix version.
Unfortunately, both vector arguments to a shufflevector must be the same length, but I can also try using shuffles to assemble the output SMatrix (combine columns 1&2, 3&4, 5&6, grow column 7, then combine (1&2)&(3&4), …), roughly as sketched below.
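Roughly, that combine tree could be generated like this (a hypothetical sketch; it assumes, as LLVM allows, that the mask may be longer or shorter than the operands, with the mask length setting the result length; the C_p are the 8-lane accumulators from mul2):

# Hypothetical sketch of the shuffle-combine tree for the 7x7 case. Masks use
# 0-based LLVM lane indices; each C_p holds column p in lanes 1:7, lane 0 junk.
function combine_columns_expr()
    pick17(off) = (1, 2, 3, 4, 5, 6, 7) .+ off          # lanes 1:7 of one 8-lane operand
    m14 = (pick17(0)..., pick17(8)...)                  # 7 lanes from each operand
    quote
        C12 = shufflevector(C_1, C_2, Val{$m14}())      # cols 1&2 -> 14 lanes
        C34 = shufflevector(C_3, C_4, Val{$m14}())      # cols 3&4
        C56 = shufflevector(C_5, C_6, Val{$m14}())      # cols 5&6
        C7w = shufflevector(C_7, C_7, Val{$((pick17(0)..., ntuple(_ -> 8, 7)...))}())  # grow col 7 to 14 lanes
        C1to4 = shufflevector(C12, C34, Val{$(ntuple(i -> i - 1, 28))}())              # cols 1:4 -> 28 lanes
        # cols 5:7 land in lanes 0:20; pad to 28 lanes so the final operands match
        C5to7 = shufflevector(C56, C7w, Val{$((ntuple(i -> i - 1, 21)..., ntuple(_ -> 21, 7)...))}())
        shufflevector(C1to4, C5to7, Val{$(ntuple(i -> i - 1, 49))}())                  # 49 lanes = vec(C)
    end
end

(The 49-lane result is again an NTuple that Julia lowers as an array at function boundaries, but after inlining LLVM should keep it in registers until the final store.)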
I'm also inclined to create my own SMatrix type, where the tuple length is a multiple of the SIMD register width (i.e., 4 or 8 doubles, depending on avx2 / avx512). If we can't liberally use the masked operations, these padded matrices will be more efficient for most operations.
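A minimal sketch of what I mean (PaddedSMatrix and the W parameter are hypothetical names): pad each column to the register width W, so every column load/store is a plain full-width vector op.

using StaticArrays

# Hypothetical sketch: a static matrix whose columns are each padded to the
# SIMD width W, so every column occupies one full register, unmasked.
struct PaddedSMatrix{W,M,N,T,L}
    data::NTuple{L,Core.VecElement{T}}  # L == W*N; column j sits in lanes (j-1)*W .+ (1:M)
end

function PaddedSMatrix{W}(A::SMatrix{M,N,T}) where {W,M,N,T}
    data = ntuple(Val(W * N)) do i
        col, lane = divrem(i - 1, W)    # 0-based column index and lane within it
        Core.VecElement(lane < M ? A[lane+1, col+1] : zero(T))
    end
    PaddedSMatrix{W,M,N,T,W*N}(data)
end

For the 7×7 case, PaddedSMatrix{8}(A) stores 8×7 = 56 elements instead of 49; the cost is a little memory and zero-padding on construction, the payoff is that every column maps to one unmasked <8 x double> load/store.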