Help defining (masked) vload and vstore operations for SArrays (or other isbits structs) using llvmcall

Masked loads and stores are also LLVM intrinsics, albeit only defined for pointers. That is what I was using for MMatrices. In my opening post, I noted that hcat somehow receives pointers to SArrays and uses load / store operations on them.
Meaning, it seems to me, that it should also be possible for us to get a pointer and use these masked intrinsics.
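
Something like this is what I have in mind once a pointer is available (a rough sketch with a made-up name, not code I actually have; the mangled intrinsic name matches the typed-pointer LLVM versions Julia currently ships with):

const VE = Core.VecElement

# Sketch: load 7 doubles from ptr into an 8-wide vector via @llvm.masked.load,
# leaving the last lane zero. The i32 argument is the alignment in bytes,
# and the <8 x i1> vector is the mask.
@inline function vload_masked7(ptr::Ptr{Float64})
    Base.llvmcall((
        "declare <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)",
        """
        %p = inttoptr i64 %0 to <8 x double>*
        %v = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* %p, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 false>, <8 x double> zeroinitializer)
        ret <8 x double> %v
        """),
        NTuple{8,VE{Float64}}, Tuple{Ptr{Float64}}, ptr)
end

The open question for SArrays is getting that pointer in the first place; with MMatrices that part is easy.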

But I like the shufflevector idea.
An immediate problem, unfortunately, is that Julia lowers some NTuple{N,VecElement{T}}s to LLVM arrays instead of LLVM vectors (depending on N):

julia> A = @SMatrix randn(7,7);

julia> Avec = to_vec(A.data);

julia> typeof(Avec)
NTuple{49,VecElement{Float64}}

julia> SIMDPirates.shufflevector(Avec, Val{(0,1,2,3,4,5,6,6)}())
ERROR: error compiling shufflevector: Failed to parse LLVM Assembly: 
julia: llvmcall:3:36: error: '%0' defined with type '[49 x double]'
%res = shufflevector <49 x double> %0, <49 x double> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 6>
                                   ^

Stacktrace:
 [1] top-level scope at none:0

I’ll make a shufflevector method that accepts plain NTuple{N,T}s (which are always lowered as LLVM arrays) and then converts them to vectors. Hopefully that gets compiled away.
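
Roughly along these lines (a sketch with made-up names, not what will actually go into SIMDPirates). Since LLVM does not allow bitcasting an aggregate to a vector, the "bitcast" here is an element-wise repack via extractvalue / insertelement, which LLVM will hopefully fold into a plain shuffle:

const VE = Core.VecElement

# Sketch: shuffle directly out of a plain NTuple{N,Float64} (lowered as
# [N x double]) by extracting the masked lanes and inserting them into a
# <W x double> result. Indices in `mask` are zero-based, as in LLVM.
@generated function shufflevector_array(x::NTuple{N,Float64}, ::Val{mask}) where {N,mask}
    W = length(mask)
    instrs = String[]
    for (i, lane) in enumerate(mask)
        push!(instrs, "%e$i = extractvalue [$N x double] %0, $lane")
        prev = i == 1 ? "undef" : "%v$(i - 1)"
        push!(instrs, "%v$i = insertelement <$W x double> $prev, double %e$i, i32 $(i - 1)")
    end
    push!(instrs, "ret <$W x double> %v$W")
    quote
        $(Expr(:meta, :inline))
        Base.llvmcall($(join(instrs, "\n")), NTuple{$W,VE{Float64}}, Tuple{NTuple{$N,Float64}}, x)
    end
end

It would be called as shufflevector_array(A.data, Val{(0,1,2,3,4,5,6,6)}()), i.e. on the plain Float64 tuple rather than the VecElement one.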

But that leaves the store / constructing the output SMatrix.

Of course, I only actually need one masked load / shuffle-on-load.
In this example, @code_native shows that the shuffle was not replaced with a masked move:

@generated function mul2(A::SMatrix{7,N,T}, B::SMatrix{N,P,T}) where {N,P,T}
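    # Assumed context, not shown in this snippet: @nexprs is Base.Cartesian.@nexprs,
    # Vec{8,Float64} is SIMDPirates' NTuple{8,VecElement{Float64}}, VE abbreviates
    # VecElement, and vload2 is a vector-load helper defined elsewhere.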
    out_tup = :(C_1[2].value, C_1[3].value)
    for m ∈ 4:8
        push!(out_tup.args, :(C_1[$m].value))
    end
    for p ∈ 2:P, m ∈ 2:8
        push!(out_tup.args, :($(Symbol(:C_,p))[$m].value))
    end
    V = Vec{8,Float64}
    mask = (true, true, true, true, true, true, true, false)
    quote
        Acol = SIMDPirates.shufflevector(
            vload2($V, A, 0),
            (VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0)),
            Val{(8,0,1,2,3,4,5,6)}())
        @nexprs $P p -> C_p = SIMDPirates.vmul(Acol, vbroadcast($V, B[1,p]))
        @nexprs $(N-1) n -> begin
            Acol = vload2($V, A, 7n-1)
            @nexprs $P p -> begin
                C_p = vfma(Acol, vbroadcast($V, B[n+1,p]), C_p)
            end
        end
        SMatrix{7,$P,$T}($out_tup)
    end
end

Instead, we still have vinsertf128 and vinsertf64x4:

julia> A = @SMatrix randn(7,7);

julia> B = @SMatrix randn(7,7);

julia> @code_native mul2(A, B)
	.text
; Function mul2 {
; Location: REPL[113]:2
; Function macro expansion; {
; Location: REPL[113]:12
; Function shufflevector; {
; Location: shufflevector.jl:20
; Function macro expansion; {
; Location: REPL[113]:2
	vmovsd	(%rsi), %xmm0           # xmm0 = mem[0],zero
	vpslldq	$8, %xmm0, %xmm0        # xmm0 = zero,zero,zero,zero,zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7]
	vinsertf128	$1, 8(%rsi), %ymm0, %ymm0
	vinsertf64x4	$1, 24(%rsi), %zmm0, %zmm0
;}}
#... (cut rest of function)

julia> @btime mul2($A, $B);
  16.619 ns (0 allocations: 0 bytes)

But, with the shuffle applied to only a single load, it is a little faster. The storing still adds about 5 ns, roughly 50% more runtime, compared to the MMatrix version.

Unfortunately, both vector arguments to shufflevector have to be the same length, but I can also try using shuffles for assembling the SMatrix (combine columns 1&2, 3&4, 5&6, grow 7, then combine (1&2)&(3&4), …); see the sketch below.
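
For example, assuming the C_p registers from above (junk in lane 0, the column values in lanes 1 through 7), and assuming SIMDPirates.shufflevector allows the result length to differ from the input length the way the LLVM instruction does, the first combining step might look like:

# Drop the junk lane 0 of each 8-wide column register and concatenate the
# seven real lanes, giving a 14-wide vector; likewise for columns 3 & 4, 5 & 6.
C_12 = SIMDPirates.shufflevector(C_1, C_2, Val{(1,2,3,4,5,6,7, 9,10,11,12,13,14,15)}())
C_34 = SIMDPirates.shufflevector(C_3, C_4, Val{(1,2,3,4,5,6,7, 9,10,11,12,13,14,15)}())

and so on pairwise (growing column 7 to match) until the full 49-element tuple is assembled.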

I’m also inclined to create my own SMatrix type, where the tuple length is a multiple of the SIMD register width (i.e., 4 or 8, depending on AVX2 / AVX-512). If we can’t use the masked operations liberally, such padded types will be more efficient for most operations.
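
One way to realize that (a made-up type, not part of StaticArrays) would be something like:

# Sketch: pad each column to a stride W that is a multiple of the SIMD register
# width, so whole columns can be loaded and stored without masks. The padding
# lanes at the end of each column are simply never read.
struct PaddedSMatrix{M,N,T,W,L} <: AbstractMatrix{T}  # L == W * N
    data::NTuple{L,T}
end

Base.size(::PaddedSMatrix{M,N}) where {M,N} = (M, N)
Base.getindex(A::PaddedSMatrix{M,N,T,W}, i::Int, j::Int) where {M,N,T,W} = A.data[(j - 1) * W + i]

A few wasted lanes per column seems a fair price for unmasked column loads and stores.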