Help defining (masked) vload and vstore operations for SArrays (or other isbits structs) using llvmcall

Sadly, I think that is the best I can do.
I was hoping I could use alloca to get a pointer, store the variable there, and finally load it, letting LLVM eliminate the allocation. However, that does not seem to be the case:

using StaticArrays

# `Vec{N,T}` is the NTuple-of-VecElement representation that llvmcall maps to `<N x T>`.
const VE = VecElement
const Vec{N,T} = NTuple{N,VE{T}}

a = @SVector randn(24);

@inline function llvmzload(A::SVector{24,Float64}, i::Int)
    # `i` is a zero-based element offset into the 24-element vector.
    Base.llvmcall("""
    %sptr = alloca { [24 x double] }
    store { [24 x double] } %0, { [24 x double] }* %sptr
    %elptr = getelementptr { [24 x double] }, { [24 x double] }* %sptr, i32 0, i32 0, i64 %1
    %ptr = bitcast double* %elptr to <8 x double>*
    %res = load <8 x double>, <8 x double>* %ptr, align 8
    ret <8 x double> %res""", Vec{8,Float64}, Tuple{SVector{24,Float64}, Int}, A, i)
end

yields:

julia> @code_native llvmzload(a, 7)
	.text
; Function llvmzload {
; Location: REPL[162]:3
	subq	$64, %rsp
	vmovups	(%rdi), %zmm0
	vmovups	64(%rdi), %zmm1
	vmovups	128(%rdi), %zmm2
	vmovups	%zmm0, -128(%rsp)
	vmovups	%zmm1, -64(%rsp)
	vmovups	%zmm2, (%rsp)
	vmovups	-128(%rsp,%rsi,8), %zmm0
	addq	$64, %rsp
	retq
	nopl	(%rax)
;}

I was basing this on a 10-year-old email saying that LLVM has no address-of operator, unlike C (& co.):

You probably already know this, but I want to state it just in case.
LLVM has no address-of operator, so if you see an instruction like “%sum
= add i32 %x, %y” you can’t get the address of %sum (it’s a register,
after all).

The C frontend doesn’t interpret “int32_t i;” as creating an LLVM
variable of type i32; it actually does “i32* %i_addr = alloca i32”.
LLVM’s optimizers will remove the alloca and lower it to registers if
nobody takes its address.

So I was hoping it would eliminate the moves to %rsp and just emit a single move from %rdi.
If I used @asmcall instead, could that function still be inlined into its callers?
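In the meantime, one workaround is to hand llvmcall a pointer instead of the aggregate, so the IR contains a plain load and no alloca at all. A rough sketch (the vloadptr and llvmzload_ref names are mine, and the Ref box may itself cost a copy unless the compiler elides it):

@inline function vloadptr(ptr::Ptr{Float64})
    # Ptr arguments arrive in llvmcall as i64, hence the inttoptr.
    Base.llvmcall("""
    %ptr = inttoptr i64 %0 to <8 x double>*
    %res = load <8 x double>, <8 x double>* %ptr, align 8
    ret <8 x double> %res""", Vec{8,Float64}, Tuple{Ptr{Float64}}, ptr)
end

@inline function llvmzload_ref(A::SVector{24,Float64}, i::Int)
    r = Ref(A)  # box the SArray so we can take its address from Julia rather than in the IR
    GC.@preserve r begin
        p = Base.unsafe_convert(Ptr{SVector{24,Float64}}, r)
        vloadptr(Ptr{Float64}(p) + i*sizeof(Float64))  # i is again a zero-based offset
    end
end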

The problem with those simple tuple functions is that I’ve never seen the compiler emit a masked instruction for them.

julia> m2 = (true, true, true, true, true, true, true, false)
(true, true, true, true, true, true, true, false)

julia> @inline function vload2(::Type{Vec{N,T}}, x, ::Val{mask}) where {N,T,mask}
           ntuple(n -> mask[n] ? VE(T(x[n])) : VE(zero(T)), Val(N))
       end
vload2 (generic function with 2 methods)

julia> vload2(Vec{8,Float64}, a, Val(m2))
(VecElement{Float64}(0.1341517372061226), VecElement{Float64}(-1.4478062620335768), VecElement{Float64}(-0.830543624514204), VecElement{Float64}(0.9402472351085811), VecElement{Float64}(-0.5112288316089226), VecElement{Float64}(0.19750599041128963), VecElement{Float64}(0.9528499075643893), VecElement{Float64}(0.0))

julia> @code_native vload2(Vec{8,Float64}, a, Val(m2))
	.text
; Function vload2 {
; Location: REPL[165]:1
; Function ntuple; {
; Location: sysimg.jl:271
; Function macro expansion; {
; Location: REPL[165]:1
	vmovups	(%rsi), %ymm0
	vmovups	32(%rsi), %xmm1
	vmovsd	48(%rsi), %xmm2         # xmm2 = mem[0],zero
	vinsertf128	$1, %xmm2, %ymm1, %ymm1
	vinsertf64x4	$1, %ymm1, %zmm0, %zmm0
;}}
	retq
	nopl	(%rax)
;}

I want a single masked move, not 3 moves plus a vinsertf128 and a vinsertf64x4. Still, that is much better than even the non-masked load using llvmcall.
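The masked load can also be requested directly through LLVM’s llvm.masked.load intrinsic. A minimal sketch with the mask from m2 hardcoded into the IR string (a @generated function could splice the mask in from a Val argument; masked_vload is my own name, not an existing API):

@inline function masked_vload(ptr::Ptr{Float64})
    # (declarations, body) tuple form of llvmcall, so we can declare the intrinsic.
    Base.llvmcall(("""
    declare <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)""",
    """
    %ptr = inttoptr i64 %0 to <8 x double>*
    %res = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* %ptr, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 false>, <8 x double> zeroinitializer)
    ret <8 x double> %res"""), Vec{8,Float64}, Tuple{Ptr{Float64}}, ptr)
end

On AVX-512 this should lower to a single masked vmovupd into a zmm register, with the masked-off lane zeroed via the zeroinitializer passthrough; the pointer’s source still has to be rooted with GC.@preserve at the call site.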

MArrays are far easier to optimize, but don’t get along with autodiff libraries like Zygote.
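For example, since MArrays are mutable we can take their address directly, so the masked sketch above applies without the Ref dance (untested, and it assumes the MArray’s data tuple sits at the start of the object, which it does):

ma = MVector{24,Float64}(Tuple(a));
v = GC.@preserve ma masked_vload(Ptr{Float64}(pointer_from_objref(ma)));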