Help defining (masked) vload and vstore operations for SArrays (or other isbits structs) using llvmcall

Sadly, I think that is the best I can do.
I was hoping I could use alloca to get a pointer, store the variable there, and finally load it, letting LLVM eliminate the allocation. However, that does not seem to be the case:

using StaticArrays

# `Vec{N,T}` is the NTuple-of-VecElement representation that llvmcall maps to `<N x T>`.
const VE = VecElement
const Vec{N,T} = NTuple{N,VE{T}}

a = @SVector randn(24);

@inline function llvmzload(A::SVector{24,Float64}, i::Int)
    # `i` is a zero-based element offset into the 24-element vector.
    Base.llvmcall("""
    %sptr = alloca { [24 x double] }
    store { [24 x double] } %0, { [24 x double] }* %sptr
    %elptr = getelementptr { [24 x double] }, { [24 x double] }* %sptr, i32 0, i32 0, i64 %1
    %ptr = bitcast double* %elptr to <8 x double>*
    %res = load <8 x double>, <8 x double>* %ptr, align 8
    ret <8 x double> %res""", Vec{8,Float64}, Tuple{SVector{24,Float64}, Int}, A, i)
end

yields:

julia> @code_native llvmzload(a, 7)
	.text
; Function llvmzload {
; Location: REPL[162]:3
	subq	$64, %rsp
	vmovups	(%rdi), %zmm0
	vmovups	64(%rdi), %zmm1
	vmovups	128(%rdi), %zmm2
	vmovups	%zmm0, -128(%rsp)
	vmovups	%zmm1, -64(%rsp)
	vmovups	%zmm2, (%rsp)
	vmovups	-128(%rsp,%rsi,8), %zmm0
	addq	$64, %rsp
	retq
	nopl	(%rax)
;}

I was basing this on a 10-year-old email saying that LLVM has no address-of operator, unlike C (& co.):

You probably already know this, but I want to state it just in case.
LLVM has no address-of operator, so if you see an instruction like “%sum
= add i32 %x, %y” you can’t get the address of %sum (it’s a register,
after all).

The C frontend doesn’t interpret “int32_t i;” as creating an LLVM
variable of type i32; it actually does “i32* %i_addr = alloca i32”.
LLVM’s optimizers will remove the alloca and lower it to registers if
nobody takes its address.

So I was hoping it would eliminate the moves to %rsp and just emit a single move from %rdi.
If I used @asmcall instead, could that function still be inlined into its callers?
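In the meantime, one workaround is to hand llvmcall a pointer instead of the aggregate, so the IR contains a plain load and no alloca at all. A rough sketch (the vloadptr and llvmzload_ref names are mine, and the Ref box may itself cost a copy unless the compiler elides it):

@inline function vloadptr(ptr::Ptr{Float64})
    # Ptr arguments arrive in llvmcall as i64, hence the inttoptr.
    Base.llvmcall("""
    %ptr = inttoptr i64 %0 to <8 x double>*
    %res = load <8 x double>, <8 x double>* %ptr, align 8
    ret <8 x double> %res""", Vec{8,Float64}, Tuple{Ptr{Float64}}, ptr)
end

@inline function llvmzload_ref(A::SVector{24,Float64}, i::Int)
    r = Ref(A)  # box the SArray so we can take its address from Julia rather than in the IR
    GC.@preserve r begin
        p = Base.unsafe_convert(Ptr{SVector{24,Float64}}, r)
        vloadptr(Ptr{Float64}(p) + i*sizeof(Float64))  # i is again a zero-based offset
    end
end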

The problem with those simple tuple functions is that I’ve never seen the compiler emit a masked instruction for them.

julia> m2 = (true, true, true, true, true, true, true, false)
(true, true, true, true, true, true, true, false)

julia> @inline function vload2(::Type{Vec{N,T}}, x, ::Val{mask}) where {N,T,mask}
           ntuple(n -> mask[n] ? VE(T(x[n])) : VE(zero(T)), Val(N))
       end
vload2 (generic function with 2 methods)

julia> vload2(Vec{8,Float64}, a, Val(m2))
(VecElement{Float64}(0.1341517372061226), VecElement{Float64}(-1.4478062620335768), VecElement{Float64}(-0.830543624514204), VecElement{Float64}(0.9402472351085811), VecElement{Float64}(-0.5112288316089226), VecElement{Float64}(0.19750599041128963), VecElement{Float64}(0.9528499075643893), VecElement{Float64}(0.0))

julia> @code_native vload2(Vec{8,Float64}, a, Val(m2))
	.text
; Function vload2 {
; Location: REPL[165]:1
; Function ntuple; {
; Location: sysimg.jl:271
; Function macro expansion; {
; Location: REPL[165]:1
	vmovups	(%rsi), %ymm0
	vmovups	32(%rsi), %xmm1
	vmovsd	48(%rsi), %xmm2         # xmm2 = mem[0],zero
	vinsertf128	$1, %xmm2, %ymm1, %ymm1
	vinsertf64x4	$1, %ymm1, %zmm0, %zmm0
;}}
	retq
	nopl	(%rax)
;}

I want a single masked move, not 3 moves plus a vinsertf128 and a vinsertf64x4. Still, that is much better than even the non-masked load using llvmcall.
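The masked load can also be requested directly through LLVM’s llvm.masked.load intrinsic. A minimal sketch with the mask from m2 hardcoded into the IR string (a @generated function could splice the mask in from a Val argument; masked_vload is my own name, not an existing API):

@inline function masked_vload(ptr::Ptr{Float64})
    # (declarations, body) tuple form of llvmcall, so we can declare the intrinsic.
    Base.llvmcall(("""
    declare <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)""",
    """
    %ptr = inttoptr i64 %0 to <8 x double>*
    %res = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* %ptr, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 false>, <8 x double> zeroinitializer)
    ret <8 x double> %res"""), Vec{8,Float64}, Tuple{Ptr{Float64}}, ptr)
end

On AVX-512 this should lower to a single masked vmovupd into a zmm register, with the masked-off lane zeroed via the zeroinitializer passthrough; the pointer’s source still has to be rooted with GC.@preserve at the call site.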

MArrays are far easier to optimize, but don’t get along with autodiff libraries like Zygote.
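For example, since MArrays are mutable we can take their address directly, so the masked sketch above applies without the Ref dance (untested, and it assumes the MArray’s data tuple sits at the start of the object, which it does):

ma = MVector{24,Float64}(Tuple(a));
v = GC.@preserve ma masked_vload(Ptr{Float64}(pointer_from_objref(ma)));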