Sadly, I think that is the best I can do. I was hoping I could use alloca to get a pointer, store the variable there, and finally load from it, and have LLVM eliminate the allocation. However, that does not seem to be the case:
using StaticArrays

# Assumed SIMD.jl-style definitions for the Vec type and VE used below:
const Vec{N,T} = NTuple{N,Core.VecElement{T}}
const VE = Core.VecElement

a = @SVector randn(24);

@inline function llvmzload(A::SArray{Tuple{24}, Float64, 1, 24}, i::Int)
    Base.llvmcall(("",
        """
        ; spill the SArray to the stack, then load 8 doubles at zero-based index %1
        %sptr = alloca { [24 x double] }
        store { [24 x double] } %0, { [24 x double] }* %sptr
        %elptr = getelementptr { [24 x double] }, { [24 x double] }* %sptr, i32 0, i32 0, i64 %1
        %ptr = bitcast double* %elptr to <8 x double>*
        %res = load <8 x double>, <8 x double>* %ptr, align 8
        ret <8 x double> %res"""),
        Vec{8, Float64}, Tuple{SArray{Tuple{24}, Float64, 1, 24}, Int}, A, i)
end
yields:
julia> @code_native llvmzload(a, 7)
.text
; Function llvmzload {
; Location: REPL[162]:3
subq $64, %rsp
vmovups (%rdi), %zmm0
vmovups 64(%rdi), %zmm1
vmovups 128(%rdi), %zmm2
vmovups %zmm0, -128(%rsp)
vmovups %zmm1, -64(%rsp)
vmovups %zmm2, (%rsp)
vmovups -128(%rsp,%rsi,8), %zmm0
addq $64, %rsp
retq
nopl (%rax)
;}
I was basing this on a 10-year-old email which said that LLVM has no address-of operator, unlike C (and company):
You probably already know this, but I want to state it just in case. LLVM has no address-of operator, so if you see an instruction like “%sum = add i32 %x, %y” you can’t get the address of %sum (it’s a register, after all). The C frontend doesn’t interpret “int32_t i;” as creating an LLVM variable of type i32; it actually does “i32* %i_addr = alloca i32”. LLVM’s optimizers will remove the alloca and lower it to registers if nobody takes its address.
So I was hoping it would eliminate the moves to %rsp and just do a single move from %rdi.
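For what it's worth, the promotion the quote describes does happen for a simple scalar round-trip. Here is a minimal sketch (the function name allocaload is mine); its @code_native should show no stack traffic at all:

@inline function allocaload(x::Float64)
    Base.llvmcall(("",
        """
        ; scalar alloca round-trip; mem2reg should elide this entirely
        %p = alloca double
        store double %0, double* %p
        %r = load double, double* %p
        ret double %r"""),
        Float64, Tuple{Float64}, x)
end

Since the scalar case is promoted, it is presumably the dynamic index together with the vector-typed load spanning several array elements that stops SROA from splitting the aggregate alloca in llvmzload.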
If I try to use @asmcall instead, could that function still be inlined into the calling functions?
The problem with those simple tuple functions is that I’ve never seen the compiler use a masked move instruction for them:
julia> m2 = (true, true, true, true, true, true, true, false)
(true, true, true, true, true, true, true, false)
julia> @inline function vload2(::Type{Vec{N,T}}, x, ::Val{mask}) where {N,T,mask}
ntuple(n -> mask[n] ? VE(T(x[n])) : VE(zero(T)), Val(N))
end
vload2 (generic function with 2 methods)
julia> vload2(Vec{8,Float64}, a, Val(m2))
(VecElement{Float64}(0.1341517372061226), VecElement{Float64}(-1.4478062620335768), VecElement{Float64}(-0.830543624514204), VecElement{Float64}(0.9402472351085811), VecElement{Float64}(-0.5112288316089226), VecElement{Float64}(0.19750599041128963), VecElement{Float64}(0.9528499075643893), VecElement{Float64}(0.0))
julia> @code_native vload2(Vec{8,Float64}, a, Val(m2))
.text
; Function vload2 {
; Location: REPL[165]:1
; Function ntuple; {
; Location: sysimg.jl:271
; Function macro expansion; {
; Location: REPL[165]:1
vmovups (%rsi), %ymm0
vmovups 32(%rsi), %xmm1
vmovsd 48(%rsi), %xmm2 # xmm2 = mem[0],zero
vinsertf128 $1, %xmm2, %ymm1, %ymm1
vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
;}}
retq
nopl (%rax)
;}
I want a single masked move, not 3 moves, a vinsertf128, and a vinsertf64x4. That is still much better than even the non-masked move using llvmcall.
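One route that should produce the single masked move is calling the llvm.masked.load intrinsic directly through llvmcall. This is only a sketch under assumptions: the name masked_vload is mine, Vec is the NTuple-of-VecElement type from above, the mask is hard-coded to match m2, and the Ptr argument is assumed to arrive as an i64 (as llvmcall passes pointers on 64-bit platforms):

@inline function masked_vload(ptr::Ptr{Float64})
    Base.llvmcall((
        # declaration of the overloaded LLVM masked-load intrinsic
        "declare <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)",
        """
        %ptr = inttoptr i64 %0 to <8 x double>*
        %res = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* %ptr, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 false>, <8 x double> zeroinitializer)
        ret <8 x double> %res"""),
        Vec{8, Float64}, Tuple{Ptr{Float64}}, ptr)
end

On AVX-512 this ought to lower to a kmov of the mask constant plus a single masked vmovupd, though it requires a raw pointer rather than an SArray passed by value.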
MArrays are far easier to optimize, but don’t get along with autodiff libraries like Zygote.
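For illustration, a hedged usage sketch with an MArray (masked_vload is the hypothetical helper above; Base.unsafe_convert gives the raw pointer, and GC.@preserve keeps the array rooted for the duration of the call):

ma = MVector{24}(a)  # mutable copy of the SVector from above
GC.@preserve ma begin
    v = masked_vload(Base.unsafe_convert(Ptr{Float64}, ma))
end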