Is there a way to call Julia methods with llvmcall in a way that is transparent to the optimizer, so that these methods can be inlined, etc.? Suppose we define
julia> foo(a,b,c) = sin(a * b + c)
foo (generic function with 1 method)
Each compiled specialization gets its own mangled name:
; julia> @code_llvm debuginfo=:none foo(1,2,3)
; Function Attrs: uwtable
define double @julia_foo_17134(i64, i64, i64) #0 {
top:
%3 = mul i64 %1, %0
%4 = add i64 %3, %2
%5 = sitofp i64 %4 to double
%6 = call double @julia_sin_17105(double %5)
ret double %6
}
; julia> @code_llvm debuginfo=:none foo(1.,2.,3.)
; Function Attrs: uwtable
define double @julia_foo_17142(double, double, double) #0 {
top:
%3 = fmul double %0, %1
%4 = fadd double %3, %2
%5 = call double @julia_sin_17105(double %4)
ret double %5
}
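The mangled names can also be read programmatically rather than by eye. The following is a minimal sketch, assuming code_llvm from InteractiveUtils and a regular expression that matches the julia_<name>_<number> pattern seen in the dumps above; llvm_ir and mangled_name are hypothetical helper names, not part of any package.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

foo(a, b, c) = sin(a * b + c)

# Capture the LLVM IR of one specialization as a string (hypothetical helper).
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo = :none))

# Return the first symbol of the form julia_<name>_<number>, or nothing
# if the pattern does not appear (hypothetical helper).
function mangled_name(f, types)
    m = match(r"julia_\w+?_\d+", llvm_ir(f, types))
    return m === nothing ? nothing : String(m.match)
end

int_name   = mangled_name(foo, Tuple{Int,Int,Int})
float_name = mangled_name(foo, Tuple{Float64,Float64,Float64})
```

The numeric suffix is assigned by the compiler and changes between sessions, so a name obtained this way is only useful for inspection, not as a stable symbol to call from llvmcall.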
My motivation is a “hack”, so maybe there is a better approach.
I want to be able to declare that sets of pointers, and pointers based on them, don’t alias one another, similar to C’s restrict.
LLVM lets you do this by marking either a function argument or a return value as noalias. I defined:
# llvmtype (not defined in this post) maps a Julia type to its LLVM type name.
@generated function noalias!(ptr::Ptr{T}) where {T}
    ptyp = llvmtype(Int)
    typ = llvmtype(T)
    # Identity function whose return value is marked noalias; noinline keeps the call visible.
    decls = "define noalias $typ* @noalias($typ *%a) noinline { ret $typ* %a }"
    instrs = [
        "%ptr = inttoptr $ptyp %0 to $typ*",
        "%naptr = call $typ* @noalias($typ* %ptr)",
        "%jptr = ptrtoint $typ* %naptr to $ptyp",
        "ret $ptyp %jptr"
    ]
    quote
        $(Expr(:meta,:inline))
        Base.llvmcall(
            $((decls, join(instrs, "\n"))),
            Ptr{$T}, Tuple{Ptr{$T}}, ptr
        )
    end
end
which works in my motivating example. This function calculates the dot product of two vectors of 16 Float64 in a really dumb way: it takes the elementwise product of the two vectors, storing the results into a third vector. Then, finally, it sums up the values in the third vector.
The goal is for the compiler to elide all the stores, and instead generate code for a regular dot product.
using PaddedMatrices
using SIMDPirates: noalias!, lifetime_start!, lifetime_end!
function test!(a,b,c)
    ptrana = noalias!(pointer(a))
    ptrb = pointer(b)
    ptrc = pointer(c)
    ptra = ptrana
    lifetime_start!(ptra, Val(128))
    # Store the elementwise products into a, 4 Float64 (32 bytes) at a time.
    for _ ∈ 1:4
        vb = vload(Vec{4,Float64}, ptrb)
        vc = vload(Vec{4,Float64}, ptrc)
        vstore!(ptra, vmul(vb, vc))
        ptra += 32
        ptrb += 32
        ptrc += 32
    end
    # Reload the products from a and accumulate them.
    ptra = ptrana
    out = vload(Vec{4,Float64}, ptra)
    for _ ∈ 1:3
        ptra += 32
        out = vadd(out, vload(Vec{4,Float64}, ptra))
    end
    lifetime_end!(ptrana, Val(128))
    vsum(out)
end
a = FixedSizeVector{16,Float64,16}(undef); fill!(a, 999.9);
b = @Mutable rand(16);
c = @Mutable rand(16);
The lifetime_start! and lifetime_end! functions say that the values within L*sizeof(T) bytes of the Ptr{T} argument are undefined before the start and after the end. Because the function doesn’t define the contents of a, writing to a is optional. If a is preallocated memory our program is using to save on allocations, we probably don’t actually care about the contents of a.
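For comparison, the same computation can be sketched in plain Julia with ordinary arrays. This is only an illustrative analogue under the assumption that a, b, and c are Vector{Float64}; dumb_dot! is a hypothetical name, and nothing here uses the lifetime or noalias hints.

```julia
using LinearAlgebra: dot

# Hypothetical plain-Julia analogue of the "dumb" dot product:
# write the elementwise products into a scratch buffer, then sum the buffer.
function dumb_dot!(a, b, c)
    @inbounds for i in eachindex(a, b, c)
        a[i] = b[i] * c[i]      # store into the scratch buffer
    end
    s = zero(eltype(a))
    @inbounds for i in eachindex(a)
        s += a[i]               # reload from the scratch buffer
    end
    return s
end

a = fill(999.9, 16)
b = rand(16)
c = rand(16)
result = dumb_dot!(a, b, c)
result ≈ dot(b, c)   # agrees with the regular dot product up to rounding
```

In this version the contents of a are defined by the function, so a holds the elementwise products afterwards, whereas test! combines explicit SIMD operations with the lifetime and noalias hints described above.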
This works as intended:
julia> b' * c
3.254541309302497
julia> a'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9
julia> test!(a, b, c)
3.254541309302497
julia> a'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9 999.9
There were no stores into a. The associated LLVM IR also shows no stores:
; julia> @code_llvm debuginfo=:none raw=true test!(a, b, c)
define double @"julia_test!_17580"(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128)) !dbg !5 {
top:
%3 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*, !dbg !7
%4 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
%ptr.i = bitcast %jl_value_t* %4 to double*, !dbg !14
%naptr.i = call double* @noalias(double* %ptr.i), !dbg !14
%naptr.i.ptr = bitcast double* %naptr.i to i8*
%5 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*, !dbg !19
%6 = addrspacecast %jl_value_t addrspace(11)* %5 to %jl_value_t*
%7 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*, !dbg !22
%8 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
call void @llvm.lifetime.start.p0i8(i64 1024, i8* %naptr.i.ptr), !dbg !25
%ptr.i23 = bitcast %jl_value_t* %6 to <4 x double>*, !dbg !29
%res.i24 = load <4 x double>, <4 x double>* %ptr.i23, align 8, !dbg !29
%ptr.i21 = bitcast %jl_value_t* %8 to <4 x double>*, !dbg !34
%res.i22 = load <4 x double>, <4 x double>* %ptr.i21, align 8, !dbg !34
%res.i20 = fmul fast <4 x double> %res.i22, %res.i24, !dbg !38
%9 = bitcast %jl_value_t* %6 to i8*, !dbg !51
%10 = getelementptr i8, i8* %9, i64 32, !dbg !51
%11 = bitcast %jl_value_t* %8 to i8*, !dbg !54
%12 = getelementptr i8, i8* %11, i64 32, !dbg !54
%ptr.i23.1 = bitcast i8* %10 to <4 x double>*, !dbg !29
%res.i24.1 = load <4 x double>, <4 x double>* %ptr.i23.1, align 8, !dbg !29
%ptr.i21.1 = bitcast i8* %12 to <4 x double>*, !dbg !34
%res.i22.1 = load <4 x double>, <4 x double>* %ptr.i21.1, align 8, !dbg !34
%res.i20.1 = fmul fast <4 x double> %res.i22.1, %res.i24.1, !dbg !38
%13 = getelementptr i8, i8* %9, i64 64, !dbg !51
%14 = getelementptr i8, i8* %11, i64 64, !dbg !54
%ptr.i23.2 = bitcast i8* %13 to <4 x double>*, !dbg !29
%res.i24.2 = load <4 x double>, <4 x double>* %ptr.i23.2, align 8, !dbg !29
%ptr.i21.2 = bitcast i8* %14 to <4 x double>*, !dbg !34
%res.i22.2 = load <4 x double>, <4 x double>* %ptr.i21.2, align 8, !dbg !34
%res.i20.2 = fmul fast <4 x double> %res.i22.2, %res.i24.2, !dbg !38
%15 = getelementptr i8, i8* %9, i64 96, !dbg !51
%16 = getelementptr i8, i8* %11, i64 96, !dbg !54
%ptr.i23.3 = bitcast i8* %15 to <4 x double>*, !dbg !29
%res.i24.3 = load <4 x double>, <4 x double>* %ptr.i23.3, align 8, !dbg !29
%ptr.i21.3 = bitcast i8* %16 to <4 x double>*, !dbg !34
%res.i22.3 = load <4 x double>, <4 x double>* %ptr.i21.3, align 8, !dbg !34
%res.i20.3 = fmul fast <4 x double> %res.i22.3, %res.i24.3, !dbg !38
%res.i14 = fadd fast <4 x double> %res.i20.1, %res.i20, !dbg !56
%res.i14.1 = fadd fast <4 x double> %res.i20.2, %res.i14, !dbg !56
%res.i14.2 = fadd fast <4 x double> %res.i20.3, %res.i14.1, !dbg !56
call void @llvm.lifetime.end.p0i8(i64 1024, i8* %naptr.i.ptr), !dbg !63
%vec_2_1.i = shufflevector <4 x double> %res.i14.2, <4 x double> undef, <2 x i32> <i32 0, i32 1>, !dbg !67
%vec_2_2.i = shufflevector <4 x double> %res.i14.2, <4 x double> undef, <2 x i32> <i32 2, i32 3>, !dbg !67
%vec_2.i = fadd <2 x double> %vec_2_1.i, %vec_2_2.i, !dbg !67
%vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer, !dbg !67
%vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>, !dbg !67
%vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i, !dbg !67
%res.i = extractelement <1 x double> %vec_1.i, i32 0, !dbg !67
ret double %res.i, !dbg !72
}
The story is similar for the asm; however, it shows that we still have the no-op call to noalias:
# julia> @code_native debuginfo=:none test!(a, b, c)
.text
pushq %r14
pushq %rbx
pushq %rax
movq %rdx, %rbx
movq %rsi, %r14
movabsq $noalias, %rax
callq *%rax
vmovupd (%rbx), %ymm0
vmovupd 32(%rbx), %ymm1
vmovupd 64(%rbx), %ymm2
vmovupd 96(%rbx), %ymm3
vmulpd (%r14), %ymm0, %ymm0
vfmadd231pd 32(%r14), %ymm1, %ymm0 # ymm0 = (ymm1 * mem) + ymm0
vfmadd231pd 64(%r14), %ymm2, %ymm0 # ymm0 = (ymm2 * mem) + ymm0
vfmadd231pd 96(%r14), %ymm3, %ymm0 # ymm0 = (ymm3 * mem) + ymm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %xmm1, %xmm0, %xmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddsd %xmm1, %xmm0, %xmm0
addq $8, %rsp
popq %rbx
popq %r14
vzeroupper
retq
nop
noalias is currently declared noinline. If it is instead declared alwaysinline:
@generated function noalias_inline!(ptr::Ptr{T}) where {T}
    ptyp = llvmtype(Int)
    typ = llvmtype(T)
    # Same identity function as before, but marked alwaysinline instead of noinline.
    decls = "define noalias $typ* @noalias($typ *%a) alwaysinline { ret $typ* %a }"
    instrs = [
        "%ptr = inttoptr $ptyp %0 to $typ*",
        "%naptr = call $typ* @noalias($typ* %ptr)",
        "%jptr = ptrtoint $typ* %naptr to $ptyp",
        "ret $ptyp %jptr"
    ]
    quote
        $(Expr(:meta,:inline))
        Base.llvmcall(
            $((decls, join(instrs, "\n"))),
            Ptr{$T}, Tuple{Ptr{$T}}, ptr
        )
    end
end
We lose the aliasing information, so for correctness LLVM cannot elide the first three stores (the ones followed by a subsequent load):
julia> test_inline!(a, b, c) # calls noalias_inline! instead
3.254541309302497
julia> a'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
0.0890577 0.00165566 0.0539506 0.030376 0.410241 0.2834 0.171888 0.392196 0.0487064 0.109759 0.241781 0.738396 999.9 999.9 999.9 999.9
This means the price of the noalias information is currently that of a non-inlined call. If the call is inlined, the information is lost.
I would rather get that information for free if I can.
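Why the stores have to stay when the compiler cannot prove the pointers are distinct can be seen in a small plain-Julia sketch. The function name double_then_sum! is hypothetical and unrelated to the packages above; it stores into one array and afterwards loads from another, so the result depends on whether the two arguments are the same memory.

```julia
# Hypothetical helper: stores 2x into tmp, then reads x again.
function double_then_sum!(tmp, x)
    for i in eachindex(tmp, x)
        tmp[i] = 2 * x[i]   # store to tmp
    end
    return sum(x)           # later loads from x
end

x = [1.0, 2.0, 3.0]
distinct = double_then_sum!(similar(x), x)   # tmp and x are separate arrays
aliased  = double_then_sum!(x, x)            # tmp and x are the same array

(distinct, aliased)   # (6.0, 12.0)
```

Because the aliased call returns a different value, an optimizer that drops the stores is only correct when it knows the two arguments never overlap, which is the information noalias is meant to provide.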
My hack solution, when I want to declare one or more arguments noalias, is to make the Julia function @inline and then wrap it with an LLVM function (using llvmcall) that does declare those arguments as noalias. Using generated functions, it should also be method-generic, although it probably won’t automatically recompile when the wrapped function gets redefined.
Is there some other way to get the behavior I desire?
@foobar_lv2 Tagging because of your general interest and knowledge about this sort of thing, plus the fact that your recent thread on global const arrays suggests you may have specific interest in optimizing the use of preallocated memory which doesn’t alias your other function arguments.