Can you call Julia methods with LLVM call?

Is there a way to call Julia methods with llvmcall in a way that is transparent to the optimizer, so that these methods can be inlined, etc.? For example, after defining

julia> foo(a,b,c) = sin(a * b + c)
foo (generic function with 1 method)

Each method gets a mangled name:

; julia> @code_llvm debuginfo=:none foo(1,2,3)

; Function Attrs: uwtable
define double @julia_foo_17134(i64, i64, i64) #0 {
top:
  %3 = mul i64 %1, %0
  %4 = add i64 %3, %2
  %5 = sitofp i64 %4 to double
  %6 = call double @julia_sin_17105(double %5)
  ret double %6
}

; julia> @code_llvm debuginfo=:none foo(1.,2.,3.)

; Function Attrs: uwtable
define double @julia_foo_17142(double, double, double) #0 {
top:
  %3 = fmul double %0, %1
  %4 = fadd double %3, %2
  %5 = call double @julia_sin_17105(double %4)
  ret double %5
}

My motivation is a “hack”, so maybe there is a better approach.
I want to be able to declare that a set of pointers, and any pointers derived from them, don’t alias one another, similar to C's restrict.
LLVM lets you do this by marking either a function argument or a return value as noalias. I defined:

@generated function noalias!(ptr::Ptr{T}) where {T}
    ptyp = llvmtype(Int)
    typ = llvmtype(T)
    decls = "define noalias $typ* @noalias($typ *%a) noinline { ret $typ* %a }"
    instrs = [
        "%ptr = inttoptr $ptyp %0 to $typ*",
        "%naptr = call $typ* @noalias($typ* %ptr)",
        "%jptr = ptrtoint $typ* %naptr to $ptyp",
        "ret $ptyp %jptr"
    ]
    quote
        $(Expr(:meta,:inline))
        Base.llvmcall(
            $((decls, join(instrs, "\n"))),
            Ptr{$T}, Tuple{Ptr{$T}}, ptr
        )
    end    
end

which works in my motivating example. This function calculates a dot product of two vectors of 16 Float64 in a really dumb way: it takes the elementwise product of the two vectors, storing the results into a third vector. Then, finally, it sums up the values in the third vector.
The goal is for the compiler to elide all the stores, and instead generate code for a regular dot product.

using PaddedMatrices
using SIMDPirates: noalias!, lifetime_start!, lifetime_end!

function test!(a,b,c)
    ptrana = noalias!(pointer(a))
    ptrb = pointer(b)
    ptrc = pointer(c)
    ptra = ptrana
    lifetime_start!(ptra, Val(128))
    for _ ∈ 1:4
        vb = vload(Vec{4,Float64}, ptrb)
        vc = vload(Vec{4,Float64}, ptrc)
        vstore!(ptra, vmul(vb, vc))
        ptra += 32
        ptrb += 32
        ptrc += 32
    end
    ptra = ptrana
    out = vload(Vec{4,Float64}, ptra)
    for _ ∈ 1:3
        ptra += 32
        out = vadd(out, vload(Vec{4,Float64}, ptra))
    end
    lifetime_end!(ptrana, Val(128))
    vsum(out)
end

a = FixedSizeVector{16,Float64,16}(undef); fill!(a, 999.9);
b = @Mutable rand(16);
c = @Mutable rand(16);

The lifetime_start! and lifetime_end! functions say that the values within L*sizeof(T) bytes of the Ptr{T} argument are undefined before the start and after the end. Because the function doesn’t define the contents of a, writing to a is optional. If a is preallocated memory our program is using to save on allocations, we probably don’t actually care about the contents of a.
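As a quick sanity check of the sizes involved, here is the byte arithmetic written out. The helper name marker_bytes is made up for illustration and is not part of SIMDPirates; it only restates the L*sizeof(T) rule from the paragraph above.

```julia
# Hypothetical helper: number of bytes covered by a lifetime marker
# created with Val(L) on a Ptr{T}, following the L*sizeof(T) rule.
marker_bytes(::Type{T}, L::Integer) where {T} = L * sizeof(T)

marker_bytes(Float64, 128)  # 1024
marker_bytes(Float64, 16)   # 128
```

With Val(128) and Float64 the marker covers 1024 bytes, which is the size argument passed to llvm.lifetime.start in the IR for test!. The 16-element vector a itself occupies 128 bytes.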

This works as intended:

julia> b' * c
3.254541309302497

julia> a'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
 999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9

julia> test!(a, b, c)
3.254541309302497

julia> a'
1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
 999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9  999.9

There were no stores into a. The associated LLVM IR also shows no stores:

; julia> @code_llvm debuginfo=:none raw=true test!(a, b, c)

define double @"julia_test!_17580"(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128)) !dbg !5 {
top:
  %3 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*, !dbg !7
  %4 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
  %ptr.i = bitcast %jl_value_t* %4 to double*, !dbg !14
  %naptr.i = call double* @noalias(double* %ptr.i), !dbg !14
  %naptr.i.ptr = bitcast double* %naptr.i to i8*
  %5 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*, !dbg !19
  %6 = addrspacecast %jl_value_t addrspace(11)* %5 to %jl_value_t*
  %7 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*, !dbg !22
  %8 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
  call void @llvm.lifetime.start.p0i8(i64 1024, i8* %naptr.i.ptr), !dbg !25
  %ptr.i23 = bitcast %jl_value_t* %6 to <4 x double>*, !dbg !29
  %res.i24 = load <4 x double>, <4 x double>* %ptr.i23, align 8, !dbg !29
  %ptr.i21 = bitcast %jl_value_t* %8 to <4 x double>*, !dbg !34
  %res.i22 = load <4 x double>, <4 x double>* %ptr.i21, align 8, !dbg !34
  %res.i20 = fmul fast <4 x double> %res.i22, %res.i24, !dbg !38
  %9 = bitcast %jl_value_t* %6 to i8*, !dbg !51
  %10 = getelementptr i8, i8* %9, i64 32, !dbg !51
  %11 = bitcast %jl_value_t* %8 to i8*, !dbg !54
  %12 = getelementptr i8, i8* %11, i64 32, !dbg !54
  %ptr.i23.1 = bitcast i8* %10 to <4 x double>*, !dbg !29
  %res.i24.1 = load <4 x double>, <4 x double>* %ptr.i23.1, align 8, !dbg !29
  %ptr.i21.1 = bitcast i8* %12 to <4 x double>*, !dbg !34
  %res.i22.1 = load <4 x double>, <4 x double>* %ptr.i21.1, align 8, !dbg !34
  %res.i20.1 = fmul fast <4 x double> %res.i22.1, %res.i24.1, !dbg !38
  %13 = getelementptr i8, i8* %9, i64 64, !dbg !51
  %14 = getelementptr i8, i8* %11, i64 64, !dbg !54
  %ptr.i23.2 = bitcast i8* %13 to <4 x double>*, !dbg !29
  %res.i24.2 = load <4 x double>, <4 x double>* %ptr.i23.2, align 8, !dbg !29
  %ptr.i21.2 = bitcast i8* %14 to <4 x double>*, !dbg !34
  %res.i22.2 = load <4 x double>, <4 x double>* %ptr.i21.2, align 8, !dbg !34
  %res.i20.2 = fmul fast <4 x double> %res.i22.2, %res.i24.2, !dbg !38
  %15 = getelementptr i8, i8* %9, i64 96, !dbg !51
  %16 = getelementptr i8, i8* %11, i64 96, !dbg !54
  %ptr.i23.3 = bitcast i8* %15 to <4 x double>*, !dbg !29
  %res.i24.3 = load <4 x double>, <4 x double>* %ptr.i23.3, align 8, !dbg !29
  %ptr.i21.3 = bitcast i8* %16 to <4 x double>*, !dbg !34
  %res.i22.3 = load <4 x double>, <4 x double>* %ptr.i21.3, align 8, !dbg !34
  %res.i20.3 = fmul fast <4 x double> %res.i22.3, %res.i24.3, !dbg !38
  %res.i14 = fadd fast <4 x double> %res.i20.1, %res.i20, !dbg !56
  %res.i14.1 = fadd fast <4 x double> %res.i20.2, %res.i14, !dbg !56
  %res.i14.2 = fadd fast <4 x double> %res.i20.3, %res.i14.1, !dbg !56
  call void @llvm.lifetime.end.p0i8(i64 1024, i8* %naptr.i.ptr), !dbg !63
  %vec_2_1.i = shufflevector <4 x double> %res.i14.2, <4 x double> undef, <2 x i32> <i32 0, i32 1>, !dbg !67
  %vec_2_2.i = shufflevector <4 x double> %res.i14.2, <4 x double> undef, <2 x i32> <i32 2, i32 3>, !dbg !67
  %vec_2.i = fadd <2 x double> %vec_2_1.i, %vec_2_2.i, !dbg !67
  %vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer, !dbg !67
  %vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>, !dbg !67
  %vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i, !dbg !67
  %res.i = extractelement <1 x double> %vec_1.i, i32 0, !dbg !67
  ret double %res.i, !dbg !72
}

The asm tells a similar story; however, it shows that we still have the no-op call to noalias:

 # julia> @code_native debuginfo=:none test!(a, b, c)
         .text
         pushq   %r14
         pushq   %rbx
         pushq   %rax
         movq    %rdx, %rbx
         movq    %rsi, %r14
         movabsq $noalias, %rax
         callq   *%rax
         vmovupd (%rbx), %ymm0
         vmovupd 32(%rbx), %ymm1
         vmovupd 64(%rbx), %ymm2
         vmovupd 96(%rbx), %ymm3
         vmulpd  (%r14), %ymm0, %ymm0
         vfmadd231pd     32(%r14), %ymm1, %ymm0 # ymm0 = (ymm1 * mem) + ymm0
         vfmadd231pd     64(%r14), %ymm2, %ymm0 # ymm0 = (ymm2 * mem) + ymm0
         vfmadd231pd     96(%r14), %ymm3, %ymm0 # ymm0 = (ymm3 * mem) + ymm0
         vextractf128    $1, %ymm0, %xmm1
         vaddpd  %xmm1, %xmm0, %xmm0
         vpermilpd       $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
         vaddsd  %xmm1, %xmm0, %xmm0
         addq    $8, %rsp
         popq    %rbx
         popq    %r14
         vzeroupper
         retq
         nop

noalias is currently declared noinline. If it is instead declared alwaysinline:

@generated function noalias_inline!(ptr::Ptr{T}) where {T}
    ptyp = llvmtype(Int)
    typ = llvmtype(T)
    decls = "define noalias $typ* @noalias($typ *%a) alwaysinline { ret $typ* %a }"
    instrs = [
        "%ptr = inttoptr $ptyp %0 to $typ*",
        "%naptr = call $typ* @noalias($typ* %ptr)",
        "%jptr = ptrtoint $typ* %naptr to $ptyp",
        "ret $ptyp %jptr"
    ]
    quote
        $(Expr(:meta,:inline))
        Base.llvmcall(
            $((decls, join(instrs, "\n"))),
            Ptr{$T}, Tuple{Ptr{$T}}, ptr
        )
    end    
end

We lose the aliasing information, so for correctness LLVM cannot elide the first three stores (the ones followed by a subsequent load):

 julia> test_inline!(a, b, c) # calls noalias_inline! instead
 3.254541309302497
 
 julia> a'
 1×16 LinearAlgebra.Adjoint{Float64,FixedSizeArray{Tuple{16},Float64,1,Tuple{1},16}}:
  0.0890577  0.00165566  0.0539506  0.030376  0.410241  0.2834  0.171888  0.392196  0.0487064  0.109759  0.241781  0.738396  999.9  999.9  999.9  999.9

This means that the price of the noalias information is currently a non-inlined call. If the call is inlined, the information is lost.
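To illustrate why the stores have to stay when the compiler cannot prove that a is distinct from b and c, here is a minimal plain-Julia sketch. It uses ordinary arrays and views rather than the pointer code above, and dot_via_buffer! is a made-up name. When the buffer overlaps an input, the stores change what later loads see, so dropping them would change the result.

```julia
# Element-wise products go through the buffer `a`, then `a` is summed.
function dot_via_buffer!(a, b, c)
    for i in 1:length(a)
        a[i] = b[i] * c[i]
    end
    s = zero(eltype(a))
    for i in 1:length(a)
        s += a[i]
    end
    return s
end

buf = collect(1.0:20.0)
b = view(buf, 1:16)
a = view(buf, 5:20)   # overlaps b, shifted by 4 elements
c = ones(16)

expected = sum(collect(b) .* c)  # 136.0, computed from a copy before any store
got = dot_via_buffer!(a, b, c)   # 40.0, because stores into a overwrite parts of b
```

With non-overlapping arrays both numbers agree, which is exactly the guarantee that noalias is meant to hand to the optimizer.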

I would rather get that information for free if I can.

My hack solution was this: when I want to declare one or more arguments noalias, I make the Julia function @inline and then wrap it with an LLVM function (using llvmcall) that does declare those arguments as noalias. Using generated functions it should also be method-generic, although it probably won’t automatically recompile when the wrapped function gets redefined.

Is there some other way to get the behavior I desire?

@foobar_lv2 Tagging because of your general interest and knowledge about this sort of thing, plus the fact that your recent thread on global const arrays suggests you may have specific interest in optimizing the use of preallocated memory which doesn’t alias your other function arguments.


To answer your direct question: you can mark a Julia function as @ccallable, which exports a symbol into the global namespace that you can call. You will need to declare it and mark it as alwaysinline to actually get the behaviour you want.

As for the question you seem to be getting at in the end: you want something similar to restrict. I would be interested in seeing whether @aliasscopes from https://github.com/JuliaLang/julia/pull/31489 is sufficient to achieve what you want.
Fair warning: it is an experimental API and we would like to have a better implementation.


I just noticed that @ccallable is not really documented https://github.com/JuliaLang/julia/blob/77a4d06bf2283848e7a471731fef47f8e6275c8c/base/c.jl#L466-L505

Base.@ccallable Int function myfun(x::Int)
           x + 1
end

f(x) = Base.llvmcall(
       (""" declare i64 @myfun(i64) """,
        """
           %2 = call i64 @myfun(i64 %0)
           ret i64 %2
       """), Int, Tuple{Int}, x)

julia> f(1)
2

Here is an example on how to use LLVM.jl to call an intrinsic

and you can do the same thing to call a ccallable and https://github.com/maleadt/LLVM.jl/blob/792edc1fe9471b95bad87c619d25c87befe92a60/src/interop/base.jl#L34 to mark something as always inlinable.


Naively, I tried (following your example):

julia> Base.@ccallable Int function myfun(x::Int)
                  x + 1
       end

julia> f(x) = Base.llvmcall(
              (""" declare i64 @myfun(i64) alwaysinline """,
               """
                  %2 = call i64 @myfun(i64 %0)
                  ret i64 %2
              """), Int, Tuple{Int}, x)
f (generic function with 1 method)

julia> f(5)
6

julia> @code_llvm f(5)

;  @ REPL[2]:1 within `f'
; Function Attrs: uwtable
define i64 @julia_f_17091(i64) #0 {
top:
  %1 = call i64 @myfun(i64 %0)
  ret i64 %1
}

I added alwaysinline to the function declaration in the llvmcall, but the ccallable function was not inlined.
The idea was to declare two functions in the declarations: a Julia function, to be inlined, and another function (where the pointer arguments are noalias) that the Julia function gets inlined into, so that it gets compiled with that noalias information. The instructions of the llvmcall then call the wrapping llvm function, calling the Julia function.

That this doesn’t work seems to be unrelated to the @ccallable, as it seems to happen when writing out the mangled Julia name manually as well:

julia> @inline myinlinefunc(x::Int) = x + 1
myinlinefunc (generic function with 1 method)

julia> myinlinefunc(5)
6

julia> @code_llvm myinlinefunc(5)

;  @ REPL[16]:1 within `myinlinefunc'
; Function Attrs: uwtable
define i64 @julia_myinlinefunc_17172(i64) #0 {
top:
; ┌ @ int.jl:53 within `+'
   %1 = add i64 %0, 1
; └
  ret i64 %1
}

julia> f2(x) = Base.llvmcall(
              (""" declare i64 @julia_myinlinefunc_17172(i64) alwaysinline """,
               """
                  %2 = call i64 @julia_myinlinefunc_17172(i64 %0)
                  ret i64 %2
              """), Int, Tuple{Int}, x)
f2 (generic function with 1 method)

julia> f2(5)
6

julia> @code_llvm f2(5)

;  @ REPL[19]:1 within `f2'
; Function Attrs: uwtable
define i64 @julia_f2_17183(i64) #0 {
top:
  %1 = call i64 @julia_myinlinefunc_17172(i64 %0)
  ret i64 %1
}

So Base.@ccallable does do what I wanted, but I can’t get the function to actually inline.

If the Julia function doesn’t get inlined, I wouldn’t expect it to be compiled with the noalias info.

I’ll have to spend more time looking at LLVM.jl and the linked examples to see if it provides a solution. Maybe I didn’t specify the alwaysinline correctly, even though it didn’t error (and it did work when I had a function body, like the @noalias example).

I’m at work now, and building Julia master to try @aliasscopes, although if I run into trouble I’ll wait until I get home.
It looks like it only works for Array{T,N}, and isn’t extensible to AbstractArrays or pointers.
Searching for const_arrayref, I can’t seem to find anything indicating how I could go about implementing a getindex or vload function for custom types with llvmcall.

From looking at the examples, I’d expect it to work. But I have largely been using AbstractArray types that are fat pointers or pointers holding size information in their types (like StaticArrays), pointing to sections of a big block of preallocated memory.

unsafe_wrap is more costly than the noninlined function call from the noalias! approach.

I think you will have to use LLVM.jl to get that attribute set correctly.

Yeah … the issue is to get to the right aliasset and that needs to happen during codegen, and yes it only works for const_arrayref :confused:

This doesn’t seem to be correct:

using LLVM, LLVM.Interop

Base.@ccallable Int function myfun(x::Int)
    x + 1
end
@generated function testjuliainline(y::Int)
    T_int = LLVM.IntType(sizeof(Int)*8, JuliaContext())

    paramtyps = [ T_int ]
    ret_typ = T_int # returning an Int
    llvmf, _ = create_function(ret_typ, paramtyps)

    mod = LLVM.parent(llvmf)
    intrinsic_typ = LLVM.FunctionType(T_int, paramtyps)
    intrinsic = LLVM.Function(mod, "myfun", intrinsic_typ)

    push!(function_attributes(intrinsic), EnumAttribute("alwaysinline", 0, JuliaContext()))
    
    Builder(JuliaContext()) do builder
        entry = BasicBlock(llvmf, "entry", JuliaContext())
        position!(builder, entry)
        val = call!(builder, intrinsic, [parameters(llvmf)[1]])
        ret!(builder, val)
    end

    call_function(llvmf, Int, Tuple{Int}, :(y,))
end
myfun(8)
testjuliainline(3)
@code_llvm testjuliainline(3)
@code_native testjuliainline(3)

It yields:

julia> myfun(8)
9

julia> testjuliainline(3)
4
;; julia> @code_llvm testjuliainline(3)

;  @ REPL[34]:2 within `testjuliainline'
define i64 @julia_testjuliainline_17245(i64) {
top:
; ┌ @ REPL[34]:2 within `macro expansion' @ /home/chriselrod/.julia/packages/LLVM/ICZSf/src/interop/base.jl:52
   %1 = call i64 @myfun(i64 %0)
; │ @ REPL[34]:2 within `macro expansion'
   ret i64 %1
; └
}
# julia> @code_native testjuliainline(3)
        .text
        pushq   %rax
        movabsq $140087071072160, %rax  # imm = 0x7F68901BDFA0
        callq   *%rax
        popq    %rcx
        retq
        nop

An alternative idea that I’d expect to work just as well as noalias is to declare the constant inputs invariant; if LLVM knew that those stores don’t change the inputs (whether because they don’t alias, or because the inputs are invariant), it would be able to elide the stores.

invariant.start and invariant.end have an API that involves returning a token of type {}*, which if I understand correctly, would be a pointer to an empty structure.

This doesn’t seem to have a Julia equivalent, because empty structs (and tuples) seem to be lowered to void. But given that it’s a pointer, I figured I could handle it like any other pointer (ptrtoint and inttoptr back again when using it), since they’re all just Int on the Julia side of things anyway.
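As a small illustration of the "they're all just Int on the Julia side" point, here is a sketch in plain Julia (no llvmcall). The function name pointer_roundtrip is made up for this example. It converts a typed pointer to Ptr{Cvoid} and to an integer address and back, which is the Julia-level analogue of the ptrtoint / inttoptr pair.

```julia
# Hypothetical helper: round-trip a pointer through Ptr{Cvoid} and UInt.
function pointer_roundtrip(x::Vector{Float64})
    GC.@preserve x begin
        p    = pointer(x)                # Ptr{Float64}
        pv   = convert(Ptr{Cvoid}, p)    # same address, untyped pointer
        addr = UInt(p)                   # raw address as an integer
        back = Ptr{Float64}(addr)        # integer back to a typed pointer
        (UInt(pv) == addr, back == p, unsafe_load(back))
    end
end

x = [1.5, 2.5, 3.5]
pointer_roundtrip(x)                 # (true, true, 1.5)
sizeof(Ptr{Cvoid}) == sizeof(Int)    # true
```

This only shows that the address survives the conversions; whether LLVM keeps any metadata attached to a pointer across such casts is a separate question.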

However, the parser does not seem to like {}*; effort 1:

using SIMDPirates: llvmtype
struct Invariant{L,T}
    ivp::Ptr{Cvoid}
    ptr::Ptr{T}
end
@generated function invariant_start!(ptr::Ptr{T}, ::Val{L}) where {L,T}
    ptyp = llvmtype(Int)
    decls = "declare {}* @llvm.invariant.start.p0i8(i64, i8* nocapture)"
    instrs = [
        "%ptr = inttoptr $ptyp %0 to i8*",
        "%ivt = call {}* @llvm.invariant.start.p0i8(i64 $(L*sizeof(T)), i8* %ptr)",
        "%ivp = ptrtoint {}* %ivt to $ptyp",
        "ret %ivp"
    ]
    quote
        $(Expr(:meta,:inline))
        ivp = Base.llvmcall(
            $((decls, join(instrs, "\n"))),
            Ptr{Cvoid}, Tuple{Ptr{$T}}, ptr
        )
        Invariant{$(L*sizeof(T)),T}(ivp, ptr)
    end
end
@generated function invariant_end!(ivp::Invariant{L,T}) where {L,T}
    ptyp = llvmtype(Int)
    decls = "declare void @llvm.invariant.end.p0i8({}*, i64, i8* nocapture)"
    instrs = [
        "%ivp = inttoptr $ptyp %0 to {}*",
        "%ptr = inttoptr $ptyp %1 to i8*",
        # L already counts bytes: invariant_start! stores L*sizeof(T) in the type
        "call void @llvm.invariant.end.p0i8({}* %ivp, i64 $(L), i8* %ptr)",
        "ret void"
    ]
    quote
        $(Expr(:meta,:inline))
        Base.llvmcall(
            $((decls, join(instrs, "\n"))),
            Cvoid, Tuple{Ptr{Cvoid},Ptr{$T}}, ivp.ivp, ivp.ptr
        )
    end
end

This yields (just checking whether it compiles):

julia> x = rand(100);

julia> invariant_start!(pointer(x), Val(100))
ERROR: error compiling invariant_start!: Failed to parse LLVM Assembly: 
julia: llvmcall:9:1: error: expected value token
}
^

Stacktrace:
 [1] top-level scope at REPL[6]:1
caused by [exception 1]
Failed to parse LLVM Assembly: 
julia: llvmcall:9:1: error: expected value token
}
^

Stacktrace:
 [1] top-level scope at REPL[6]:1

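Seems like the parser is looking for contents and thus doesn’t expect the close? (Then again, the last instruction, ret %ivp, is missing its type, so the parser may be reading %ivp as a type name and then hitting the closing } where it expects a value; that could be the real cause rather than {}*.)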

So, using LLVM.jl:

using LLVM
using LLVM.Interop

using PaddedMatrices
struct Invariant{L,T}
    ivp::Ptr{Cvoid}
    ptr::Ptr{T}
end

@generated function invariant_start!(ptr::Ptr{T}, ::Val{L}) where {L,T}
    T_int = LLVM.IntType(sizeof(Int)*8, JuliaContext())
    # T_int = LLVM.IntType(sizeof(Int)*8, JuliaContext())
    paramtyps = [ T_int ]
    ret_typ = T_int # returning a Ptr{Cvoid}
    llvmf, _ = create_function(ret_typ, paramtyps)
    i8 = convert(LLVMType, Int8)
    i8_ptr = LLVM.PointerType(i8)

    emptystruct = LLVM.StructType(typeof(i8)[])
    emptystruct_ptr = LLVM.PointerType(emptystruct)
    
    mod = LLVM.parent(llvmf)
    intrinsic_typ = LLVM.FunctionType(emptystruct_ptr, [convert(LLVMType, Int), i8_ptr])
    intrinsic = LLVM.Function(mod, "llvm.invariant.start.p0i8", intrinsic_typ)

    Builder(JuliaContext()) do builder
        entry = BasicBlock(llvmf, "entry", JuliaContext())
        position!(builder, entry)
        ptr = inttoptr!(builder, parameters(llvmf)[1], i8_ptr)
        ivt_ptr = call!(builder, intrinsic, [ConstantInt(T_int, sizeof(T)*L), ptr])
        ivt_int = ptrtoint!(builder, ivt_ptr, T_int)
        ret!(builder, ivt_int)
    end

    call_function(llvmf, Ptr{Cvoid}, Tuple{Ptr{T}}, :(ptr,))
end
@generated function invariant_end!(ivt_ptr::Ptr{Cvoid}, ptr::Ptr{T}, ::Val{L}) where {L,T}
    T_int = LLVM.IntType(sizeof(Int)*8, JuliaContext())
    # T_int = LLVM.IntType(sizeof(Int)*8, JuliaContext())
    voidtype = LLVM.VoidType(JuliaContext())
    
    paramtyps = [ T_int, T_int ]
    ret_typ = voidtype # the wrapper returns nothing
    llvmf, _ = create_function(ret_typ, paramtyps)
    i8 = convert(LLVMType, Int8)
    i8_ptr = LLVM.PointerType(i8)

    emptystruct = LLVM.StructType(typeof(i8)[])
    emptystruct_ptr = LLVM.PointerType(emptystruct)

    mod = LLVM.parent(llvmf)
    intrinsic_typ = LLVM.FunctionType(voidtype, [emptystruct_ptr, convert(LLVMType, Int), i8_ptr])
    intrinsic = LLVM.Function(mod, "llvm.invariant.end.p0i8", intrinsic_typ)

    Builder(JuliaContext()) do builder
        entry = BasicBlock(llvmf, "entry", JuliaContext())
        position!(builder, entry)
        invt_ptr = inttoptr!(builder, parameters(llvmf)[1], emptystruct_ptr)
        ptr = inttoptr!(builder, parameters(llvmf)[2], i8_ptr)
        ret = call!(builder, intrinsic, [invt_ptr, ConstantInt(T_int, sizeof(T)*L), ptr])
        ret!(builder)
    end

    call_function(llvmf, Cvoid, Tuple{Ptr{Cvoid},Ptr{T}}, :(ivt_ptr, ptr))
end
using PaddedMatrices: AbstractMutableFixedSizeArray
@inline function freeze!(A::AbstractMutableFixedSizeArray{S,T,N,X,L}) where {S,T,N,X,L}
    A_ptr = pointer(A)
    Invariant{L,T}(invariant_start!(A_ptr, Val(L)), A_ptr)
end
@inline function melt!(ivt::Invariant{L,T}) where {L,T}
    invariant_end!(ivt.ivp, ivt.ptr, Val(L))
end


using SIMDPirates
using SIMDPirates: noalias!, lifetime_start!, lifetime_end!

function test!(a,b,c)
    # ptrana = noalias!(pointer(a))
    ptrana = pointer(a)
    freezeb = freeze!(b)
    freezec = freeze!(c)
    ptrb = pointer(b)
    ptrc = pointer(c)
    ptra = ptrana
    lifetime_start!(ptra, Val(128))
    for _ ∈ 1:4
        vb = vload(Vec{4,Float64}, ptrb)
        vc = vload(Vec{4,Float64}, ptrc)
        vstore!(ptra, vmul(vb, vc))
        ptra += 32
        ptrb += 32
        ptrc += 32
    end
    ptra = ptrana
    out = vload(Vec{4,Float64}, ptra)
    for _ ∈ 1:3
        ptra += 32
        out = vadd(out, vload(Vec{4,Float64}, ptra))
    end
    lifetime_end!(ptrana, Val(128))
    melt!(freezeb)
    melt!(freezec)
    vsum(out)
end

a = FixedSizeVector{16,Float64,16}(undef);
fill!(a, 999.9);
b = @Mutable rand(16);
c = @Mutable rand(16);
b' * c
test!(a, b, c)
a'

Unfortunately, the invariance didn’t seem to work:

;julia> @code_llvm debuginfo=:none test!(a, b, c)

define double @"julia_test!_17920"(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128)) {
top:
  %3 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %4 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
  %.ptr = bitcast %jl_value_t* %4 to i8*
  %5 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %6 = addrspacecast %jl_value_t addrspace(11)* %5 to %jl_value_t*
  %7 = bitcast %jl_value_t* %6 to i8*
  %8 = call {}* @llvm.invariant.start.p0i8(i64 128, i8* %7)
  %9 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
  %10 = addrspacecast %jl_value_t addrspace(11)* %9 to %jl_value_t*
  %11 = bitcast %jl_value_t* %10 to i8*
  %12 = call {}* @llvm.invariant.start.p0i8(i64 128, i8* %11)
  call void @llvm.lifetime.start.p0i8(i64 1024, i8* %.ptr)
  %ptr.i23 = bitcast %jl_value_t* %6 to <4 x double>*
  %res.i24 = load <4 x double>, <4 x double>* %ptr.i23, align 8
  %ptr.i21 = bitcast %jl_value_t* %10 to <4 x double>*
  %res.i22 = load <4 x double>, <4 x double>* %ptr.i21, align 8
  %res.i20 = fmul fast <4 x double> %res.i22, %res.i24
  %ptr.i19 = bitcast %jl_value_t* %4 to <4 x double>*
  store <4 x double> %res.i20, <4 x double>* %ptr.i19, align 8
  %13 = getelementptr i8, i8* %.ptr, i64 32
  %14 = getelementptr i8, i8* %7, i64 32
  %15 = getelementptr i8, i8* %11, i64 32
  %ptr.i23.1 = bitcast i8* %14 to <4 x double>*
  %res.i24.1 = load <4 x double>, <4 x double>* %ptr.i23.1, align 8
  %ptr.i21.1 = bitcast i8* %15 to <4 x double>*
  %res.i22.1 = load <4 x double>, <4 x double>* %ptr.i21.1, align 8
  %res.i20.1 = fmul fast <4 x double> %res.i22.1, %res.i24.1
  %ptr.i19.1 = bitcast i8* %13 to <4 x double>*
  store <4 x double> %res.i20.1, <4 x double>* %ptr.i19.1, align 8
  %16 = getelementptr i8, i8* %.ptr, i64 64
  %17 = getelementptr i8, i8* %7, i64 64
  %18 = getelementptr i8, i8* %11, i64 64
  %ptr.i23.2 = bitcast i8* %17 to <4 x double>*
  %res.i24.2 = load <4 x double>, <4 x double>* %ptr.i23.2, align 8
  %ptr.i21.2 = bitcast i8* %18 to <4 x double>*
  %res.i22.2 = load <4 x double>, <4 x double>* %ptr.i21.2, align 8
  %res.i20.2 = fmul fast <4 x double> %res.i22.2, %res.i24.2
  %ptr.i19.2 = bitcast i8* %16 to <4 x double>*
  store <4 x double> %res.i20.2, <4 x double>* %ptr.i19.2, align 8
  %19 = getelementptr i8, i8* %7, i64 96
  %20 = getelementptr i8, i8* %11, i64 96
  %ptr.i23.3 = bitcast i8* %19 to <4 x double>*
  %res.i24.3 = load <4 x double>, <4 x double>* %ptr.i23.3, align 8
  %ptr.i21.3 = bitcast i8* %20 to <4 x double>*
  %res.i22.3 = load <4 x double>, <4 x double>* %ptr.i21.3, align 8
  %res.i20.3 = fmul fast <4 x double> %res.i22.3, %res.i24.3
  %res.i14 = fadd fast <4 x double> %res.i20.1, %res.i20
  %res.i14.1 = fadd fast <4 x double> %res.i20.2, %res.i14
  %res.i14.2 = fadd fast <4 x double> %res.i20.3, %res.i14.1
  call void @llvm.lifetime.end.p0i8(i64 1024, i8* %.ptr)
  call void @llvm.invariant.end.p0i8({}* %8, i64 128, i8* %7)
  call void @llvm.invariant.end.p0i8({}* %12, i64 128, i8* %11)
  %res.i = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double 0.000000e+00, <4 x double> %res.i14.2)
  ret double %res.i
}

%8 and %12 hold the results of the invariant starts, matched with their respective invariant ends at the end of the function.
However, we still have three stores sandwiched between the starts and ends. :frowning:
I would have thought this expresses the same information as the no-aliasing version: what else is the problem with aliasing, other than mutating the contents behind pointers b and c?

Is there anything apparent that I did wrong, either in using LLVM to get a @ccallable Julia function to inline so that alias information is available while compiling, or in getting LLVM to use the invariance information to elide the stores?

It seems like more generally, there is a problem with LLVM code not inlining function calls, whether you’re using Base.llvmcall or LLVM.jl. For example:

julia> @generated function addthree(x::Int)
           ptyp = "i64"
           decls = "define $ptyp @addone($ptyp %a) inlinehint { %b = add $ptyp 1, %a\n ret $ptyp %b }"
           instrs = [
               "%apt = add $ptyp %0, 2",
               "%ret = call $ptyp @addone($ptyp %apt)",
               "ret $ptyp %ret"
           ]
           quote
               $(Expr(:meta,:inline))
               Base.llvmcall(
                   $((decls, join(instrs, "\n"))),
                   Int, Tuple{Int}, x
               )
           end
       end
addthree (generic function with 1 method)

julia> addthree(4)
7

julia> @code_llvm addthree(4)

;  @ REPL[12]:2 within `addthree'
; Function Attrs: uwtable
define i64 @julia_addthree_17541(i64) #0 {
top:
; ┌ @ REPL[12]:11 within `macro expansion'
   %apt.i = add i64 %0, 2
   %ret.i = call i64 @addone(i64 %apt.i)
   ret i64 %ret.i
; └
}

If I define @addone using alwaysinline instead of inlinehint, it will inline, but not otherwise (it won’t inline without any function attributes either).
This function is an obvious candidate for inlining, and it has no trouble inlining from Julia:

julia> jaddone(x) = x + 1
jaddone (generic function with 1 method)

julia> jaddthree(x) = jaddone(x) + 2
jaddthree (generic function with 1 method)

julia> @code_llvm jaddthree(4)

;  @ REPL[24]:1 within `jaddthree'
; Function Attrs: uwtable
define i64 @julia_jaddthree_17574(i64) #0 {
top:
; ┌ @ int.jl:53 within `+'
   %1 = add i64 %0, 3
; └
  ret i64 %1
}

julia> Base.@ccallable Int function myfun(x::Int)
                  x+1
       end

julia> myfun2(x) = myfun(x) + 2
myfun2 (generic function with 1 method)

julia> myfun2(4)
7

julia> @code_llvm myfun2(4)

;  @ REPL[27]:1 within `myfun2'
; Function Attrs: uwtable
define i64 @julia_myfun2_17599(i64) #0 {
top:
; ┌ @ int.jl:53 within `+'
   %1 = add i64 %0, 3
; └
  ret i64 %1
}

So it seems like LLVM just doesn’t consider inlining at all when using llvmcall or LLVM.jl, like that pass isn’t applied.
EDIT:
Looking here, it seems that -always-inline is applied (and that the alwaysinline attribute can only be applied to a function definition, not to a function declaration?), but -inline is not, although it is applied to regular Julia functions.

Trying to add a PassManager does not help:

using LLVM, LLVM.Interop
Base.@ccallable Int function myfun(x::Int)
    x + 1
end
@generated function testjuliainline(y::Int)
    T_int = LLVM.IntType(sizeof(Int)*8, JuliaContext())

    paramtyps = [ T_int ]
    ret_typ = T_int # returning an Int
    llvmf, _ = create_function(ret_typ, paramtyps)

    mod = LLVM.parent(llvmf)
    intrinsic_typ = LLVM.FunctionType(T_int, paramtyps)
    intrinsic = LLVM.Function(mod, "myfun", intrinsic_typ)

    push!(function_attributes(intrinsic), EnumAttribute("alwaysinline", 0, JuliaContext()))
    
    Builder(JuliaContext()) do builder
        entry = BasicBlock(llvmf, "entry", JuliaContext())
        position!(builder, entry)
        val = call!(builder, intrinsic, [parameters(llvmf)[1]])
        ret!(builder, val)
    end
    PassManagerBuilder() do pmb
        optlevel!(pmb, 3)
        inliner!(pmb, 10000)
        FunctionPassManager(mod) do fpm
            populate!(fpm, pmb)
            run!(fpm, llvmf)
        end
    end
    call_function(llvmf, Int, Tuple{Int}, :(y,))
end
myfun(8)
testjuliainline(3)
@code_llvm debuginfo=:none testjuliainline(3)
@code_native debuginfo=:none testjuliainline(3)       

Results in:

julia> myfun(8)
9

julia> testjuliainline(3)
4

julia> @code_llvm debuginfo=:none testjuliainline(3)

; Function Attrs: uwtable
define i64 @julia_testjuliainline_17340(i64) #0 {
top:
  %1 = call i64 @myfun(i64 %0)
  ret i64 %1
}

julia> @code_native debuginfo=:none testjuliainline(3)
        .text
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $32, %rsp
        movabsq $689794224, %rax        # imm = 0x291D6CB0
        callq   *%rax
        addq    $32, %rsp
        popq    %rbp
        retq
        nopw    (%rax,%rax)

Adding function_inlining!(fpm) to the FunctionPassManager results in a segfault.

We only run the always-inliner in LLVM since we can’t assume that IPO is valid on Julia code and we have gotten “interesting” behaviour from time to time. The GPU stack used to do forced inlining on the LLVM level.

You could see how Julia passes the noinline LLVM attribute and replicate that for alwaysinline to see if that would allow for inlining into the body of the llvmcall. I thought that putting alwaysinline on the declaration would be enough, but it seems I was wrong :confused:

Ah. Julia does the inlining at the front end. Somehow I never realized:

Base.@ccallable Int function myfun(x::Int)
    x + 1
end
bar(x) = myfun(x) + 2
@code_typed bar(5)

Yields

julia> @code_typed bar(5)
CodeInfo(
1 ─ %1 = Base.add_int(x, 1)::Int64
│   %2 = Base.add_int(%1, 2)::Int64
└──      return %2
) => Int64
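To check this on other functions, one can count the call statements left in the typed IR. This is only a sketch: the names add_one, add_one_noinline, and count_invokes are hypothetical, and it assumes that a call which was not inlined shows up as an :invoke expression in the output of code_typed, as it does in recent Julia 1.x releases.

```julia
using InteractiveUtils  # provides code_typed outside the REPL

# Hypothetical example functions: one the compiler may inline, one it may not.
add_one(x::Int) = x + 1
@noinline add_one_noinline(x::Int) = x + 1

bar_inlined(x) = add_one(x) + 2
bar_call(x) = add_one_noinline(x) + 2

# Count the statements in the optimized typed IR that are still
# calls to a specific Julia method (:invoke expressions).
function count_invokes(f, types)
    ci, _ = first(code_typed(f, types))
    return count(stmt -> stmt isa Expr && stmt.head === :invoke, ci.code)
end

# No :invoke should remain when the callee was inlined at the Julia level.
println("bar_inlined: ", count_invokes(bar_inlined, (Int,)), " invoke(s)")
println("bar_call:    ", count_invokes(bar_call, (Int,)), " invoke(s)")
```

A remaining :invoke means the callee reaches LLVM as a separate function, which is the situation the alwaysinline experiment above was trying to address.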

Is your suggestion to look at this line

    if (jl_has_meta(stmts, noinline_sym)) {
        f->addFnAttr(Attribute::NoInline);
    }

within the body of emit_function?

Would I have to modify the Base Julia source code, or is there something else?

Or is there a way to manually inline the IR of interest, much like Julia manually inlines functions in the front end?
Reflection is illegal in the bodies of generated functions, so the obvious approach of calling code_llvm probably isn’t going to work.

I also tried using the LLVM assume intrinsic yesterday (to assume that the absolute value of the difference between the pointers was at least the vector length in bytes), but, like declaring the constant vectors invariant, that did not work either.

The kind of modification you are proposing is probably best done by modifying the Julia compiler. As an avenue for experimentation, you could start with something small, like adding an @llvminline macro similar to @noinline, to see whether you get the benefits you want before taking the bigger step of implementing this in Julia proper.

I tried this last night. I could create a PR when I get back home if you want to see the changes.
It didn’t work, but I suspect I may be missing a step.
Aside from adding the macro and editing codegen.cpp, I edited ast.c and julia_internal.h.

But I don’t know what happens to the meta information between base/compiler/ssair/driver.jl and the jl_has_meta(stmts, llvminline_sym) check.

I could also try an @restrict macro that adds noalias to each function parameter, unless there’s some way to more intelligently pass this information. I’d have to find where and how to pass parameter attributes.

Also, would there be any way to do this with TBAA?
Replacing b and c in my example with an NTuple (wrapped in a struct) works, producing the following LLVM IR:

;; julia> @code_llvm raw=true debuginfo=:none testv!(a, d, e)

define double @"julia_testv!_18248"(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(128), { [16 x double] } addrspace(11)* nocapture nonnull readonly dereferenceable(128), { [16 x double] } addrspace(11)* nocapture nonnull readonly dereferenceable(128)) !dbg !5 {
top:
  %3 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*, !dbg !7
  %4 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
  %ptr.i = bitcast %jl_value_t* %4 to i8*, !dbg !16
  call void @llvm.lifetime.start.p0i8(i64 1024, i8* %ptr.i), !dbg !16
  %5 = bitcast { [16 x double] } addrspace(11)* %1 to <4 x double> addrspace(11)*, !dbg !22
  %6 = load <4 x double>, <4 x double> addrspace(11)* %5, align 8, !dbg !22, !tbaa !31
  %7 = bitcast { [16 x double] } addrspace(11)* %2 to <4 x double> addrspace(11)*, !dbg !34
  %8 = load <4 x double>, <4 x double> addrspace(11)* %7, align 8, !dbg !34, !tbaa !31
  %res.i17 = fmul fast <4 x double> %8, %6, !dbg !38
  %9 = getelementptr inbounds { [16 x double] }, { [16 x double] } addrspace(11)* %1, i64 0, i32 0, i64 4, !dbg !51
  %10 = bitcast double addrspace(11)* %9 to <4 x double> addrspace(11)*, !dbg !22
  %11 = load <4 x double>, <4 x double> addrspace(11)* %10, align 8, !dbg !22, !tbaa !31
  %12 = getelementptr inbounds { [16 x double] }, { [16 x double] } addrspace(11)* %2, i64 0, i32 0, i64 4, !dbg !54
  %13 = bitcast double addrspace(11)* %12 to <4 x double> addrspace(11)*, !dbg !34
  %14 = load <4 x double>, <4 x double> addrspace(11)* %13, align 8, !dbg !34, !tbaa !31
  %res.i17.1 = fmul fast <4 x double> %14, %11, !dbg !38
  %15 = getelementptr inbounds { [16 x double] }, { [16 x double] } addrspace(11)* %1, i64 0, i32 0, i64 8, !dbg !51
  %16 = bitcast double addrspace(11)* %15 to <4 x double> addrspace(11)*, !dbg !22
  %17 = load <4 x double>, <4 x double> addrspace(11)* %16, align 8, !dbg !22, !tbaa !31
  %18 = getelementptr inbounds { [16 x double] }, { [16 x double] } addrspace(11)* %2, i64 0, i32 0, i64 8, !dbg !54
  %19 = bitcast double addrspace(11)* %18 to <4 x double> addrspace(11)*, !dbg !34
  %20 = load <4 x double>, <4 x double> addrspace(11)* %19, align 8, !dbg !34, !tbaa !31
  %res.i17.2 = fmul fast <4 x double> %20, %17, !dbg !38
  %21 = getelementptr inbounds { [16 x double] }, { [16 x double] } addrspace(11)* %1, i64 0, i32 0, i64 12, !dbg !51
  %22 = bitcast double addrspace(11)* %21 to <4 x double> addrspace(11)*, !dbg !22
  %23 = load <4 x double>, <4 x double> addrspace(11)* %22, align 8, !dbg !22, !tbaa !31
  %24 = getelementptr inbounds { [16 x double] }, { [16 x double] } addrspace(11)* %2, i64 0, i32 0, i64 12, !dbg !54
  %25 = bitcast double addrspace(11)* %24 to <4 x double> addrspace(11)*, !dbg !34
  %26 = load <4 x double>, <4 x double> addrspace(11)* %25, align 8, !dbg !34, !tbaa !31
  %res.i17.3 = fmul fast <4 x double> %26, %23, !dbg !38
  %res.i13.1 = fadd fast <4 x double> %res.i17.1, %res.i17, !dbg !55
  %res.i13.2 = fadd fast <4 x double> %res.i17.2, %res.i13.1, !dbg !55
  %res.i13.3 = fadd fast <4 x double> %res.i17.3, %res.i13.2, !dbg !55
  call void @llvm.lifetime.end.p0i8(i64 1024, i8* %ptr.i), !dbg !62
  %vec_2_1.i = shufflevector <4 x double> %res.i13.3, <4 x double> undef, <2 x i32> <i32 0, i32 1>, !dbg !66
  %vec_2_2.i = shufflevector <4 x double> %res.i13.3, <4 x double> undef, <2 x i32> <i32 2, i32 3>, !dbg !66
  %vec_2.i = fadd <2 x double> %vec_2_1.i, %vec_2_2.i, !dbg !66
  %vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer, !dbg !66
  %vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>, !dbg !66
  %vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i, !dbg !66
  %res.i = extractelement <1 x double> %vec_1.i, i32 0, !dbg !66
  ret double %res.i, !dbg !71
}

I imagine the !tbaa !31 tags are why this works (even though marking memory as invariant using the invariant.start intrinsic does not)? And in this case, does that tag correspond to tbaa_immut?

Is there anything like this that can be done?
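For reference, the tuple-based variant described above might look roughly like the sketch below. The names Vec16 and dot_via_buffer! are hypothetical stand-ins for the types and function used in the example, and whether the stores into the buffer are actually elided depends on the Julia and LLVM versions, so it is worth inspecting the result with @code_llvm rather than assuming it.

```julia
# Hypothetical immutable wrapper around a 16-element tuple.
struct Vec16
    data::NTuple{16,Float64}
end

# Deliberately naive dot product: store the elementwise products
# into the buffer `a`, then sum the buffer.
function dot_via_buffer!(a::Vector{Float64}, d::Vec16, e::Vec16)
    @inbounds for i in 1:16
        a[i] = d.data[i] * e.data[i]
    end
    s = 0.0
    @inbounds for i in 1:16
        s += a[i]
    end
    return s
end

d = Vec16(ntuple(i -> Float64(i), 16))
e = Vec16(ntuple(_ -> 2.0, 16))
a = zeros(16)
println(dot_via_buffer!(a, d, e))
```

Because d and e are immutable values, stores into a cannot change them, which is the property the pointer-based version has to communicate to LLVM some other way.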

I do not know how to pass TBAA information at all using LLVM.jl. Although instructions.jl seems to have some relevant definitions, I could not get any !tbaa tags to appear in the generated LLVM IR.
Using llvmcall to insert the metadata errors, because I would have to somehow map the numbers (which are not always the same) to the global definitions; i.e., just picking !tbaa !31 because I saw it above does not work.

Similarly, I can try to use alias.scope metadata with llvmcall, but LLVM maps the same numbers and metadata definitions from each call to a different set in the final code, so I can’t actually place separate loads and stores into the same alias scope.

You can use reflection to see where it gets lost, but yes, a PR would be interesting as well:

julia> @noinline f(x) = x
f (generic function with 1 method)

julia> @code_lowered f(1)
CodeInfo(
1 ─     $(Expr(:meta, :noinline))
└──     return x
)

julia> @code_typed optimize=false f(1)
CodeInfo(
1 ─     $(Expr(:meta, :noinline))
└──     return x
) => Int64

julia> @code_typed optimize=true f(1)
CodeInfo(
1 ─     return x
2 ─     $(Expr(:meta, :noinline))
) => Int64

julia> @code_llvm f(1)

;  @ REPL[1]:1 within `f'
; Function Attrs: noinline
define i64 @julia_f_16060(i64) #0 {
top:
  ret i64 %0
}
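The same check can be scripted, which may help when tracking a new meta symbol through the stages. This is a rough sketch: f_noinline and has_meta are hypothetical names, and the output depends on the Julia version, so no particular result is assumed here.

```julia
using InteractiveUtils  # code_lowered and code_typed outside the REPL

@noinline f_noinline(x) = x   # hypothetical example function

# True if any statement of the CodeInfo is a :meta expression containing `sym`.
has_meta(ci, sym::Symbol) =
    any(s -> s isa Expr && s.head === :meta && sym in s.args, ci.code)

lowered = first(code_lowered(f_noinline, (Int,)))
# code_typed returns a vector of `CodeInfo => return type` pairs.
unopt = first(first(code_typed(f_noinline, (Int,); optimize=false)))
opt   = first(first(code_typed(f_noinline, (Int,); optimize=true)))

for (stage, ci) in ("lowered" => lowered,
                    "typed, unoptimized" => unopt,
                    "typed, optimized" => opt)
    println(rpad(stage, 20), " has :noinline meta: ", has_meta(ci, :noinline))
end
```

Swapping :noinline for the experimental symbol (for example :llvminline) would show the last stage at which the annotation is still present on the Julia side.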

Right, these numbers are module-specific. To see the metadata, you can use @code_llvm raw=true dump_module=true and look at the bottom of the output. You need to insert metadata like that.
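As a small illustration, the module dump can be captured as a string and filtered down to the metadata definitions. The function first_elem is a hypothetical example chosen only because it contains a load; the sketch assumes the function form code_llvm(io, f, types; raw, dump_module) from InteractiveUtils, and the exact metadata printed will differ between Julia versions.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical example function whose body contains a memory load.
first_elem(v::Vector{Float64}) = @inbounds v[1]

# Capture the whole module, including metadata, as a string.
ir = sprint(io -> code_llvm(io, first_elem, (Vector{Float64},);
                            raw=true, dump_module=true))

# Metadata node definitions are the lines that start with '!'.
metadata_lines = filter(line -> startswith(line, "!"), split(ir, '\n'))
foreach(println, metadata_lines)
```

The numbers in these lines are local to the dumped module, which is why a tag such as !31 copied from one dump does not carry over to another.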

Sadly, I don’t know much about how TBAA works.