Using Base.llvmcall for cross-language LTO

So I use a custom C ABI with MLIR 21.1.8 for newer JIT features not present in MLIR.jl, and when using llvmcall for the saved JIT thunks I get a version mismatch in the IR. I tried sanitizing the unknown features from the IR for Julia, but this is very difficult, and I was wondering: if the wrapper doesn't trigger the MLIR JIT, I should be able to use Julia's internal LLVM completely for C sources, while templates and C++ callbacks still route through the MLIR 21 JIT. I get perfect results from inlining the C++ with the Julia JIT through llvmcall. C++ can hit Julia speed, since it's Julia's JIT doing the work and just casting the data from the .bc and shared library. This also opens the door for AD across language boundaries, since traditional ccall stopped the Julia JIT from following the IR.

For the C++ function `int add(int a, int b)`, the Julia wrapper calls into its bitcode via `Base.llvmcall`, falling back to a plain `ccall` when no bitcode is available.

If we compile C/C++ to LLVM bitcode (.bc) and hand it to Base.llvmcall, Julia's own LLVM JIT compiles it. The C/C++ IR becomes visible to Julia's optimization pipeline: inlining, SROA, vectorization, and AD all work as if the code were native Julia:
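For concreteness, the C++ side can be as simple as this. `mylib.cpp` is a hypothetical reconstruction; `_Z3addii` is just the Itanium mangling of `add(int, int)`:

```cpp
// mylib.cpp -- hypothetical source behind the bitcode used below.
// Compiled with something like: clang++ -c -emit-llvm -O2 mylib.cpp -o mylib_lto.bc
// (the clang's LLVM version must be compatible with Julia's LLVM).
int add(int a, int b) {
    return a + b;   // mangles to _Z3addii under the Itanium C++ ABI
}
```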

Bitcode loaded once at module parse time

```julia
const LTO_IR_PATH = joinpath(@__DIR__, "mylib_lto.bc")
const LTO_IR = isfile(LTO_IR_PATH) ? read(LTO_IR_PATH) : UInt8[]

function add(a::Cint, b::Cint)::Cint
    if !isempty(LTO_IR)
        # Julia's JIT compiles this: full optimization, AD-transparent
        return Base.llvmcall((LTO_IR, "_Z3addii"), Cint, Tuple{Cint, Cint}, a, b)
    else
        # Fallback: traditional opaque FFI
        return ccall((:_Z3addii, LIBRARY_PATH), Cint, (Cint, Cint), a, b)
    end
end
```

The `(LTO_IR, "_Z3addii")` form of `llvmcall` takes a bitcode module and a function name. Julia's LLVM parses the bitcode, finds the function, and inlines it directly into the calling code. The C++ literally runs at Julia speed because it is Julia's JIT doing the work.
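As a side note, mangled names can be sidestepped entirely by exporting the function with `extern "C"`, so the Julia wrapper can reference a stable symbol; `add_c` here is a hypothetical example, not part of the setup above:

```cpp
// Hypothetical: same function exported with C linkage, so the symbol
// is plain "add_c" instead of the Itanium-mangled "_Z3addii".
extern "C" int add_c(int a, int b) {
    return a + b;
}
```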

For complex C++ that needs the MLIR 21 JIT (virtual methods, template instantiations), we compile MLIR to LLVM IR, sanitize it, and assemble AOT thunk bitcode so that even those calls go through llvmcall:

Complex C++ with virtual dispatch: still goes through Julia's JIT

```julia
function call_virtual_method(obj::Ptr{MyClass}, x::Cdouble)::Cdouble
    x_ref = Ref(x)
    inner_ptrs = Ptr{Cvoid}[Ptr{Cvoid}(obj),
                            Ptr{Cvoid}(Base.unsafe_convert(Ptr{Cdouble}, x_ref))]
    # Keep the Ref and the pointer array rooted while raw pointers are in use
    GC.@preserve x_ref inner_ptrs begin
        if !isempty(THUNKS_LTO_IR)
            # llvmcall does not auto-convert a Vector, so pass the raw pointer
            return Base.llvmcall(
                (THUNKS_LTO_IR, "_mlir_ciface_MyClass_method_thunk"),
                Cdouble, Tuple{Ptr{Ptr{Cvoid}}}, pointer(inner_ptrs))
        else
            return ccall((:_mlir_ciface_MyClass_method_thunk, THUNKS_LIBRARY_PATH),
                         Cdouble, (Ptr{Ptr{Cvoid}},), inner_ptrs)
        end
    end
end
```
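A hand-written C++ equivalent of such a thunk might look like the sketch below (hypothetical names and argument layout; the real thunks are emitted by the MLIR pipeline with the `_mlir_ciface_` prefix):

```cpp
// Hypothetical C-ABI thunk: unpacks a pointer array into (object, argument)
// and forwards to the virtual call, so the exported symbol has a plain C ABI.
struct MyClass {
    virtual ~MyClass() = default;
    virtual double method(double x) const = 0;
};

extern "C" double MyClass_method_thunk(void **args) {
    const MyClass *obj = static_cast<const MyClass *>(args[0]);
    const double   x   = *static_cast<const double *>(args[1]);
    return obj->method(x);
}
```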

The real payoff is that AD tools can now differentiate through C/C++ code. Since llvmcall makes the IR visible to Julia's compiler, Enzyme (or any LLVM-level AD) can follow the data flow straight through what used to be an opaque ccall wall. This opens the door to differentiating mixed Julia/C++ codebases without manually writing adjoints for every foreign function.

| Tier | Median (ns) | ns / iter | Note |
| --- | --- | --- | --- |
| pure_julia | 676,738.0 | 0.677 | Julia `@inbounds` loop with native add |
| bare_ccall_loop | 1,800,310.0 | 1.800 | Julia loop, bare `ccall` in a typed function |
| wrapper_ccall_loop | 2,025,930.0 | 2.026 | Julia loop calling `ccall` wrapper (no LTO) |
| lto_llvmcall_loop | 677,078.0 | 0.677 | Julia loop with LTO |
| whole_loop_in_cpp | 997,147.0 | 0.997 | Single `ccall` to C++ `accumulate_array` |
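For context, the `accumulate_array` kernel behind the whole_loop_in_cpp row plausibly has a shape like this (a hypothetical reconstruction; the actual benchmark source is not shown):

```cpp
// Hypothetical shape of the whole-loop-in-C++ benchmark kernel:
// one ccall crosses the boundary, then the loop runs entirely in C++.
extern "C" double accumulate_array(const double *xs, long n) {
    double acc = 0.0;
    for (long i = 0; i < n; ++i)
        acc += xs[i];
    return acc;
}
```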


To be extremely clear, there's nothing novel about this. This approach was demonstrated in the paper Scalable Automatic Differentiation of Multiple Parallel Paradigms through Compiler Augmentation, by @wsmoses, @vchuravy, and their collaborators, and has been discussed a few times already here on Discourse. As mentioned in Is there an equivalent to cross-language link time optimization via LLVM? - #18 by vchuravy (you even intervened in that thread), the main challenge is an infrastructural one: you need to make sure you use compatible versions of LLVM to compile every piece of code, so that all the bitcodes can be merged.

No, but a pkg that does it automatically through a TOML config is really novel.

Quoting from your README:

> This requires the user to have a local toolchain installed

which is the entire problem, and I'm missing how that's solved "automatically".