Is there an equivalent to cross-language link time optimization via LLVM?

What’s happening here? Are we just spitting LLM text at each other here? We’re a community of humans, let’s please talk to each other.


Sorry, those were copied from the recent commit for the project that works with cross-language LTO, and the changelog references templates for C++ directly, so I just pasted that too.

--- Per-call overhead (single invocation, 10000 samples) ---

  pure_julia           scalar_add                median=30.0 ns
  bare_ccall           scalar_add                median=30.0 ns
  wrapper_ccall        scalar_add                median=30.0 ns
  lto_llvmcall         scalar_add                median=30.0 ns
  pure_julia           scalar_mul                median=40.0 ns
  bare_ccall           scalar_mul                median=40.0 ns
  wrapper_ccall        scalar_mul                median=30.0 ns
  lto_llvmcall         scalar_mul                median=30.0 ns
  pure_julia           make_point                median=40.0 ns
  bare_ccall           make_point                median=40.0 ns
  wrapper_ccall        make_point                median=40.0 ns
  lto_llvmcall         make_point                median=40.0 ns
  bare_ccall_UNSAFE    pack_record               ⚠ skipped (packed struct ABI crash)
  wrapper_ccall        pack_record               median=80.0 ns

--- Hot loop: 1000000 iterations of add_to(acc, val) ---

  pure_julia           total=0.68 ms   0.677 ns/iter
  bare_ccall_loop      total=2.03 ms   2.026 ns/iter
  wrapper_ccall_loop   total=23.85 ms   23.854 ns/iter
  lto_llvmcall_loop    total=25.59 ms   25.594 ns/iter
  whole_loop_in_cpp    total=1.0 ms   0.997 ns/iter 

So this is where optimizing the code comes in, because I do have the llvmcall in place but I’m not currently matching the bare ccall. I also just got a lot of this working in the last 4 months, and it could be optimized for sure.
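For readers unfamiliar with the mechanism: this is not RepliBuild’s generated code, just a minimal hand-written sketch of what a `Base.llvmcall` binding looks like. The IR string is spliced directly into the function Julia’s JIT compiles, which is what lets a foreign function inline like native code:

```julia
# Minimal Base.llvmcall example: the LLVM IR body is compiled and inlined
# by Julia's JIT. The two arguments arrive as %0 and %1; with the implicit
# entry block taking %2, the first free SSA slot is %3.
llvm_add(x::Int64, y::Int64) = Base.llvmcall("""
    %3 = add i64 %1, %0
    ret i64 %3""", Int64, Tuple{Int64, Int64}, x, y)

llvm_add(2, 3)  # 5
```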

Those medians all being divisible by 10.0 ns suggests to me the single “invocation” (evaluation?) ran into timer resolution. More evaluations per sample, like the hot-loop benchmark, would address that. The wrapper adding 20 ns to a ccall is really bad. Have you checked whether LLVM optimizes away the wrapper’s branch, or whether the wrapper’s calls get inlined into other expressions?
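To make the batching suggestion concrete, here is one way to run many evaluations per timed sample so each sample spans thousands of timer ticks. This is a hand-rolled sketch using `time_ns`; BenchmarkTools.jl’s `@benchmark` with its `evals` and `samples` keywords does the same thing more carefully:

```julia
# Time `evals` calls per sample and take the minimum over `samples` samples,
# so each timed region is far longer than the clock's tick.
function ns_per_call(f, args...; evals=10_000, samples=200)
    acc = f(args...)              # warm up (forces compilation), keep result live
    best = typemax(UInt64)
    for _ in 1:samples
        t0 = time_ns()
        for _ in 1:evals
            acc = f(args...)      # keeping `acc` defeats dead-code elimination
        end
        best = min(best, time_ns() - t0)
    end
    best / evals, acc             # (minimum ns per call, last result)
end
```

Usage: `ns_per_call(+, 1, 2)` returns a sub-nanosecond figure for a native add, since the loop overhead is amortized over 10,000 evaluations.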

OK, I did have an issue in my wrapper. I fixed my code on the C++ side, and I have better results, and they’re very good.

Per-Call Overhead (vs Bare ccall)

| Scenario | Tier | Median (ns) | Vs bare ccall | Note |
|---|---|---|---|---|
| scalar_add | pure_julia | 31.0 | 1.0x | Julia native `a+b` |
| scalar_add | bare_ccall | 31.0 | 1.0x | Hand-written ccall (community baseline) |
| scalar_add | wrapper_ccall | 31.0 | 1.0x | RepliBuild generated ccall wrapper |
| scalar_add | lto_llvmcall | 31.0 | 1.0x | RepliBuild LTO: Base.llvmcall (Julia JIT inlines C++ IR) |
| scalar_mul | pure_julia | 41.0 | 1.0x | Julia native `a*b` |
| scalar_mul | bare_ccall | 41.0 | 1.0x | Hand-written ccall |
| scalar_mul | wrapper_ccall | 61.0 | 1.5x | RepliBuild generated ccall wrapper |
| scalar_mul | lto_llvmcall | 41.0 | 1.0x | RepliBuild LTO: Base.llvmcall |
| make_point | pure_julia | 41.0 | 1.0x | Julia native struct construction |
| make_point | bare_ccall | 41.0 | 1.0x | Hand-written ccall (struct return, manual layout) |
| make_point | wrapper_ccall | 41.0 | 1.0x | RepliBuild generated wrapper (ABI-verified layout) |
| make_point | lto_llvmcall | 41.0 | 1.0x | RepliBuild LTO: Base.llvmcall |
| pack_record | bare_ccall_UNSAFE | NaN | :warning: | Naive ccall: packed struct return crashes; cannot safely benchmark |
| pack_record | wrapper_ccall | 81.0 | | RepliBuild generated wrapper (DWARF-verified packed layout) |

Hot Loop (1,000,000 Iterations)

Running add_to(acc, val) continuously across the FFI boundary.

| Tier | Median (ns) | ns / iter | Note |
|---|---|---|---|
| pure_julia | 677,784.0 | 0.677 | Julia `@inbounds` loop with native add |
| bare_ccall_loop | 2,800,660.0 | 1.801 | Julia loop: bare ccall in a typed function |
| wrapper_ccall_loop | 677,889.0 | 0.677 | Julia loop calling RepliBuild ccall wrapper each iteration |
| lto_llvmcall_loop | 677,624.0 | 0.677 | Julia loop with LTO: Julia JIT inlines C++ add_to across FFI boundary |
| whole_loop_in_cpp | 997,968.0 | 0.997 | Single ccall to C++ accumulate_array (entire loop in C++) |

That just made C++ faster than C++.
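For anyone following along, the two loop shapes being compared can be sketched like this. Note the stand-ins: libm’s `cos` replaces the post’s C++ `add_to`, and the `accumulate_array` signature is only illustrative; the real benchmark calls a RepliBuild-built library.

```julia
# Shape A: cross the FFI boundary on every iteration (one ccall per trip).
function loop_ffi(n)
    acc = 0.0
    for _ in 1:n
        acc += ccall(:cos, Cdouble, (Cdouble,), 0.0)  # cos(0.0) == 1.0
    end
    acc
end

# Shape B: cross the boundary once; the whole loop lives on the C++ side.
# Hypothetical signature, sketched only:
# total = ccall((:accumulate_array, "libexample"), Cdouble,
#               (Ptr{Cdouble}, Csize_t), pointer(xs), length(xs))

loop_ffi(1_000)  # 1000.0
```

Shape A pays per-call overhead a million times; Shape B pays it once, which is why inlining the callee into the Julia loop (the LTO tier) can beat even the whole-loop-in-C++ variant.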

Can you put these benchmarks on GitHub and just share links with short introductions here? Pasting a couple of snippets is fine, but having to scroll through a table is a bit much. Same with the docs sections; those were a bit confusing to some people without much explanation.

As for the benchmark, those numbers fit the goal, but how is the ccall wrapper doing as well as pure Julia or the llvmcall wrapper? I’d expect there to be ccall overhead each iteration, like in the bare ccall run.
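One way to answer that question is to compare the IR of the bare call and the wrapper: if Julia inlines the trivial wrapper, both compile to the same code and the wrapper costs nothing. A sketch, using libc’s `strlen` as a stand-in for the generated wrappers discussed above:

```julia
using InteractiveUtils  # for @code_llvm

# A hand-written ccall and a trivial Julia wrapper around it.
bare_len(s::String)    = ccall(:strlen, Csize_t, (Cstring,), s)
wrapped_len(s::String) = bare_len(s)  # one extra Julia frame, usually inlined away

wrapped_len("hello")  # == 5

# If the wrapper is inlined, these print essentially identical IR:
# @code_llvm debuginfo=:none bare_len("hello")
# @code_llvm debuginfo=:none wrapped_len("hello")
```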

The repo is public and I commit regularly; all the tests are also included inside the repo. RepliBuild.jl is registered, but you will need Linux, LLVM 21.1.8 (or any major 21 release) and a matching AUR MLIR 21 version, because RepliBuild uses newer JIT features not in MLIR.jl, so there is a custom C API bridge for that. Basically, any Linux with up-to-date Julia, LLVM, and MLIR will work.

Also, you caught something else: I forgot to take out `enable_lto = true`, and RepliBuild optimized the ccall to dispatch it as llvmcall… So I’m going to take the flag out of the TOML for the ccall and re-run the test to see what the un-optimized ccall generates in comparison. But yeah, RepliBuild did what it was supposed to, and Julia just dispatched it to llvmcall because of the flag.

Running add_to(acc, val) continuously across the FFI boundary.

| Tier | Median (ns) | ns / iter | Note |
|---|---|---|---|
| pure_julia | 676,738.0 | 0.677 | Julia `@inbounds` loop with native add |
| bare_ccall_loop | 1,800,310.0 | 1.800 | Julia loop: bare ccall in a typed function |
| wrapper_ccall_loop | 2,025,930.0 | 2.026 | Julia loop calling RepliBuild ccall wrapper (LTO disabled fallback) |
| lto_llvmcall_loop | 677,078.0 | 0.677 | Julia loop with LTO: Julia JIT inlines C++ add_to across FFI boundary |
| whole_loop_in_cpp | 997,147.0 | 0.997 | Single ccall to C++ accumulate_array (entire loop in C++) |