Is there an equivalent to cross-language link time optimization via LLVM?

What’s happening here? Are we just spitting LLM text at each other here? We’re a community of humans, let’s please talk to each other.


Sorry, those were copied from the recent commit for the project that works with cross-language LTO, and the changelog references templates for C++ directly, so I just pasted that too.

--- Per-call overhead (single invocation, 10000 samples) ---

  pure_julia           scalar_add                median=30.0 ns
  bare_ccall           scalar_add                median=30.0 ns
  wrapper_ccall        scalar_add                median=30.0 ns
  lto_llvmcall         scalar_add                median=30.0 ns
  pure_julia           scalar_mul                median=40.0 ns
  bare_ccall           scalar_mul                median=40.0 ns
  wrapper_ccall        scalar_mul                median=30.0 ns
  lto_llvmcall         scalar_mul                median=30.0 ns
  pure_julia           make_point                median=40.0 ns
  bare_ccall           make_point                median=40.0 ns
  wrapper_ccall        make_point                median=40.0 ns
  lto_llvmcall         make_point                median=40.0 ns
  bare_ccall_UNSAFE    pack_record               ⚠ skipped (packed struct ABI crash)
  wrapper_ccall        pack_record               median=80.0 ns

--- Hot loop: 1000000 iterations of add_to(acc, val) ---

  pure_julia           total=0.68 ms   0.677 ns/iter
  bare_ccall_loop      total=2.03 ms   2.026 ns/iter
  wrapper_ccall_loop   total=23.85 ms   23.854 ns/iter
  lto_llvmcall_loop    total=25.59 ms   25.594 ns/iter
  whole_loop_in_cpp    total=1.0 ms   0.997 ns/iter 

So this is where optimizing the code comes in, because I do have the llvmcall in place but I’m not currently matching the bare ccall. I also just got a lot of this working in the last 4 months, and it could be optimized for sure.
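For readers unfamiliar with the mechanism: this is not RepliBuild’s generated code, just a minimal hand-written sketch of what a `Base.llvmcall` binding looks like. The IR string is spliced directly into the function Julia’s JIT compiles, which is what lets a foreign function inline like native code:

```julia
# Minimal Base.llvmcall example: the LLVM IR body is compiled and inlined
# by Julia's JIT. The two arguments arrive as %0 and %1; with the implicit
# entry block taking %2, the first free SSA slot is %3.
llvm_add(x::Int64, y::Int64) = Base.llvmcall("""
    %3 = add i64 %1, %0
    ret i64 %3""", Int64, Tuple{Int64, Int64}, x, y)

llvm_add(2, 3)  # 5
```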

Those medians all being divisible by 10.0 ns suggests to me the single “invocation” (evaluation?) ran into timer resolution. More evaluations per sample, like the hot-loop benchmark, would address that. The wrapper adding 20 ns to a ccall is really bad. Have you checked whether LLVM optimizes away the wrapper’s branch, or whether the wrapper’s calls get inlined into other expressions?
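To make the batching suggestion concrete, here is one way to run many evaluations per timed sample so each sample spans thousands of timer ticks. This is a hand-rolled sketch using `time_ns`; BenchmarkTools.jl’s `@benchmark` with its `evals` and `samples` keywords does the same thing more carefully:

```julia
# Time `evals` calls per sample and take the minimum over `samples` samples,
# so each timed region is far longer than the clock's tick.
function ns_per_call(f, args...; evals=10_000, samples=200)
    acc = f(args...)              # warm up (forces compilation), keep result live
    best = typemax(UInt64)
    for _ in 1:samples
        t0 = time_ns()
        for _ in 1:evals
            acc = f(args...)      # keeping `acc` defeats dead-code elimination
        end
        best = min(best, time_ns() - t0)
    end
    best / evals, acc             # (minimum ns per call, last result)
end
```

Usage: `ns_per_call(+, 1, 2)` returns a sub-nanosecond figure for a native add, since the loop overhead is amortized over 10,000 evaluations.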

OK, I did have an issue in my wrapper. I fixed my code on the C++ side, and I have better results, and they’re very good.

Per-Call Overhead (vs Bare ccall)

| Scenario | Tier | Median (ns) | Vs bare ccall | Note |
|---|---|---|---|---|
| scalar_add | pure_julia | 31.0 | 1.0x | Julia native `a+b` |
| scalar_add | bare_ccall | 31.0 | 1.0x | Hand-written ccall (community baseline) |
| scalar_add | wrapper_ccall | 31.0 | 1.0x | RepliBuild generated ccall wrapper |
| scalar_add | lto_llvmcall | 31.0 | 1.0x | RepliBuild LTO: Base.llvmcall (Julia JIT inlines C++ IR) |
| scalar_mul | pure_julia | 41.0 | 1.0x | Julia native `a*b` |
| scalar_mul | bare_ccall | 41.0 | 1.0x | Hand-written ccall |
| scalar_mul | wrapper_ccall | 61.0 | 1.5x | RepliBuild generated ccall wrapper |
| scalar_mul | lto_llvmcall | 41.0 | 1.0x | RepliBuild LTO: Base.llvmcall |
| make_point | pure_julia | 41.0 | 1.0x | Julia native struct construction |
| make_point | bare_ccall | 41.0 | 1.0x | Hand-written ccall (struct return, manual layout) |
| make_point | wrapper_ccall | 41.0 | 1.0x | RepliBuild generated wrapper (ABI-verified layout) |
| make_point | lto_llvmcall | 41.0 | 1.0x | RepliBuild LTO: Base.llvmcall |
| pack_record | bare_ccall_UNSAFE | NaN | :warning: | Naive ccall: packed struct return crashes; cannot safely benchmark |
| pack_record | wrapper_ccall | 81.0 | | RepliBuild generated wrapper (DWARF-verified packed layout) |

Hot Loop (1,000,000 Iterations)

Running add_to(acc, val) continuously across the FFI boundary.

| Tier | Median (ns) | ns / iter | Note |
|---|---|---|---|
| pure_julia | 677,784.0 | 0.677 | Julia `@inbounds` loop with native add |
| bare_ccall_loop | 2,800,660.0 | 1.801 | Julia loop: bare ccall in a typed function |
| wrapper_ccall_loop | 677,889.0 | 0.677 | Julia loop calling RepliBuild ccall wrapper each iteration |
| lto_llvmcall_loop | 677,624.0 | 0.677 | Julia loop with LTO: Julia JIT inlines C++ add_to across FFI boundary |
| whole_loop_in_cpp | 997,968.0 | 0.997 | Single ccall to C++ accumulate_array (entire loop in C++) |

That just made C++ faster than C++.
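For anyone following along, the two loop shapes being compared can be sketched like this. Note the stand-ins: libm’s `cos` replaces the post’s C++ `add_to`, and the `accumulate_array` signature is only illustrative; the real benchmark calls a RepliBuild-built library.

```julia
# Shape A: cross the FFI boundary on every iteration (one ccall per trip).
function loop_ffi(n)
    acc = 0.0
    for _ in 1:n
        acc += ccall(:cos, Cdouble, (Cdouble,), 0.0)  # cos(0.0) == 1.0
    end
    acc
end

# Shape B: cross the boundary once; the whole loop lives on the C++ side.
# Hypothetical signature, sketched only:
# total = ccall((:accumulate_array, "libexample"), Cdouble,
#               (Ptr{Cdouble}, Csize_t), pointer(xs), length(xs))

loop_ffi(1_000)  # 1000.0
```

Shape A pays per-call overhead a million times; Shape B pays it once, which is why inlining the callee into the Julia loop (the LTO tier) can beat even the whole-loop-in-C++ variant.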

Can you put these benchmarks on GitHub and just share links with short introductions here? Pasting a couple of snippets is fine, but having to scroll through a table is a bit much. Same with the docs sections; those were a bit confusing to some people without much explanation.

As for the benchmark, those numbers fit the goal, but how is the ccall wrapper doing as well as pure Julia or the llvmcall wrapper? I’d expect there to be ccall overhead each iteration, like in the bare ccall run.
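One way to answer that question is to compare the IR of the bare call and the wrapper: if Julia inlines the trivial wrapper, both compile to the same code and the wrapper costs nothing. A sketch, using libc’s `strlen` as a stand-in for the generated wrappers discussed above:

```julia
using InteractiveUtils  # for @code_llvm

# A hand-written ccall and a trivial Julia wrapper around it.
bare_len(s::String)    = ccall(:strlen, Csize_t, (Cstring,), s)
wrapped_len(s::String) = bare_len(s)  # one extra Julia frame, usually inlined away

wrapped_len("hello")  # == 5

# If the wrapper is inlined, these print essentially identical IR:
# @code_llvm debuginfo=:none bare_len("hello")
# @code_llvm debuginfo=:none wrapped_len("hello")
```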

The repo is public and I commit regularly; all the tests are also included inside the repo. RepliBuild.jl is registered, but you will need Linux, LLVM 21.1.8 (or any major 21 release) and a matching AUR MLIR 21 version, because RepliBuild uses newer JIT features not in MLIR.jl, so there is a custom C API bridge for that. Basically, any Linux with up-to-date Julia, LLVM, and MLIR will work.

Also, you caught something else: I forgot to take out `enable_lto = true`, and RepliBuild optimized the ccall to dispatch it as llvmcall… So I’m going to take the flag out of the TOML for the ccall and re-run the test to see what the un-optimized ccall generates in comparison. But yeah, RepliBuild did what it was supposed to, and Julia just dispatched it to llvmcall because of the flag.

Running add_to(acc, val) continuously across the FFI boundary.

| Tier | Median (ns) | ns / iter | Note |
|---|---|---|---|
| pure_julia | 676,738.0 | 0.677 | Julia `@inbounds` loop with native add |
| bare_ccall_loop | 1,800,310.0 | 1.800 | Julia loop: bare ccall in a typed function |
| wrapper_ccall_loop | 2,025,930.0 | 2.026 | Julia loop calling RepliBuild ccall wrapper (LTO disabled fallback) |
| lto_llvmcall_loop | 677,078.0 | 0.677 | Julia loop with LTO: Julia JIT inlines C++ add_to across FFI boundary |
| whole_loop_in_cpp | 997,147.0 | 0.997 | Single ccall to C++ accumulate_array (entire loop in C++) |