What’s happening here? Are we just spitting LLM text at each other? We’re a community of humans, let’s please talk to each other.
Sorry, those were copied from the recent commit for the project that works with cross-language LTO, and the changelog references templates for C++ directly, so I just pasted that too.
```
--- Per-call overhead (single invocation, 10000 samples) ---
pure_julia         scalar_add   median=30.0 ns
bare_ccall         scalar_add   median=30.0 ns
wrapper_ccall      scalar_add   median=30.0 ns
lto_llvmcall       scalar_add   median=30.0 ns
pure_julia         scalar_mul   median=40.0 ns
bare_ccall         scalar_mul   median=40.0 ns
wrapper_ccall      scalar_mul   median=30.0 ns
lto_llvmcall       scalar_mul   median=30.0 ns
pure_julia         make_point   median=40.0 ns
bare_ccall         make_point   median=40.0 ns
wrapper_ccall      make_point   median=40.0 ns
lto_llvmcall       make_point   median=40.0 ns
bare_ccall_UNSAFE  pack_record  ⚠ skipped (packed struct ABI crash)
wrapper_ccall      pack_record  median=80.0 ns

--- Hot loop: 1000000 iterations of add_to(acc, val) ---
pure_julia          total=0.68 ms    0.677 ns/iter
bare_ccall_loop     total=2.03 ms    2.026 ns/iter
wrapper_ccall_loop  total=23.85 ms   23.854 ns/iter
lto_llvmcall_loop   total=25.59 ms   25.594 ns/iter
whole_loop_in_cpp   total=1.0 ms     0.997 ns/iter
```
So this is where optimizing the code comes in, because I do have the llvmcall in place but I’m not currently matching the bare ccall. I also just got a lot of this working in the last 4 months, and it could definitely be optimized further.
Those medians all being divisible by 10.0 ns suggests to me that the single-invocation measurement ran into timer resolution. More evaluations per sample, as in the hot-loop benchmark, would address that. The wrapper adding 20 ns to a ccall is really bad. Have you checked whether LLVM optimizes away the wrapper’s branch, or whether the wrapper’s calls get inlined into other expressions?
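To illustrate the timer-resolution point: if each sample times a single call, the result quantizes to the clock granularity (apparently ~10 ns here). Running many evaluations per sample and dividing by the count amortizes that away. A minimal sketch, not RepliBuild’s harness, just the general technique:

```julia
# Time a function with many evaluations per sample, so the coarse
# granularity of time_ns() is amortized over `evals` calls.
function bench_ns(f, args...; evals = 1_000, samples = 1_000)
    best = Inf
    for _ in 1:samples
        acc = 0.0
        t0 = time_ns()
        for _ in 1:evals
            acc += f(args...)          # accumulate so the call isn't dead code
        end
        t = (time_ns() - t0) / evals   # amortized ns per call
        best = min(best, t)            # minimum is robust to interference
        acc == -Inf && error("unreachable; keeps acc observable")
    end
    return best
end

scalar_add(a, b) = a + b
t = bench_ns(scalar_add, 1.0, 2.0)
println("scalar_add ≈ $(round(t, digits = 2)) ns/call")
```

Whether the wrapper’s branch survives can then be checked directly with `@code_llvm wrapper(1.0, 2.0)` (or `@code_native`): if the generated wrapper inlines, its body should be indistinguishable from the bare-ccall version.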
OK, I did have an issue in my wrapper; I fixed my code on the C++ side and I have better results, and they’re very good:
Per-Call Overhead (vs Bare ccall)

| Scenario | Tier | Median (ns) | Vs Bare ccall | Note |
|---|---|---|---|---|
| scalar_add | pure_julia | 31.0 | 1.0x | Julia native a+b |
| scalar_add | bare_ccall | 31.0 | 1.0x | Hand-written ccall (community baseline) |
| scalar_add | wrapper_ccall | 31.0 | 1.0x | RepliBuild generated ccall wrapper |
| scalar_add | lto_llvmcall | 31.0 | 1.0x | RepliBuild LTO: Base.llvmcall (Julia JIT inlines C++ IR) |
| scalar_mul | pure_julia | 41.0 | 1.0x | Julia native a*b |
| scalar_mul | bare_ccall | 41.0 | 1.0x | Hand-written ccall |
| scalar_mul | wrapper_ccall | 61.0 | 1.5x | RepliBuild generated ccall wrapper |
| scalar_mul | lto_llvmcall | 41.0 | 1.0x | RepliBuild LTO: Base.llvmcall |
| make_point | pure_julia | 41.0 | 1.0x | Julia native struct construction |
| make_point | bare_ccall | 41.0 | 1.0x | Hand-written ccall (struct return, manual layout) |
| make_point | wrapper_ccall | 41.0 | 1.0x | RepliBuild generated wrapper (ABI-verified layout) |
| make_point | lto_llvmcall | 41.0 | 1.0x | RepliBuild LTO: Base.llvmcall |
| pack_record | bare_ccall_UNSAFE | NaN | — | ccall — packed struct return crashes; cannot safely benchmark |
| pack_record | wrapper_ccall | 81.0 | — | RepliBuild generated wrapper (DWARF-verified packed layout) |
Hot Loop (2,000,000 Iterations)

Running add_to(acc, val) continuously across the FFI boundary.

| Tier | Median (ns) | ns / iter | Note |
|---|---|---|---|
| pure_julia | 677,784.0 | 0.677 | Julia @inbounds loop with native add |
| bare_ccall_loop | 2,800,660.0 | 1.801 | Julia loop — bare ccall in a typed function |
| wrapper_ccall_loop | 677,889.0 | 0.677 | Julia loop calling RepliBuild ccall wrapper each iteration |
| lto_llvmcall_loop | 677,624.0 | 0.677 | Julia loop with LTO: Julia JIT inlines C++ add_to across FFI boundary |
| whole_loop_in_cpp | 997,968.0 | 0.997 | Single ccall to C++ accumulate_array (entire loop in C++) |
That just made C++ faster than C++.
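For readers unfamiliar with the pattern being measured: the `*_loop` tiers cross the FFI boundary once per iteration, while `whole_loop_in_cpp` crosses once for the entire loop. A self-contained sketch of the per-iteration pattern, using libm’s `cos` as a stand-in for the benchmark’s C++ `add_to` (the real runs call RepliBuild-generated wrappers, which aren’t reproduced here):

```julia
# Stand-in for the benchmark's C++ function: a bare ccall into libm.
c_cos(x::Float64) = ccall(:cos, Cdouble, (Cdouble,), x)

# Per-iteration FFI crossing: one foreign call every trip through the loop.
function loop_over_ffi(n::Int)
    acc = 0.0
    for _ in 1:n
        acc += c_cos(0.5)
    end
    return acc
end

# Pure-Julia equivalent for comparison: no FFI boundary at all,
# so the call can be inlined and optimized with the loop.
function loop_native(n::Int)
    acc = 0.0
    for _ in 1:n
        acc += cos(0.5)
    end
    return acc
end

println(loop_over_ffi(1_000) ≈ loop_native(1_000))  # same result, different call path
```

The per-iteration version pays the foreign-call sequence every trip; moving the whole loop behind a single ccall (or inlining the callee’s IR via LTO, as in the `lto_llvmcall_loop` tier) removes that repeated cost.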
Can you put these benchmarks on GitHub and just share links with short introductions here? Pasting a couple of snippets is fine, but having to scroll through a table is a bit much. Same with the docs sections; those were a bit confusing to some people without much explanation.
As for the benchmark, those numbers fit the goal, but how is the ccall wrapper doing as well as pure Julia or the llvmcall wrapper? I’d expect ccall overhead on each iteration, like in the bare ccall run.
The repo is public and I commit regularly; all the tests are also included inside the repo. RepliBuild.jl is registered, but you will need Linux, LLVM 21.1.8 (or any major 21 release), and a matching AUR MLIR 21 version, because RepliBuild uses newer JIT features not in MLIR.jl, so there is a custom C API bridge for that. Basically, any Linux with up-to-date Julia, LLVM, and MLIR will work.
You also caught something else: I forgot to take out enable_lto = true, and RepliBuild optimized the ccall to dispatch it as llvmcall… so I’m going to take the flag out of the TOML for the ccall tier and re-run the test to see what an unoptimized ccall generates in comparison. But yeah, RepliBuild did what it was supposed to, and Julia just dispatched it to llvmcall because of the flag.
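For context on what “dispatched it to llvmcall” means: `Base.llvmcall` embeds LLVM IR directly into a Julia function, so the Julia JIT compiles and inlines it at the call site like native Julia code, instead of emitting an opaque foreign call. A minimal standalone example with hand-written IR (not RepliBuild’s generated output):

```julia
# Embed LLVM IR for a 32-bit integer add directly into a Julia function.
# The JIT inlines this IR at the call site, so there is no per-call FFI
# overhead, unlike a ccall into a shared library.
llvm_add(x::Int32, y::Int32) = Base.llvmcall(
    """
    %r = add i32 %0, %1
    ret i32 %r
    """,
    Int32,                 # return type
    Tuple{Int32, Int32},   # argument types
    x, y)

println(llvm_add(Int32(2), Int32(3)))  # → 5
```

This is why the LTO tier can match pure Julia in the hot loop: once the callee’s IR is visible to the JIT, the loop optimizes as if the function were written in Julia.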
Running add_to(acc, val) continuously across the FFI boundary.

| Tier | Median (ns) | ns / iter | Note |
|---|---|---|---|
| pure_julia | 676,738.0 | 0.677 | Julia @inbounds loop with native add |
| bare_ccall_loop | 1,800,310.0 | 1.800 | Julia loop — bare ccall in a typed function |
| wrapper_ccall_loop | 2,025,930.0 | 2.026 | Julia loop calling RepliBuild ccall wrapper (LTO disabled fallback) |
| lto_llvmcall_loop | 677,078.0 | 0.677 | Julia loop with LTO: Julia JIT inlines C++ add_to across FFI boundary |
| whole_loop_in_cpp | 997,147.0 | 0.997 | Single ccall to C++ accumulate_array (entire loop in C++) |