Researchers from the Technical University of Munich (TUM) have announced TPDE as a fast and adaptable compiler back-end framework.
I wonder if it could be useful to Julia in the future.
Beat me to it… had the same thought. Since something like this works with LLVM at some level, how much effort would be required to "drop it in"?
The primary goal is low-latency compilation while maintaining reasonable (-O0) code quality, e.g., as baseline compiler for JIT compilation or unoptimized builds. Currently, TPDE only targets ELF-based x86-64 and AArch64 (Armv8.1) platforms.
TPDE seems to be a research tool/benchmark, not aimed at production.
Not really, if -O0 stays where it is relative to -O3 in terms of performance.
In addition to the limited number of targets, do other unsupported features affect Julia?
From the list of unsupported LLVM-IR features in the TPDE docs:

- Integer types larger than i64 except i128 (i128 is supported); pointers with non-zero address space; half, bfloat, ppc_fp128, x86_fp80, x86_amx. Code with x86-64 long double needs to be compiled with -mlong-double-64.
- Many vector operations (e.g., <32 x i8> on x86-64); icmp/fcmp; pointer element type; getelementptr with vector types; select with vector predicate; integer extension/truncation; select with aggregate type other than {i64, i64}; bitcast larger than 64 bit.
- Some atomic orderings (e.g., seqcst for atomicrmw).
- fp128: fneg, fcmp one/ueq, many intrinsics.
- Computed goto (blockaddress, indirectbr).
- landingpad with non-empty filter clause.
- Some intrinsics (e.g., llvm.cttz only for 8/16/32/64-bit).

Of these, the ones that would be an issue for Julia: integer types larger than i64 except i128 (i128 is supported), pointers with non-zero address space, half, and bfloat.
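To make the pointer item concrete: Julia emits GC-tracked pointers in addrspace(10) in the IR it hands to LLVM, so nearly any function that touches boxed values trips this limitation. A quick check (output abridged; symbol names and attributes vary by Julia version):

% julia -q
julia> code_llvm(identity, Tuple{Any}; debuginfo=:none)
; Function Signature: identity(Any)
define nonnull ptr addrspace(10) @julia_identity(ptr addrspace(10) %"x::Any") #0 {
top:
  ret ptr addrspace(10) %"x::Any"
}

Float16 likewise lowers to LLVM's half type, so that item is not hypothetical for Julia either.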
Not really, if -O0 stays where it is relative to -O3 in terms of performance.

What about -O1? The report claims similar speedups for that optimization level as well.
Also cross-posting from the other thread I opened (TPDE-LLVM for 10-20x faster -O1 compilation) to provide some more links:
I just saw this post on Hacker News, where an LLVM compiler backend (GitHub) was posted that claims a 10-20x compile-time speedup compared to an LLVM 19 baseline.
Quoting from the linked LLVM-Discourse thread:
How to use TPDE-LLVM?

The LLVM back-end is usable as a library (e.g., for JIT compilation, also usable with ORC JIT), as llc-like tool, and can be integrated in Clang (needs a patch, plugins can't provide a custom back-end right now). Some more details here.

So I am wondering if it would be possible to compile Julia with this LLVM backend, at least for use cases for which -O0 or -O1 type performance is acceptable.
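For what it's worth, Julia already has a per-module knob for exactly these use cases: Base.Experimental.@optlevel lets a module opt into -O0/-O1 codegen, which is where a fast baseline back-end would pay off. A minimal sketch (the module and function names are made up):

% julia -q
julia> module LatencySensitive
           # Request -O1 codegen for methods defined in this module;
           # a faster back-end would make exactly these compiles cheaper.
           Base.Experimental.@optlevel 1
           f(xs) = sum(abs2, xs)
       end;

julia> LatencySensitive.f(rand(100));  # compiled at -O1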
I don’t know too much about how exactly LLVM is currently called from Julia, but quoting the TPDE-LLVM docs:
Library usage is possible through tpde_llvm::LLVMCompiler, which supports compiling a module to an object file or mapping it into the memory of the current process for JIT execution. The JIT mapper only supports very typical ELF constructs (e.g., no TLS), if this is not sufficient, the object file can also be mapped through LLVM’s ORC JIT (see tpde-llvm/tools/tpde-lli.cpp for an example).
Perhaps someone can comment on the feasibility here.
-O1 doesn't vectorise:
% julia -O1 -q
julia> code_llvm((Float64, Vector{Float64}, Vector{Float64}); debuginfo=:none) do a, x, y
for idx in eachindex(x, y)
y[idx] = muladd(a, x[idx], y[idx])
end
end
; Function Signature: var"#2"(Float64, Array{Float64, 1}, Array{Float64, 1})
define void @"julia_#2_903"(double %"a::Float64", ptr noundef nonnull align 8 dereferenceable(24) %"x::Array", ptr noundef nonnull align 8 dereferenceable(24) %"y::Array") local_unnamed_addr #0 {
top:
%"new::OneTo" = alloca [1 x i64], align 8
%"new::OneTo1" = alloca [1 x i64], align 8
%"new::Tuple40" = alloca [1 x i64], align 8
%"new::Tuple42" = alloca [1 x i64], align 8
%"x::Array.size_ptr" = getelementptr inbounds nuw i8, ptr %"x::Array", i64 16
%"x::Array.size.0.copyload" = load i64, ptr %"x::Array.size_ptr", align 8
store i64 %"x::Array.size.0.copyload", ptr %"new::OneTo", align 8
%"y::Array.size_ptr" = getelementptr inbounds nuw i8, ptr %"y::Array", i64 16
%"y::Array.size.0.copyload" = load i64, ptr %"y::Array.size_ptr", align 8
store i64 %"y::Array.size.0.copyload", ptr %"new::OneTo1", align 8
%.not.not = icmp eq i64 %"y::Array.size.0.copyload", %"x::Array.size.0.copyload"
br i1 %.not.not, label %L24, label %L21
L21: ; preds = %top
call void @j_throw_eachindex_mismatch_indices_906(ptr nonnull @"jl_global#907.jit", ptr nocapture nonnull readonly %"new::OneTo", ptr nocapture nonnull readonly %"new::OneTo1") #6
unreachable
L24: ; preds = %top
%0 = icmp slt i64 %"x::Array.size.0.copyload", 1
br i1 %0, label %L104, label %L33
L33: ; preds = %L88, %L24
%value_phi6 = phi i64 [ %6, %L88 ], [ 1, %L24 ]
%1 = add i64 %value_phi6, -1
%"x::Array.size9.0.copyload" = load i64, ptr %"x::Array.size_ptr", align 8
%.not = icmp ult i64 %1, %"x::Array.size9.0.copyload"
br i1 %.not, label %L50, label %L46
L46: ; preds = %L33
store i64 %value_phi6, ptr %"new::Tuple42", align 8
call void @j_throw_boundserror_905(ptr nonnull %"x::Array", ptr nocapture nonnull readonly %"new::Tuple42") #6
unreachable
L50: ; preds = %L33
%"y::Array.size12.0.copyload" = load i64, ptr %"y::Array.size_ptr", align 8
%.not50 = icmp ult i64 %1, %"y::Array.size12.0.copyload"
br i1 %.not50, label %L88, label %L65
L65: ; preds = %L50
store i64 %value_phi6, ptr %"new::Tuple40", align 8
call void @j_throw_boundserror_905(ptr nonnull %"y::Array", ptr nocapture nonnull readonly %"new::Tuple40") #6
unreachable
L88: ; preds = %L50
%memoryref_data = load ptr, ptr %"x::Array", align 8
%memoryref_byteoffset = shl i64 %1, 3
%memoryref_data10 = getelementptr inbounds i8, ptr %memoryref_data, i64 %memoryref_byteoffset
%2 = load double, ptr %memoryref_data10, align 8
%3 = fmul contract double %"a::Float64", %2
%memoryref_data14 = load ptr, ptr %"y::Array", align 8
%memoryref_data22 = getelementptr inbounds i8, ptr %memoryref_data14, i64 %memoryref_byteoffset
%4 = load double, ptr %memoryref_data22, align 8
%5 = fadd contract double %3, %4
store double %5, ptr %memoryref_data22, align 8
%.not51 = icmp eq i64 %value_phi6, %"x::Array.size.0.copyload"
%6 = add i64 %value_phi6, 1
br i1 %.not51, label %L104, label %L33
L104: ; preds = %L88, %L24
ret void
}
Okay, but I think there are many use cases where vectorization doesn't matter as much, and TTFX dominates the overall user time.
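For example, the standard TTFX measurement pattern shows this: the first call to a function pays the compilation latency, later calls don't (a sketch; timings omitted since they depend entirely on the machine):

% julia -q
julia> f(xs) = sum(abs2, xs);

julia> @time f(rand(10_000));   # first call: time dominated by compilation

julia> @time f(rand(10_000));   # second call: compilation cost already paid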
absolutely agreed.
It might be easier to use this TPDE-LLVM (for LLVM compatibility), or, as I suggest, bypass LLVM entirely (like Zig is doing; there LLVM is no longer the default):
Linked in that thread I found an insightful comment by @stevengj that seems relevant here, in response to replacing LLVM with a hand-rolled compiler à la Zig.
I agree, and I don't foresee the Julia compiler team writing their own LLVM replacement; it sure doesn't seem like they are currently starved for work, especially with 1.12 around the corner. TPDE, though, seems to be exactly the kind of case where we can potentially piggyback off someone else's priorities.
In truth, however, I don't know whether this would be a case of "just recompile Julia with TPDE-LLVM" (feasible) or rather one of "someone has to start maintaining a new code-generation branch suitable for TPDE-LLVM", which most likely nobody can commit to long-term at the moment. I would still be interested to hear comments on that.
Finally, of course, new funding is usually coupled with new potential applications, and I do believe that a drastically faster -O1 compiler could unlock new industry or research opportunities.
@RomeoV it’s in between. TPDE probably needs some significant improvements before it could be a Julia backend (see my rough list above), but after that, the Julia side should be pretty easy.