TPDE: A Fast Adaptable Compiler Back-End Framework

Researchers from the Technical University of Munich (TUM) have announced TPDE as a fast and adaptable compiler back-end framework.

I wonder if it could be useful to Julia in the future.

7 Likes

Beat me to it… had the same thought. Since something like this works with LLVM at some level, how much effort would be required to “drop it in?”

The primary goal is low-latency compilation while maintaining reasonable (-O0) code quality, e.g., as baseline compiler for JIT compilation or unoptimized builds.
Currently, TPDE only targets ELF-based x86-64 and AArch64 (Armv8.1) platforms.

TPDE seems to be a research tool/benchmark, not aimed at production.

1 Like

Not really, if -O0 stays where it is relative to -O3 in terms of performance.

2 Likes

In addition to the limited number of targets, would any of the other unsupported features affect Julia?

  • Targets other than x86-64-v1/AArch64 (ARMv8.1) (Linux) ELF.
  • Code models other than Small-PIC.
  • Scalar types: integer types larger than i64 (except i128, which is supported), pointers with non-zero address space, half, bfloat, ppc_fp128, x86_fp80, x86_amx. Code with x86-64 long double needs to be compiled with -mlong-double-64.
  • Vectors: types that are not directly legal on the target (e.g., <32 x i8> on x86-64); icmp/fcmp; pointer element type; getelementptr with vector types; select with vector predicate; integer extension/truncation.
  • select with aggregate type other than {i64, i64}.
  • bitcast larger than 64 bits.
  • Atomic operations might use a stronger consistency than required (e.g., always seqcst for atomicrmw).
  • Calling conventions other than the C calling convention (SysV on x86-64, AAPCS on AArch64).
  • fp128: fneg, fcmp one/ueq, many intrinsics.
  • Computed goto (blockaddress, indirectbr).
  • landingpad with non-empty filter clause.
  • Many intrinsics, and some intrinsics are only implemented for commonly used types (e.g., llvm.cttz only for 8/16/32/64-bit).
  • IFuncs.
  • Various forms of constant expressions in global initializers.
  • Non-empty inline assembly.
  • Full asynchronous unwind info (frame info only correct in prologue and at call sites).
  • Several corner cases that we didn’t encounter so far.
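As a rough first check (just a sketch, not an official compatibility tool), one could grep the textual LLVM IR that Julia emits for a few of the constructs listed above. The pattern list below is illustrative and far from exhaustive, and `call.*asm` will also catch some false positives:

```shell
# Rough sketch: scan textual LLVM IR for a few constructs TPDE lists as unsupported.
# First dump IR from Julia, e.g.:
#   open(io -> code_llvm(io, f, Tuple{Float64}), "module.ll", "w")
grep -nE 'indirectbr|blockaddress|x86_fp80|ppc_fp128|ifunc|call.*asm' module.ll \
  && echo "potentially unsupported constructs found" \
  || echo "none of the checked constructs found"
```

A real check would have to walk the IR with the LLVM API rather than pattern-match text, but this is enough to get a feel for how often the problematic constructs show up in practice.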

Of these, the ones that would be an issue for Julia:

  • Scalar types: integer types larger than i64 (except i128, which is supported), pointers with non-zero address space, half, bfloat.
  • Calling conventions other than the C calling convention.
  • Many intrinsics.
  • Non-empty inline assembly.
1 Like

Not really, if -O0 stays where it is relative to -O3 in terms of performance.

What about -O1? The report claims similar speedups for that optimization level as well.

Also cross-posting from the other thread I opened (TPDE-LLVM for 10-20x faster -O1 compilation) to provide some more links:

I just saw this post on Hacker News, where an LLVM compiler back-end (GitHub) was posted that claims a 10-20x compile-time speedup compared to an LLVM 19 baseline.

Quoting from the linked LLVM-Discourse thread:

How to use TPDE-LLVM?

The LLVM back-end is usable as a library (e.g., for JIT compilation, also usable with ORC JIT), as an llc-like tool, and can be integrated into Clang (needs a patch, plugins can’t provide a custom back-end right now). Some more details here.

So I am wondering if it would be possible to compile Julia with this LLVM back-end, at least for use cases where -O0- or -O1-level performance is acceptable.

I don’t know too much about how exactly LLVM is currently called from Julia, but quoting the TPDE-LLVM docs:

Library usage is possible through tpde_llvm::LLVMCompiler, which supports compiling a module to an object file or mapping it into the memory of the current process for JIT execution. The JIT mapper only supports very typical ELF constructs (e.g., no TLS), if this is not sufficient, the object file can also be mapped through LLVM’s ORC JIT (see tpde-llvm/tools/tpde-lli.cpp for an example).
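For experimenting outside Julia first, the repository suggests an lli-style workflow. A sketch, assuming a standard CMake build with LLVM development libraries installed, and assuming the binary keeps the `tpde-lli` name from `tpde-llvm/tools/tpde-lli.cpp` (the exact build output path and flags may differ; check the repo docs):

```shell
# From a checkout of the TPDE repository (assumes LLVM dev libraries installed):
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# JIT-execute an LLVM IR module with the lli-like tool.
# Binary name taken from tpde-llvm/tools/tpde-lli.cpp; path is an assumption.
./build/tpde-llvm/tpde-lli module.ll
```

Being able to feed it `.ll` files dumped from Julia (e.g. via `code_llvm`) would be a cheap way to probe the unsupported-feature list above before attempting any real integration.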

Perhaps someone can comment on the feasibility here.

-O1 doesn’t vectorise:

% julia -O1 -q
julia> code_llvm((Float64, Vector{Float64}, Vector{Float64}); debuginfo=:none) do a, x, y
           for idx in eachindex(x, y)
               y[idx] = muladd(a, x[idx], y[idx])
           end
       end
; Function Signature: var"#2"(Float64, Array{Float64, 1}, Array{Float64, 1})
define void @"julia_#2_903"(double %"a::Float64", ptr noundef nonnull align 8 dereferenceable(24) %"x::Array", ptr noundef nonnull align 8 dereferenceable(24) %"y::Array") local_unnamed_addr #0 {
top:
  %"new::OneTo" = alloca [1 x i64], align 8
  %"new::OneTo1" = alloca [1 x i64], align 8
  %"new::Tuple40" = alloca [1 x i64], align 8
  %"new::Tuple42" = alloca [1 x i64], align 8
  %"x::Array.size_ptr" = getelementptr inbounds nuw i8, ptr %"x::Array", i64 16
  %"x::Array.size.0.copyload" = load i64, ptr %"x::Array.size_ptr", align 8
  store i64 %"x::Array.size.0.copyload", ptr %"new::OneTo", align 8
  %"y::Array.size_ptr" = getelementptr inbounds nuw i8, ptr %"y::Array", i64 16
  %"y::Array.size.0.copyload" = load i64, ptr %"y::Array.size_ptr", align 8
  store i64 %"y::Array.size.0.copyload", ptr %"new::OneTo1", align 8
  %.not.not = icmp eq i64 %"y::Array.size.0.copyload", %"x::Array.size.0.copyload"
  br i1 %.not.not, label %L24, label %L21

L21:                                              ; preds = %top
  call void @j_throw_eachindex_mismatch_indices_906(ptr nonnull @"jl_global#907.jit", ptr nocapture nonnull readonly %"new::OneTo", ptr nocapture nonnull readonly %"new::OneTo1") #6
  unreachable

L24:                                              ; preds = %top
  %0 = icmp slt i64 %"x::Array.size.0.copyload", 1
  br i1 %0, label %L104, label %L33

L33:                                              ; preds = %L88, %L24
  %value_phi6 = phi i64 [ %6, %L88 ], [ 1, %L24 ]
  %1 = add i64 %value_phi6, -1
  %"x::Array.size9.0.copyload" = load i64, ptr %"x::Array.size_ptr", align 8
  %.not = icmp ult i64 %1, %"x::Array.size9.0.copyload"
  br i1 %.not, label %L50, label %L46

L46:                                              ; preds = %L33
  store i64 %value_phi6, ptr %"new::Tuple42", align 8
  call void @j_throw_boundserror_905(ptr nonnull %"x::Array", ptr nocapture nonnull readonly %"new::Tuple42") #6
  unreachable

L50:                                              ; preds = %L33
  %"y::Array.size12.0.copyload" = load i64, ptr %"y::Array.size_ptr", align 8
  %.not50 = icmp ult i64 %1, %"y::Array.size12.0.copyload"
  br i1 %.not50, label %L88, label %L65

L65:                                              ; preds = %L50
  store i64 %value_phi6, ptr %"new::Tuple40", align 8
  call void @j_throw_boundserror_905(ptr nonnull %"y::Array", ptr nocapture nonnull readonly %"new::Tuple40") #6
  unreachable

L88:                                              ; preds = %L50
  %memoryref_data = load ptr, ptr %"x::Array", align 8
  %memoryref_byteoffset = shl i64 %1, 3
  %memoryref_data10 = getelementptr inbounds i8, ptr %memoryref_data, i64 %memoryref_byteoffset
  %2 = load double, ptr %memoryref_data10, align 8
  %3 = fmul contract double %"a::Float64", %2
  %memoryref_data14 = load ptr, ptr %"y::Array", align 8
  %memoryref_data22 = getelementptr inbounds i8, ptr %memoryref_data14, i64 %memoryref_byteoffset
  %4 = load double, ptr %memoryref_data22, align 8
  %5 = fadd contract double %3, %4
  store double %5, ptr %memoryref_data22, align 8
  %.not51 = icmp eq i64 %value_phi6, %"x::Array.size.0.copyload"
  %6 = add i64 %value_phi6, 1
  br i1 %.not51, label %L104, label %L33

L104:                                             ; preds = %L88, %L24
  ret void
}
1 Like

Okay, but I think there are many use cases where vectorization doesn’t matter as much, and TTFX dominates the overall user time.

Absolutely agreed.

It might be easier to use TPDE-LLVM (for LLVM compatibility), or, as I suggest, to bypass LLVM entirely (like Zig is doing; it’s no longer their default back-end):

1 Like

Linked in that thread I found an insightful comment by @stevengj that seems relevant here, in response to replacing LLVM with a hand-rolled compiler à la Zig.

I agree, and I don’t foresee the Julia compiler team writing their own LLVM replacement; it sure doesn’t seem like they are currently starved for work, especially with 1.12 around the corner. The case of TPDE, though, seems to be exactly the case where we can potentially piggyback off of someone else’s priorities.
In truth, however, I don’t know if this would be a case of “just recompile Julia with TPDE-LLVM” (feasible) or rather a “someone has to start maintaining a new code-generation branch suitable for TPDE-LLVM”, which most likely nobody can commit to long-term at the moment. I would still be interested to hear comments on that.
Finally, of course new funding is usually coupled with new potential applications, and I do believe that a drastically faster -O1 compiler could unlock new industry or research opportunities.

@RomeoV it’s in between. TPDE probably needs some significant improvements before it could be a Julia backend (see my rough list above), but after that, the Julia side should be pretty easy.

2 Likes