After reading a few posts (and a suggestion from @wmshen0323 to try Enzyme), Reactant got my attention. My understanding is that, if I manage to compile my code with Reactant, I get GPU execution essentially for free and I can differentiate the code with Enzyme.
I did a few tests on my (very old) laptop and I was quite impressed.
Small, successful example
using Reactant, Tullio, LoopVectorization, FFTW, BenchmarkTools, LinearAlgebra
import AbstractFFTs
# --- 1. THE ORIGINAL PIPELINE (Pre-planned) ---
function full_pipeline_original(vals, T, plan)
# Apply pre-computed plan
coefs = plan * vals
s = size(coefs)
coefs ./= 2*(s[1]-1)
# Interior doubling (optimized with broadcasting for comparison)
weights = ones(Float64, s[1])
weights[2:end-1] .= 2.0
coefs .*= reshape(weights, (s[1], 1, 1))
# Contract
@tullio w[i,j,k] := coefs[j,k,l] * T[i,j,k,l]
return w
end
# --- 2. THE REACTANT PIPELINE ---
function reactant_dct1_equivalent(x)
n = size(x, 1)
# Traceable mirroring logic
mirrored = vcat(x, reverse(x[2:end-1, :, :], dims=1))
transformed = real.(AbstractFFTs.fft(mirrored, 1))
return transformed[1:n, :, :]
end
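# Why the mirroring above works (my understanding; quick sanity check on plain
# CPU arrays, outside Reactant): the unnormalized DCT-I (FFTW.REDFT00) of a
# length-n signal equals the first n real FFT outputs of its length-2(n-1)
# even extension, which is exactly what reactant_dct1_equivalent constructs.
# For example, on a small random array this should hold up to roundoff:
#   x = rand(8, 3, 3)
#   FFTW.r2r(x, FFTW.REDFT00, 1) ≈ reactant_dct1_equivalent(x)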
function fast_chebcoefs_reactant(vals)
coefs = reactant_dct1_equivalent(vals)
s1 = size(coefs, 1)
weights = ones(Float64, s1)
weights[2:end-1] .= 2.0
scale = 1.0 / (2 * (s1 - 1))
return (coefs .* scale) .* reshape(weights, (s1, 1, 1))
end
function full_pipeline_reactant(vals, T)
c = fast_chebcoefs_reactant(vals)
c_reshaped = reshape(c, 1, size(c)...)
return dropdims(sum(c_reshaped .* T; dims=4); dims=4)
end
# --- 3. DATA SETUP ---
I, J, K, L = 20, 96, 48, 128
vals_cpu = rand(Float64, J, K, L)
T_cpu = rand(Float64, I, J, K, L)
# Pre-compute the FFTW plan
# Use PATIENT to give FFTW the best possible chance
println("Planning FFTW...")
plan = FFTW.plan_r2r(copy(vals_cpu), FFTW.REDFT00, [1]; flags=FFTW.PATIENT)
# Setup Reactant data and Compile
vals_ra = Reactant.Array(vals_cpu)
T_ra = Reactant.Array(T_cpu)
println("Compiling Reactant...")
compiled_pipe = @compile full_pipeline_reactant(vals_ra, T_ra)
# --- 4. VERIFICATION ---
res_orig = full_pipeline_original(vals_cpu, T_cpu, plan)
res_ra = compiled_pipe(vals_ra, T_ra)
diff = norm(Array(res_ra) - res_orig) / norm(res_orig)
println("\nVerification Relative Difference: $diff")
# --- 5. THE COMPARISON ---
println("\n--- BENCHMARK: ORIGINAL (Pre-planned FFTW + Tullio) ---")
display(@benchmark full_pipeline_original($vals_cpu, $T_cpu, $plan))
println("\n--- BENCHMARK: REACTANT (Compiled XLA Pipeline) ---")
display(@benchmark $compiled_pipe($vals_ra, $T_ra))
with results
--- BENCHMARK: ORIGINAL (Pre-planned FFTW + Tullio) ---
julia> display(@benchmark full_pipeline_original($vals_cpu, $T_cpu, $plan))
BenchmarkTools.Trial: 316 samples with 1 evaluation per sample.
 Range (min … max):  14.283 ms … 21.669 ms  ┊ GC (min … max): 0.00% … 21.24%
 Time  (median):     15.725 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.810 ms ±  1.110 ms  ┊ GC (mean ± σ):  1.24% ± 2.98%
 [histogram omitted]
  14.3 ms         Histogram: frequency by time         18.6 ms <
 Memory estimate: 5.21 MiB, allocs estimate: 56.

julia> println("\n--- BENCHMARK: REACTANT (Compiled XLA Pipeline) ---")

--- BENCHMARK: REACTANT (Compiled XLA Pipeline) ---
julia> display(@benchmark $compiled_pipe($vals_ra, $T_ra))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  51.366 μs … 1.502 ms  ┊ GC (min … max): 0.00% … 90.80%
 Time  (median):     60.232 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   72.141 μs ± 93.269 μs ┊ GC (mean ± σ):  13.85% ± 9.98%
 [histogram omitted]
  51.4 μs      Histogram: log(frequency) by time        842 μs <
 Memory estimate: 720.09 KiB, allocs estimate: 3.
I was quite impressed!
This improves our previous results by a factor of 10… basically for free! Now, I have a few questions.
1. **Autodiff.** Is it true that, if I can compile my program with Reactant, I will be able to differentiate it with Enzyme (of course there might be edge cases; let's say for a standard scenario)? Can I/should I write some rrules, or can I assume the performance is good in most cases? If not, can I just write rrules using Enzyme the usual way?
2. **SciML & ODEs.** An important part of my workflow is solving ODEs with the SciML stack. Can I compile that part as well, or will it stay outside of the compilation?
3. **Turing.** After producing all of this high-performance code, I usually put it in a pipeline that leverages Turing to either run chains or compute best fits. I don't do anything unusual on the Turing side (Uniform and Normal priors, occasionally the submodel syntax). Can I expect these pieces of Turing to work well with Enzyme?
4. **Compilation times.** While I am more than happy with the final performance of the code, I wonder if there is anything I can do to reduce the compilation times. For some of the tests I am doing, rewriting some legacy code, I see compilation times on the order of 20 minutes. These are more than compensated by the final performance, but is there a best practice or approach to reduce them?
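To make the autodiff question concrete, this is the pattern I have in mind (an untested sketch; I am assuming here that `Enzyme.gradient` can be traced and compiled under Reactant's `@jit`, and the `loss` wrapper is purely illustrative):

```julia
using Enzyme

# Illustrative scalar loss around the pipeline defined above
loss(vals, T) = sum(abs2, full_pipeline_reactant(vals, T))

# Trace and compile the reverse pass itself; Const marks T as
# non-differentiated, so only the gradient w.r.t. vals is returned
dvals, _ = @jit Enzyme.gradient(Enzyme.Reverse, loss, vals_ra, Const(T_ra))
```

Is this roughly the intended usage, or is there a preferred entry point?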
Thanks in advance!