I’ve been trying to train a small descriptor CNN (~1M params, 64×64
grayscale patch input → 128-dim L2-normalized embedding) with a
Supervised Contrastive (SupCon) loss in pure Julia / Lux on a single
B200 GPU, and have hit what looks like a fundamental compile-time
wall in every Julia code path I’ve tried. I’d appreciate a sanity
check on whether this is expected for the stack, or whether I’m
holding something wrong.
Versions: Julia 1.12.5, Lux 1.31.3, Reactant 0.2.x, Enzyme 0.13.134,
Zygote 0.7, CUDA 5.x, cuDNN 1.4.7 (artifact 9.20). All running under
SLURM sbatch on a fresh process (no warm REPL).
Workload:
- Data: 3.6M training patches (66 GB on disk as JLD2), 56k blob-identity classes, plus 709k confuser patches.
- Batch composition (tuned for HBM3e): 384 anchor blobs × 10 views + 2048 confusers = 5888 patches per step. Patch size 64×64 Float32.
- Model: a flat Chain of ~25 layers (Conv→BatchNorm pattern × 6 with 2× downsamples + head).
- Loss: vectorized SupCon — S = embeddings' * embeddings, build (N,N) Bool masks for self/positive, log-sum-exp per row, average the log-prob over positives. No scalar indexing, no mutation, no Python-style loops.
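For concreteness, here is the shape of that loss as a minimal sketch under the same constraints (broadcast-only, no scalar indexing, no mutation). Names, the temperature default, and the zero-positive guard are illustrative, not the actual supcon_loss_mat:

```julia
using Statistics

# Sketch of a vectorized SupCon loss. Z is a (d, N) matrix of
# L2-normalized embeddings, labels a length-N vector of class ids,
# τ the temperature. All illustrative, not the author's code.
function supcon_sketch(Z::AbstractMatrix, labels::AbstractVector; τ = 0.1f0)
    N    = size(Z, 2)
    S    = (Z' * Z) ./ τ                       # (N, N) similarity matrix
    self = (1:N) .== (1:N)'                    # Bool mask of the diagonal
    pos  = (labels .== labels') .& .!self      # positives: same label, not self
    m    = maximum(S .- Inf32 .* self; dims = 2)  # row max over non-self entries
    lse  = m .+ log.(sum(exp.(S .- m) .* .!self; dims = 2))
    logp = S .- lse                            # row-wise log-softmax over non-self
    npos = max.(sum(pos; dims = 2), 1)         # guard rows with no positives
    -mean(sum(logp .* pos; dims = 2) ./ npos)  # mean log-prob over positives
end
```

Since everything here is a matmul plus elementwise broadcasts, the same code should run unchanged over CuArrays on the forward pass; the question in the post is about the cost of deriving the reverse-mode code, not about the forward formulation.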
The wall — three independent paths, all hung past the first step:
1. Reactant XLA + AutoEnzyme via single_train_step! (the path Lux's official ResNet20 tutorial uses): the driver entered single_train_step!, Reactant's XLA service initialized cleanly on the B200, BFC pre-allocated 143 GB on device 0, cuDNN 9.14 loaded. Then 45+ minutes of 100% single-thread CPU, no GPU memory ever allocated, no first-step output. A backtrace via SIGUSR1 confirmed it was inside single_train_step_impl_with_allocator_cache! — pure Reactant tracing/lowering of the SupCon backward graph.
2. Eager CUDA + Zygote, with a manual Zygote.withgradient + Optimisers.update! loop (the maintainer-suggested workaround for single_train_step! overhead in Lux #1704): same wall. 1+ hour of 100% CPU, RSS plateaued at 92 GB after data load, no GPU memory, no first step. The backtrace showed pure type inference / LLVM codegen, with no GC yields between SIGUSR1 and an observation 50 minutes later.
3. Reactant @compile model(x, ps, st) forward warm-up + single_train_step!(AutoEnzyme()) (matches the tutorial pattern more precisely): the @compile warm-up actually finished (~17 min). Then single_train_step!'s first call started compiling the backward pass and again sat at 100% CPU with no output for the next 40+ minutes. Caching the forward HLO doesn't shortcut Enzyme's backward derivation.
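For reference, path 2 above has roughly this shape — a toy stand-in model and batch for a minimal repro, not the real 25-layer Chain or the 5888-patch step, with supcon_loss_mat standing in for the loss described earlier:

```julia
using Lux, Optimisers, Random, Zygote

# Toy stand-in for the eager CUDA + Zygote path: a tiny Conv→BN→head
# chain and a 32-patch batch, not the real network or batch size.
rng   = Random.default_rng()
model = Chain(Conv((3, 3), 1 => 32, pad = 1), BatchNorm(32, relu),
              GlobalMeanPool(), FlattenLayer(), Dense(32 => 128))
ps, st    = Lux.setup(rng, model)
opt_state = Optimisers.setup(Adam(3f-4), ps)

x      = rand(rng, Float32, 64, 64, 1, 32)  # toy batch of 32 patches
labels = rand(rng, 1:4, 32)                 # toy class ids

# One manual training step: Zygote.withgradient + Optimisers.update!.
loss, grads = Zygote.withgradient(ps) do p
    z, _ = model(x, p, st)
    zn   = z ./ sqrt.(sum(abs2, z; dims = 1))  # L2-normalize columns
    supcon_loss_mat(zn, labels)                # the loss from the post
end
opt_state, ps = Optimisers.update!(opt_state, ps, grads[1])
```

The first call to withgradient is where both type inference and the reverse-mode code generation for the full model + loss closure happen, which is why the wall appears before the first step rather than during it.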
Things I've ruled out:
- It's not a hang — ps shows 99–100% CPU steadily, RSS stable; the process is making progress, just very slowly.
- Not a data-load issue — "Loading training data…" and "Model: 1057696 params, batch=5888, …" both print before the wall.
- Not the loss — supcon_loss_mat is a fully vectorized broadcast over (N,N) matrices, no scalar indexing, AD-friendly.
- Not BatchNorm specifically (the tutorial uses BN too).
- Not nested Chains specifically (I flattened them per the ResNet20 tutorial pattern — same wall).
- Not Lux.Training specifically (a manual Zygote.gradient + Optimisers.update! loop hits the same wall).
What seems to be the common factor: the cost of generating reverse-mode code for (::Function, ::TrainState{Model, Params, State, Opt, OptState}) (or the equivalent closure type for the manual loop), where the captured model is a Chain of ~25 layers and Params/State are deeply nested NamedTuples. Lux #1484 documents a 200k-param parallel-CNN model taking 11+ min via single_train_step! with Reactant + Enzyme; we're at 1M params with a similarity-matrix loss on top, so 60+ min isn't even surprising in that context. Lux #1704 explicitly acknowledges single_train_step!'s shape-specialization overhead.
Searches I've done that didn't help:
- No Julia/Flux/Lux implementation of SupCon, NT-Xent, or InfoNCE exists in any FluxML or LuxDL repo, model zoo, or tutorial. Flux only has the pairwise siamese_contrastive_loss. The contrastive-learning workload appears genuinely unimplemented in pure Julia.
- PackageCompiler + CUDA + Lux has documented historical issues (Flux #1337 — sysimage GPU access violations; a 2020 Discourse thread on Zygote/PackageCompiler incompatibility). The community has broadly moved to PrecompileTools, but those caches don't cover our specific call signature.
- Reactant #1990 flags Lux + Conv layers + Julia 1.12 as fragile (a different error mode than ours, but the same combo).
The questions I'd love community input on:
1. Is anyone actually training a contrastive / large-batch metric-learning model in Julia (Lux or Flux) end-to-end on GPU? If yes — what stack, what model size, what batch size, and how long does the first-step compile take?
2. Are there model-architecture or loss-formulation idioms that specifically avoid the compile-time wall on similarity-matrix losses, beyond "smaller batch, fewer layers"?
3. Is there a way to get Reactant to cache the compiled training step across processes (a sysimage equivalent for HLO), so we pay the wall once per machine rather than once per sbatch? @compile artifacts seem to be in-process only.
4. Is the realistic answer "use PythonCall + PyTorch for this specific workload"? It feels like it given the search evidence, but I'd love to hear if anyone has gotten contrastive training working in pure Julia and what it took.
Setup, model code, loss code, and full backtraces available on
request — happy to share a minimal repro if the symptoms sound
resolvable.
Thanks in advance.