Just had a lot of fun implementing a VAE (Variational AutoEncoder) with Lux and Reactant. It barely works, so here is a short summary of the errors.
Reactant errors with:
'stablehlo.transpose' op using value defined outside the region
ERROR: "failed to run pass manager on module"
CPU with Enzyme errors with:
Duplicated(Decoder,RefValue) error
Note: the gradient can be calculated when using Reactant, but not when using the CPU; training still fails in both cases.
It works with AutoZygote() on CPU; I did not try it on GPU.
repo : GitHub - yolhan83/MLX_exemples_julia_reactant: trying to implement some of the "mlx-examples" repo code
I know I should make an MWE from this, but that may be hard; we will see. Any ideas are welcome.
I encountered similar errors before. I believe they typically mean you are mixing different floating-point precisions, which Reactant cannot handle at the moment. Looking at your code, this line specifically could be the issue, as it's using 0.5, which is of type Float64, while the rest of your code appears to use Float32. Otherwise, it would help to see the whole stacktrace.
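For illustration, here is a minimal sketch of how a single Float64 literal can promote an otherwise Float32 computation. It uses plain Julia arrays rather than Reactant, and the names kl_f64 and kl_f32 are hypothetical stand-ins for the KL term of a VAE loss, not the code from the repo.

```julia
# Hypothetical KL term written with a Float64 literal (0.5).
kl_f64(mu, logvar) = -0.5 * sum(1 .+ logvar .- mu .^ 2 .- exp.(logvar))

# Same term with a Float32 literal (0.5f0); the result stays Float32.
kl_f32(mu, logvar) = -0.5f0 * sum(1 .+ logvar .- mu .^ 2 .- exp.(logvar))

# Made-up Float32 inputs, as an encoder would typically produce.
mu = Float32[0.1, -0.2, 0.3]
logvar = Float32[0.0, 0.1, -0.1]

println(typeof(kl_f64(mu, logvar)))  # the Float64 literal promotes the result
println(typeof(kl_f32(mu, logvar)))  # no promotion with the Float32 literal
```

Integer literals such as the 1 and 2 above do not promote Float32 values, so only the floating-point literals need the f0 suffix.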
Thanks, yes, that could be bad. It is still not working, with the same error. Here is the full error:
err.jl (85.0 KB)
By the way, the loss and the loss gradient compile fine; the failure is really in the optimisation process.
If you look at the stacktrace, you can still see ConcreteRNumber{Float64} in there, meaning you are still promoting to Float64 somewhere. Since you say this happens only in the optimization process, my best guess is that you need to make the learning rate in the Adam optimizer a Float32 as well, so replace 1e-3 with 1f-3.
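One way to hunt for stray Float64 values is to walk the parameter or optimizer-state structure and report every leaf that is not Float32. The helper below is a hypothetical sketch in plain Julia (non_f32_paths is not part of Lux or Reactant); it only handles nested NamedTuples, Tuples, arrays, and scalars, and the example values are made up.

```julia
# Recursively collect the paths of all floating-point leaves that are not Float32.
function non_f32_paths!(out, x, path)
    if x isa NamedTuple
        for (k, v) in pairs(x)
            non_f32_paths!(out, v, string(path, ".", k))
        end
    elseif x isa Tuple
        for (i, v) in enumerate(x)
            non_f32_paths!(out, v, string(path, "[", i, "]"))
        end
    elseif x isa AbstractArray{<:AbstractFloat} && eltype(x) !== Float32
        push!(out, string(path, " :: ", eltype(x)))
    elseif x isa AbstractFloat && !(x isa Float32)
        push!(out, string(path, " :: ", typeof(x)))
    end
    return out
end

non_f32_paths(x) = non_f32_paths!(String[], x, "root")

# Made-up state: zeros(2) and 1e-3 are both Float64 by default.
state = (encoder = (weight = rand(Float32, 2, 2), bias = zeros(2)), eta = 1e-3)
bad = non_f32_paths(state)
foreach(println, bad)

# The literal suffix decides the type of the learning rate.
println(typeof(1e-3), " vs ", typeof(1f-3))
```

Running this on the structure right before the training step should narrow down which field still carries a Float64.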
The transpose issue is definitely a bug and shouldn't happen (regardless of mixed precision or not, which should be fine?).
Please open an issue with a reproducer, ideally reduced as far as you can to show where it's coming from.
Also cc @avikpal
Is this on the latest releases of Lux and Reactant?
I also happen to have a partial implementation of CVAE from the MLX repo (Lux.jl/examples/ConditionalVAE/main.jl at ap/cvae2 · LuxDL/Lux.jl · GitHub), but it probably needs to be updated.
Yes, it is the latest on both, and your code looks very similar to mine. Is it working fine?
Thank you, I was going crazy trying to find where those are. I will try to make the MWE.
Yes, it’s training fine for the most part. There are some of the usual VAE issues with NaNs, which I am trying to sort out.
Thank you. I’ve made mine work too by putting the layers together in a Chain instead of keeping them separate in the struct. I have not tried to make the MWE yet, though; we will see.
The transpose issue has been resolved in the latest releases of Reactant (v0.2.17) and Lux (v1.5).