Flux model-zoo: Error running vae_mnist.jl


I am trying to run the vae_mnist.jl script from the model zoo (https://github.com/FluxML/model-zoo/blob/master/vision/vae_mnist/vae_mnist.jl). However, I am getting the following error. (I use optirun to enable the GPU, but I get the same error when running on the CPU.)

$ optirun ~/Applications/julia/release/bin/julia --project=. vae_mnist.jl 
[ Info: Training on GPU
┌ Warning: `DataLoader(x...; kws...)` is deprecated, use `DataLoader(x; kws...)` instead.
│   caller = ip:0x0
└ @ Core :-1
[ Info: Start Training, total 20 epochs
[ Info: Epoch 1
┌ Warning: logitbinarycrossentropy.(ŷ, y) is deprecated, use Losses.logitbinarycrossentropy(ŷ, y, agg=identity) instead
└ @ Flux ~/.julia/packages/Flux/IjMZL/src/deprecations.jl:16
ERROR: LoadError: Compiling Tuple{typeof(Base.Broadcast.broadcasted),typeof(logitbinarycrossentropy),CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,2}}: try/catch is not supported.
 [1] error(::String) at ./error.jl:33
 [2] instrument(::IRTools.Inner.IR) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/reverse.jl:89
 [3] #Primal#20 at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/reverse.jl:170 [inlined]
 [4] Zygote.Adjoint(::IRTools.Inner.IR; varargs::Nothing, normalise::Bool) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/reverse.jl:283
 [5] _lookup_grad(::Type{T} where T) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/emit.jl:101
 [6] #s2925#1323 at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/interface2.jl:39 [inlined]
 [7] #s2925#1323(::Any, ::Any, ::Any) at ./none:0
 [8] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any,N} where N) at ./boot.jl:526
 [9] model_loss at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:62 [inlined]
 [10] (::typeof(∂(model_loss)))(::Float32) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/interface2.jl:0
 [11] #9 at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:140 [inlined]
 [12] (::typeof(∂(λ)))(::Float32) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/interface2.jl:0
 [13] (::Zygote.var"#54#55"{Zygote.Params,Zygote.Context,typeof(∂(λ))})(::Float32) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/interface.jl:177
 [14] train(; kws::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:142
 [15] train() at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:92
 [16] top-level scope at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:173
 [17] include(::Module, ::String) at ./Base.jl:377
 [18] exec_options(::Base.JLOptions) at ./client.jl:288
 [19] _start() at ./client.jl:484
in expression starting at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:172

My environment is

  [fbb218c0] BSON v0.2.6
  [3895d2a7] CUDAapi v4.0.0
  [634d3b9d] DrWatson v1.14.7
  [587475ba] Flux v0.11.0
  [82e4d734] ImageIO v0.3.0
  [6218d12a] ImageMagick v1.1.5
  [916415d5] Images v0.22.4
  [eb30cadb] MLDatasets v0.5.2
  [d96e819e] Parameters v0.12.1
  [92933f4c] ProgressMeter v1.3.2
  [899adc3e] TensorBoardLogger v0.1.10
  [e88e6eb3] Zygote v0.5.4

What could be the reason for this problem, and how can I fix it?

The model zoo hasn’t been updated for Flux 0.11 yet, and that version made some notable changes; most relevant for this use-case, the cross-entropy loss behaviour has changed. You could give the updated version in https://github.com/FluxML/model-zoo/pull/241 a try. That PR is still WIP though, so further tweaking will likely be required.
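To illustrate the loss change (a sketch with illustrative array names and shapes, not the actual PR diff): in Flux 0.11 the loss functions live in the `Flux.Losses` module and take an `agg` keyword to choose the reduction, instead of being broadcast element-wise as the old script does.

```julia
using Flux

# Illustrative shapes: 784 decoder logits × a batch of 32 samples.
ŷ = randn(Float32, 784, 32)   # decoder output (logits)
x = rand(Float32, 784, 32)    # target images

# Flux ≤ 0.10 style, which the deprecation warning (and this error) is about:
#   loss = sum(logitbinarycrossentropy.(ŷ, x))

# Flux 0.11 style: select the reduction via the `agg` keyword.
loss = Flux.Losses.logitbinarycrossentropy(ŷ, x; agg=sum)
```

The broadcast form is what triggers the deprecation path inside `model_loss`, and Zygote then fails to compile it on the GPU.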

Hi @ToucheSir,
Thanks for the reference. It did indeed remove the error, but the model doesn’t seem to train: the loss stays almost constant at 0.6880 across all epochs, and the plotted images appear as just a blank grey background.

I tried running on both GPU and CPU, with the same results. The only warning I encountered was

┌ Warning: `DataLoader(x...; kws...)` is deprecated, use `DataLoader(x; kws...)` instead.
│   caller = ip:0x0
└ @ Core :-1

which I don’t think has anything to do with the model failing to train.
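For completeness, the warning itself just asks for the arrays to be wrapped in a tuple instead of passed as separate positional arguments (stand-in arrays below, not the actual MNIST data):

```julia
using Flux.Data: DataLoader

xtrain = rand(Float32, 784, 1_000)   # stand-in for flattened MNIST images
ytrain = rand(Float32, 10, 1_000)    # stand-in labels

# Deprecated: splatting multiple arrays as positional arguments
#   loader = DataLoader(xtrain, ytrain; batchsize=128, shuffle=true)
# Current form: wrap them in a single tuple argument
loader = DataLoader((xtrain, ytrain); batchsize=128, shuffle=true)
```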

I would leave a comment on the PR mentioning that. As I noted above, the PR is still WIP and there may be outstanding bugs to be ironed out.

If you would like to dive into this yourself, try using Zygote.@showgrad to pinpoint any misbehaving gradients. The VAE loss formulation is notoriously finicky, and whatever you find will surely help improve the model-zoo implementation as well.
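For example (a minimal sketch on a toy function, not the VAE code itself):

```julia
using Zygote

# Zygote.@showgrad works like @show, but also prints the adjoint (gradient)
# flowing back through the annotated expression during the reverse pass.
f(x) = sum(abs2, Zygote.@showgrad(2 .* x))
gradient(f, [1.0, 2.0, 3.0])
```

Sprinkling that over the terms of `model_loss` should show where gradients go to zero. One more data point from the numbers you reported: a per-pixel binary cross-entropy of ≈ 0.688 is very close to -log(0.5) ≈ 0.693, which is exactly what you get when the decoder outputs p ≈ 0.5 for every pixel. That matches the uniform grey images and suggests the decoder is receiving no useful gradient signal at all.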