Flux model-zoo: Error running vae_mnist.jl


I am trying to run the vae_mnist.jl script from the model zoo (https://github.com/FluxML/model-zoo/blob/master/vision/vae_mnist/vae_mnist.jl). However, I am getting the following error. (I use optirun to enable the GPU, but I get the same error when running on the CPU.)

$ optirun ~/Applications/julia/release/bin/julia --project=. vae_mnist.jl 
[ Info: Training on GPU
┌ Warning: `DataLoader(x...; kws...)` is deprecated, use `DataLoader(x; kws...)` instead.
│   caller = ip:0x0
└ @ Core :-1
[ Info: Start Training, total 20 epochs
[ Info: Epoch 1
┌ Warning: logitbinarycrossentropy.(ŷ, y) is deprecated, use Losses.logitbinarycrossentropy(ŷ, y, agg=identity) instead
└ @ Flux ~/.julia/packages/Flux/IjMZL/src/deprecations.jl:16
ERROR: LoadError: Compiling Tuple{typeof(Base.Broadcast.broadcasted),typeof(logitbinarycrossentropy),CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,2}}: try/catch is not supported.
 [1] error(::String) at ./error.jl:33
 [2] instrument(::IRTools.Inner.IR) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/reverse.jl:89
 [3] #Primal#20 at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/reverse.jl:170 [inlined]
 [4] Zygote.Adjoint(::IRTools.Inner.IR; varargs::Nothing, normalise::Bool) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/reverse.jl:283
 [5] _lookup_grad(::Type{T} where T) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/emit.jl:101
 [6] #s2925#1323 at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/interface2.jl:39 [inlined]
 [7] #s2925#1323(::Any, ::Any, ::Any) at ./none:0
 [8] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any,N} where N) at ./boot.jl:526
 [9] model_loss at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:62 [inlined]
 [10] (::typeof(∂(model_loss)))(::Float32) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/interface2.jl:0
 [11] #9 at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:140 [inlined]
 [12] (::typeof(∂(λ)))(::Float32) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/interface2.jl:0
 [13] (::Zygote.var"#54#55"{Zygote.Params,Zygote.Context,typeof(∂(λ))})(::Float32) at /home/vish/.julia/packages/Zygote/seGHk/src/compiler/interface.jl:177
 [14] train(; kws::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:142
 [15] train() at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:92
 [16] top-level scope at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:173
 [17] include(::Module, ::String) at ./Base.jl:377
 [18] exec_options(::Base.JLOptions) at ./client.jl:288
 [19] _start() at ./client.jl:484
in expression starting at /home/vish/Documents/vlabs/repos/julia-scripts/vae-flux/02/vae_mnist.jl:172

My environment is

  [fbb218c0] BSON v0.2.6
  [3895d2a7] CUDAapi v4.0.0
  [634d3b9d] DrWatson v1.14.7
  [587475ba] Flux v0.11.0
  [82e4d734] ImageIO v0.3.0
  [6218d12a] ImageMagick v1.1.5
  [916415d5] Images v0.22.4
  [eb30cadb] MLDatasets v0.5.2
  [d96e819e] Parameters v0.12.1
  [92933f4c] ProgressMeter v1.3.2
  [899adc3e] TensorBoardLogger v0.1.10
  [e88e6eb3] Zygote v0.5.4

What could be the reason for this problem, and how can I fix it?

The model zoo hasn’t been updated for Flux 0.11 yet, and that version made some notable changes; most relevant for this use-case, the cross-entropy loss behaviour has changed. You could give the updated version in https://github.com/FluxML/model-zoo/pull/241 a try. That PR is still WIP though, so further tweaking will likely be required.
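To illustrate the loss change (a sketch with illustrative array names and shapes, not the actual PR diff): in Flux 0.11 the loss functions live in the `Flux.Losses` module and take an `agg` keyword to choose the reduction, instead of being broadcast element-wise as the old script does.

```julia
using Flux

# Illustrative shapes: 784 decoder logits × a batch of 32 samples.
ŷ = randn(Float32, 784, 32)   # decoder output (logits)
x = rand(Float32, 784, 32)    # target images

# Flux ≤ 0.10 style, which the deprecation warning (and this error) is about:
#   loss = sum(logitbinarycrossentropy.(ŷ, x))

# Flux 0.11 style: select the reduction via the `agg` keyword.
loss = Flux.Losses.logitbinarycrossentropy(ŷ, x; agg=sum)
```

The broadcast form is what triggers the deprecation path inside `model_loss`, and Zygote then fails to compile it on the GPU.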

Hi @ToucheSir,
Thanks for the reference. It did indeed remove the error, but the model doesn’t seem to train: the loss stays almost constant at 0.6880 across all epochs, and the plotted images appear as just a blank grey background.

I tried running on both GPU and CPU, with the same results. The only warning I encountered was

┌ Warning: `DataLoader(x...; kws...)` is deprecated, use `DataLoader(x; kws...)` instead.
│   caller = ip:0x0
└ @ Core :-1

which I don’t think has anything to do with the model failing to train.
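For completeness, the warning itself just asks for the arrays to be wrapped in a tuple instead of passed as separate positional arguments (stand-in arrays below, not the actual MNIST data):

```julia
using Flux.Data: DataLoader

xtrain = rand(Float32, 784, 1_000)   # stand-in for flattened MNIST images
ytrain = rand(Float32, 10, 1_000)    # stand-in labels

# Deprecated: splatting multiple arrays as positional arguments
#   loader = DataLoader(xtrain, ytrain; batchsize=128, shuffle=true)
# Current form: wrap them in a single tuple argument
loader = DataLoader((xtrain, ytrain); batchsize=128, shuffle=true)
```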

I would leave a comment on the PR mentioning that. As I noted above, the PR is still WIP and there may be outstanding bugs to be ironed out.

If you would like to dive into this yourself, try using Zygote.@showgrad to pinpoint any misbehaving gradients. The VAE loss formulation is notoriously finicky, and whatever you find will surely help improve the model-zoo implementation as well.
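For example (a minimal sketch on a toy function, not the VAE code itself):

```julia
using Zygote

# Zygote.@showgrad works like @show, but also prints the adjoint (gradient)
# flowing back through the annotated expression during the reverse pass.
f(x) = sum(abs2, Zygote.@showgrad(2 .* x))
gradient(f, [1.0, 2.0, 3.0])
```

Sprinkling that over the terms of `model_loss` should show where gradients go to zero. One more data point from the numbers you reported: a per-pixel binary cross-entropy of ≈ 0.688 is very close to -log(0.5) ≈ 0.693, which is exactly what you get when the decoder outputs p ≈ 0.5 for every pixel. That matches the uniform grey images and suggests the decoder is receiving no useful gradient signal at all.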