Flux.jl Inconsistent Training on Custom Architecture

I am currently implementing a toy version of the architecture from Deep learning for universal linear embeddings of nonlinear dynamics | Nature Communications, which is effectively an encoder → affine → decoder process.

I have already implemented the architecture and have been testing on single trajectory datasets. This is the struct I am employing as the model:

struct A
    encoder
    decoder
    K
end

where the encoder and decoder are MLPs and K is simply a matrix. To move my testing on to multiple trajectories, I am using this model:

struct B
    encoder
    decoder
    K::Vector
end

Here, K holds one of model A's K matrices per trajectory, so the two architectures should be equivalent when training on a single-trajectory dataset.

However, I am not seeing this expected behavior. On a single-trajectory dataset, A fits well while B does not converge at all, no matter the training time or parameters used; the loss stays nearly constant throughout training, only jittering slightly.

While attempting to print the norm of the gradient for B, I get an error which suggests to me that the gradient is empty. Here is an MWE:

using Flux
using LinearAlgebra

Xs = [rand(2,10)]

mutable struct B
    mlp
    K::Vector
end

function loss(Xs,model)
    l = 0
    for i in 1:length(Xs)
        l += norm(model.K[i])
    end
    return l
end

model = B(Dense(2, 2), [rand(2, 2)])
ps = params(model.mlp, model.K)           # collects the Dense's weights and bias plus the matrix in model.K
gs = gradient(() -> loss(Xs, model), ps)
norm(gs)                                  # this line throws

which gives the following error:

ERROR: MethodError: no method matching iterate(::Nothing)
Closest candidates are:
  iterate(::DataStructures.TrieIterator) at /Users/tylerhan/.julia/packages/DataStructures/ixwFs/src/trie.jl:112
  iterate(::DataStructures.TrieIterator, ::Any) at /Users/tylerhan/.julia/packages/DataStructures/ixwFs/src/trie.jl:112
  iterate(::Cmd) at process.jl:638
  ...
Stacktrace:
 [1] isempty(::Nothing) at ./essentials.jl:737
 [2] norm(::Nothing, ::Int64) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:605 (repeats 2 times)
 [3] generic_normInf(::Zygote.Grads) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:446
 [4] normInf at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:536 [inlined]
 [5] generic_norm2(::Zygote.Grads) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:477
 [6] norm2 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:538 [inlined]
 [7] norm(::Zygote.Grads, ::Int64) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:607
 [8] norm(::Zygote.Grads) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:605
 [9] top-level scope at REPL[9]:1

I am not certain of this, but perhaps this means that the second model’s gradient is empty? I do not get such errors for the first model.

I’d like to know whether there is a fix for this unexpected behavior, or whether there is a better way to implement the second model (I have looked at Flux’s “advanced model building” section, but it was not obvious to me how it would help).

I am currently using Flux v0.12.1.

Please see “Please read: make it easier to help you”. We can try to work out the issue by speculating about it, but the best way to get a resolution is to post a MWE that demonstrates it. For Flux stuff, that includes dummy data, the library versions used, and a full stacktrace of any relevant errors.

I see! Will do. Thanks for letting me know.

The description has been updated! Please let me know if there are other things I could add which might help.

Can you also provide an end-to-end, minimal running example that throws that error (i.e. a MWE)? As-is there’s nothing inherent in the definition of either struct that would cause an issue, so having that full context is important.

Edited. Thank you for bearing with me!

I have found a workaround by simply augmenting K, but for future reference I would still like to know what exactly is not permissible here.

gradient with implicit parameters (i.e. what params returns) returns a Grads struct. This is not an array but more of a bag of arrays. To calculate the norm of that, you’d have to iterate through it and compute the norm of each element. Note that this also requires checking for nothing, as that’s the value used for parameters that aren’t involved in the gradient calculation.
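For example, a minimal sketch (assuming gs and ps are the Grads and Params objects from your MWE, and that parameters with a nothing gradient should simply be skipped):

using LinearAlgebra

function gradnorm(gs, ps)
    total = 0.0
    for p in ps
        g = gs[p]                  # gradient for parameter p, or nothing
        g === nothing && continue  # skip parameters that don't appear in the loss
        total += norm(g)^2
    end
    return sqrt(total)
end

gradnorm(gs, ps)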

As for your workaround, I’m not sure what “augmenting K” means. Could you post an updated example with this augmentation that works as you’d expect?

I see, but I don’t get an error involving nothing for the first model. And in the second model, all the parameters should be used for the gradient.

By “augmenting K”, I just mean simply stacking the corresponding K matrices, as in:

newK = vcat(oldK...)

and I identify the relevant K matrices in the loss function through slicing. I suppose this workaround confirms that there is something I don’t understand about wrapping them in a Vector.
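Concretely, the workaround looks something like this (just a sketch; the sizes are for illustration only):

Ks   = [rand(2, 2) for _ in 1:3]   # one 2×2 K per trajectory
newK = vcat(Ks...)                 # 6×2 matrix stacking them vertically
K_i(newK, i) = newK[2i-1:2i, :]    # slice out trajectory i's K inside the loss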

I was referring to the MWE, in which only model.K is used. Incidentally, the gradient there is nothing because of an interaction between implicit parameters (i.e. the thing you get from params and pass to gradient) and the AD (Zygote). Cf.:

julia> using Zygote, LinearAlgebra

julia> X = [rand(2, 2)]
1-element Vector{Matrix{Float64}}:
 [0.6202008269430048 0.8555679159356662; 0.8423362289177463 0.09771425479926421]

# 1. this doesn't work
julia> gradient(() -> norm(X[1]), Params(X)).grads
IdDict{Any, Any} with 2 entries:
  [0.620201 0.855568; 0.842336 0.0977143] => nothing  # gradient wrt. X[1]
  :(Main.X)                               => Union{Nothing, Matrix{Float64}}[[0.45775 0.631467; 0.621701 0.0721198]]

# 2. but this does
julia> gradient(() -> norm(X[1]), Params([X])).grads
IdDict{Any, Any} with 2 entries:
  :(Main.X)                                 => Union{Nothing, Matrix{Float64}}[[0.45775 0.631467; 0.621701 0.0721198]]
  [[0.620201 0.855568; 0.842336 0.0977143]] => Union{Nothing, Matrix{Float64}}[[0.45775 0.631467; 0.621701 0.0721198]]

# 3. as does this
julia> x₁ = X[1]
2×2 Matrix{Float64}:
 0.620201  0.855568
 0.842336  0.0977143

julia> gradient(() -> norm(x₁), Params(X)).grads
IdDict{Any, Any} with 2 entries:
  [0.620201 0.855568; 0.842336 0.0977143] => [0.45775 0.631467; 0.621701 0.0721198]
  :(Main.x₁)                              => [0.45775 0.631467; 0.621701 0.0721198]

# 4. and this does (note: explicit instead of implicit parameters,
# i.e. we pass X directly instead of using params(X)).
# This works with full Flux models too!
julia> gradient(x -> norm(x[1]), X)[1]
1-element Vector{Union{Nothing, Matrix{Float64}}}:
 [0.4577503206426886 0.6314672132598408; 0.6217013298363201 0.07211975463850043]

The gist is that params (and Params, if given a single argument) splat their arguments into an underlying IdDict:

julia> params(X).order[1]
2×2 Matrix{Float64}:  # this is X[1], where you'd expect it to be X itself
 0.620201  0.855568
 0.842336  0.0977143

For whatever reason, Zygote isn’t smart enough to link the X[1] in the loss to the actual value of X[1] in the params. You can see I avoid this in 2) and 3) above: by stopping X from being unravelled, and by hoisting X[1] into a variable, respectively.
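Applied to your MWE (where only model.K enters the loss), either of the following should give a non-nothing gradient for the K matrix. This is just a sketch reusing the model from above; K1 and gK are names I made up:

# Option 1: hoist the matrix into a variable, as in 3) above
K1 = model.K[1]
ps = params(model.mlp, model.K)       # ps contains the same matrix object as K1
gs = gradient(() -> norm(K1), ps)
gs[K1]                                # a 2×2 gradient matrix instead of nothing

# Option 2: explicit parameters, as in 4) above
gK = gradient(K -> sum(norm, K), model.K)[1]
gK[1]                                 # gradient with respect to model.K[1]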