Parameters not updating in Flux

I have the following custom layer.

# A layer that stores a function f and applies it element-wise.
struct Activation{F}
    f::F
    Activation(f::Function) = new{typeof(f)}(f)
end
(m::Activation)(x::AbstractArray) = m.f.(x)

I test whether it allows a model to be trained using the following code

using Flux

function test_training(model, x, y)
    opt = Descent(0.1)
    loss = Flux.Losses.mse
    losses = Vector{Float32}(undef, 2)
    for i = 1:2
        local loss_val
        ps = Flux.Params(Flux.params(model))
        gs = gradient(ps) do
            predicted = model(x)
            loss_val = loss(predicted, y)
        end
        losses[i] = loss_val
        Flux.Optimise.update!(opt, ps, gs)
    end
    # If the parameters were updated, the second loss must differ from the first.
    if losses[1] == losses[2]
        error("Parameters not updating.")
    end
    return nothing
end

x = ones(Float32, 4, 4, 1, 1)
y = ones(Float32, 4, 4, 2, 1)
model = Chain(Conv((3, 3), 1 => 2, pad=SamePad()), Activation(tanh))
test_training(model, x, y)

The model does get trained on my machine (Windows), but for some reason fails on all operating systems when tested in GitHub Actions.

The following type-unstable variant of the custom layer does work in GitHub Actions.

# Type-unstable variant: f::Function is an abstract field, so calls through m.f cannot be inferred.
struct Activation
    f::Function
end
(m::Activation)(x::AbstractArray) = m.f.(x)

I checked the Julia and Flux versions on GitHub Actions and they are the same as on my machine.

Does anyone have any ideas on what is happening here? A comment on whether it works on your machine or not would also help.

EDIT:

Changing opt = Descent(0.1) to opt = ADAM() allows the checks to pass. However, I still do not know why it fails when using Descent.

Have you checked whether ps contains your parameters? I think you are missing the registration of your model with Flux.functor?

See this for example
functor(::Type{<:Chain}, c) = c.layers, ls -> Chain(ls...)

This is the link I wanted to copy-paste.

Sorry, I am on an iPad.
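
For a custom layer, the usual route is the Flux.@functor macro rather than a hand-written functor method. A minimal sketch, assuming the Activation{F} struct from the first post is already defined:

using Flux

# Register the layer so Flux can walk its fields. Activation has no
# trainable parameters of its own, so this mainly matters when it is
# nested inside other layers.
Flux.@functor Activation

model = Chain(Conv((3, 3), 1 => 2, pad=SamePad()), Activation(tanh))
Flux.params(model)  # should still list the Conv weight and bias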

I did test with @functor, even though the parameters were successfully obtained on my machine, but it did not resolve the issue.

What exactly is the failure you’re running into here? Is it an error? If so, can you post a full stacktrace + MWE with @functor? If not, can you explain in detail what the failure mode on GitHub Actions looks like?

The error condition in test_training (the function in the first post) gets triggered, meaning that my custom layer somehow prevented the model from training.

Sorry that I cannot help more; I am on vacation until Monday, away from a real computer.

There are a few things I do not understand. Your example is not complete, as there is no model defined. I also wonder why you define an inner constructor in the first place; you do not need one to have a type-stable model. Can you add the definition of the model?

It is okay, you are trying to help and I appreciate it :slight_smile:

The model is defined in the second code block. You need to scroll down to see it.

I use such a struct because it allows me to define activation layers as Activation(f::Function) instead of x -> f(x).

If I do not use the inner constructor, then the resulting layer is not type-stable.

julia> @code_warntype Activation(tanh)(x)
Variables
  m::Activation
  x::Array{Float32, 4}

Body::Any
1 ─ %1 = Base.getproperty(m, :f)::Function
│   %2 = Base.broadcasted(%1, x)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{4}, Nothing, _A, Tuple{Array{Float32, 4}}} where _A
│   %3 = Base.materialize(%2)::Any
└──      return %3
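
For reference, the instability above comes from the abstract f::Function field rather than from dropping the inner constructor; a parametric struct with the default constructor should infer just as well. A minimal sketch (in a fresh session, to avoid redefining the struct):

struct Activation{F}
    f::F
end
(m::Activation)(x::AbstractArray) = m.f.(x)

# The default constructor already captures the concrete function type:
Activation(tanh) isa Activation{typeof(tanh)}  # true

# so @code_warntype on Activation(tanh)(x) should now report a concrete
# Array{Float32, 4} return type instead of Any.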

This doesn’t help with understanding what’s going on, but why not simply use tanh as the activation function of the Conv layer?

I need separate activation layers for my project.

When it comes to troubleshooting, I would look at the following:

  1. What are the gradients for each of the params? If they’re nothing or zero, then we should look at the AD side of things (a quick way to inspect them is sketched after this list).
  2. Relatedly, when losses[1] != losses[2] locally, what is the actual difference? It may be that a nondeterministic version (e.g. something threaded or from XNNPack) is being picked up locally because of available artifacts, CPU feature detection, # of threads, etc. but not on the CI machine.
  3. Does the behaviour persist with a larger learning rate? A 3x3 conv that maps 1 => 2 channels does not have a lot of parameters, so this should be easy to test.
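
For the first point, something like this (a sketch, reusing the ps and gs names from test_training above) would show whether any gradient is nothing or all zeros:

# inside the loop of test_training, right after the gradient call:
for p in ps
    g = gs[p]
    println(summary(p), " => ", g === nothing ? "no gradient" : sum(abs, g))
end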

Edit: I just tested locally and got no error.

I am trying to define a custom, type-stable layer that will allow me to define activations as Activation(f::Function). I tested whether my custom layer worked with Flux by running the test_training(model, x, y) function, which runs the model with my layer for two iterations and checks whether it trains. The model did train on my machine, but failed to do so on GitHub Actions, triggering the error("Parameters not updating.") call in my testing function. I am now trying to understand why that happened. After some tinkering I found that the type-unstable variant of the Activation struct does not trigger the error. Changing opt from Descent to ADAM also allows the model to be trained.

I am comparing losses because if the model parameters got updated, then the losses must be different. It is easier than comparing the parameters directly.

I am going to try using more channels to see if this makes a difference, as well as increasing the learning rate.
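
That said, if one did want to compare the parameters directly, a before/after snapshot of one update step works too. A sketch, assuming the model, x, and y from the first post:

using Flux

before = deepcopy(collect(Flux.params(model)))

ps = Flux.params(model)
gs = gradient(() -> Flux.Losses.mse(model(x), y), ps)
Flux.Optimise.update!(Descent(0.1), ps, gs)

after = collect(Flux.params(model))
any(a != b for (a, b) in zip(before, after))  # true if anything changed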

Sure, but it won’t help you root-cause this. Please follow the points above and see what outputs you get from GitHub Actions. If you aren’t already doing so, I’d also run versioninfo() and Pkg.status() as part of the CI run to double-check that you’re on the right Julia and package versions. If possible, link the runs here so we can look at the full output too.
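
For example, a couple of lines near the top of runtests.jl will print both (versioninfo lives in InteractiveUtils outside the REPL):

using InteractiveUtils, Pkg

versioninfo()   # Julia version, OS, CPU and thread count on the CI machine
Pkg.status()    # exact package versions resolved for the test environment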

Any reason for the activation layer not to be a plain function? It seems like if things are pure functions (without parameters) then there are no parameters to update, and a regular function would do the trick as well.

This might not be the problem you are seeing, but in the MWE, Activation only takes arrays, while Flux layers always broadcast the activation function, so when used inside a Flux layer it would only see scalars.
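
If the layer should also work as the activation argument of a built-in layer such as Conv, a scalar method could be added alongside the array one. A sketch, assuming the Activation struct from the first post:

# Array method: used when Activation is its own layer in a Chain.
(m::Activation)(x::AbstractArray) = m.f.(x)

# Scalar method: used when the callable is passed as the activation of a
# Flux layer, which broadcasts it element-wise.
(m::Activation)(x::Number) = m.f(x)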

Since I have some other layers tested in the same file as Activation, I tried removing them to see whether it would make a difference. In that case, for some reason, the model parameters do get updated. Here are the links to the successful CI run and runtests.jl file: CI, file; and the failed one: CI, file.

I am not sure what to think of the result. Quirks of GitHub Actions?

Based on the CI output and order of operations in the second file, it seems like test_training is failing on this line and not the Activation test. That would also explain why moving the Activation test to another file helped: it was throwing an error before it even got there!

As a side note, I’d recommend looking through Unit Testing · The Julia Language for how to structure test suites and use Julia’s built-in testing functionality. That would’ve helped you figure out which test set was failing instead of having to hunt through the entire file.
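
A minimal sketch of what that structure could look like here (the test-set names are hypothetical, and Activation and test_training are assumed to be in scope):

using Test, Flux

@testset "FluxExtra layers" begin
    @testset "Activation" begin
        model = Chain(Conv((3, 3), 1 => 2, pad=SamePad()), Activation(tanh))
        x = ones(Float32, 4, 4, 1, 1)
        y = ones(Float32, 4, 4, 2, 1)
        @test test_training(model, x, y) === nothing
    end
    # ... one @testset per layer, each constructing its own model ...
end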

Thanks for the link! The @testset macro will be very useful here.

Could you tell me how you ended up with line 48? Based on the stack trace, as I see it, a different line is to blame:

[4] top-level scope @ ~/work/FluxExtra.jl/FluxExtra.jl/test/runtests.jl:142

which tests the activation layer:

[3] test(model::Chain{Tuple{Conv{2, 4, typeof(identity), Array{Float32, 4}, Vector{Float32}}, Activation{typeof(tanh)}}}, x::Array{Float32, 4}, y::Array{Float32, 4}) @ Main ~/work/FluxExtra.jl/FluxExtra.jl/test/runtests.jl:35

My mistake, I thought the @info logging was unconditional for some reason. In that case, my working theory would be that all the updates from the previous calls to test_training leave the weights of test_layer at some kind of local minimum or saddle point by the time it reaches the activation test. That could also explain why using Adam fixes the problem, as its additional momentum and adaptive updates are meant to help with escaping such points.

My recommendation is to re-initialize (i.e. re-declare) test_layer and all the other mutable state currently shared between tests for every test. It is very cheap to do so and will help prevent tests from interfering with one another, as may be happening here.
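
Concretely, a small helper that builds a brand-new model on every call keeps the weights fresh for each test (fresh_activation_model is a hypothetical name, and x and y are the test inputs from the first post):

# Hypothetical helper: freshly initialized weights on every call, so no
# state is shared between test sets.
fresh_activation_model() = Chain(Conv((3, 3), 1 => 2, pad=SamePad()), Activation(tanh))

test_training(fresh_activation_model(), x, y)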
