Flux - sigmoid in last layer destroys learning?

I’m fairly new to ML, but my understanding is that inner layers should generally use relu, and the final layer should use whatever function constrains the output to the range you want: in my case, sigmoid to constrain it between 0 and 1. But that sigmoid seems to be causing trouble.

I threw together something simple: a 40×40 array with pixels set to 0.25 and a couple set to 0.75. The target to learn is the same array with values of 0 and 1, respectively.

Without any activation functions in the toy network (a smaller version of the complex network I actually want to train and am troubleshooting), it gets pretty close to the target after about 1000 iterations.

If I add relu to the first and second layers and leave the last as identity, it reaches the same loss after 400 iterations.

If I then add sigmoid to the last layer, training completely breaks down. I end up with hot pixels at each corner of the network output, and loss plateaus early.

What is happening here? And are there any tools I should be aware of that would make it easy for me to inspect what’s happening and understand it myself?

using Flux
using Plots

function train_test_network(; niters = 100)

    # input: a 0.25 background with two pixels set to 0.75
    a = fill(0.25f0, 40, 40, 1, 1)
    a[4, 4, 1, 1] = 0.75f0
    a[10, 10, 1, 1] = 0.75f0

    # target: 0 background, 1 at those same two pixels
    target = zeros(Float32, size(a)...)
    target[4, 4, 1, 1] = 1f0
    target[10, 10, 1, 1] = 1f0

    network = Chain(
        Conv((3, 3), 1 => 48, relu; pad = (1, 1)),
        Conv((3, 3), 48 => 48, relu; pad = (1, 1)),
        Conv((1, 1), 48 => 1, σ)    # the final sigmoid in question
    )

    opt = Adam()
    opt_state = Flux.setup(opt, network)
    for i ∈ 1:niters
        Flux.train!(network, ((a, target),), opt_state) do m, x, y
            y1 = m(x)
            loss = Flux.mse(y1, y)
            return loss
        end

        # every 50 iterations, report the loss and plot the current output
        if (i - 1) % 50 == 0
            y1 = network(a)
            loss = Flux.mse(y1, target)
            display(heatmap(y1[:, :, 1, 1], aspectratio = 1))
            @info loss
        end
    end
    return network
end


You are using a single example to train the network!
And in your example there is no spatial correlation/structure, so rather than convolutional layers you should use dense layers.

For example (using BetaML where I am more at ease):

using BetaML, Statistics

a                    = fill(0.25, 40*40)
a[[4*4,10*10]]      .= 0.75
target               = zeros(40*40)
target[[4*4,10*10]] .= 1.0

layers     = [DenseLayer(1,3,f=relu), DenseLayer(3,1,f=sigmoid)]
nnm        = NeuralNetworkEstimator(layers=layers,batch_size=1)
a          = makematrix(a)
target     = makematrix(target)
target_hat = fit!(nnm, a, target)

l2 = mean((target - target_hat).^2) #1.39e-10

target_hat[4*4,1]   # 0.999989
target_hat[4*4+1,1] # 1.116e-5
target_hat[10*10,1] # 0.999989

I am aware it’s using a single example; it’s a toy to illustrate the problem. I have to use convolutions in the real, more complex network this is a miniature of.

Please focus on the specific question I asked and don’t try to offer different approaches.


I think the issue is that σ maps to (0, 1), so the gradients are vanishing: the sigmoid never outputs exactly 0, but your target is 0.
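You can check the saturation numerically: the derivative of the sigmoid is σ(x)·(1 − σ(x)), so the further the pre-activation has to be pushed to drive the output toward 0 or 1, the smaller the gradient through the last layer gets. A quick sketch:

using Flux   # σ is the logistic sigmoid re-exported by Flux

dσ(x) = σ(x) * (1 - σ(x))   # derivative of the sigmoid

# the slope peaks at 0.25 around x = 0 and collapses once σ(x) nears 0 or 1
for x in (0f0, 2f0, 5f0, 10f0)
    println((x = x, out = σ(x), slope = dσ(x)))
end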

Changing the target to something in the output range of sigmoid works for me:

target = zeros(Float32, size(a)...) .+ 0.01f0

I can get it working by generating 100 random datasets, each being a 40×40 array of numbers that are either 0.25 or 0.75. The targets are obtained by mapping 0.25 to 0 and 0.75 to 1. After training is complete, the model gives a reasonably good prediction for @BioTurboNick’s original test data. Full code here:

using Flux
using Plots

function train_test_network(; niters = 20)

    # 100 random 40×40 inputs of 0.25/0.75, with targets 0/1 respectively
    a = rand((0.25f0, 0.75f0), 40, 40, 1, 100)
    target = map(x -> x < 0.5 ? 0.0f0 : 1.0f0, a)

    network = Chain(
        Conv((3, 3), 1 => 48, relu; pad = (1, 1)),
        Conv((3, 3), 48 => 48, relu; pad = (1, 1)),
        Conv((1, 1), 48 => 1, σ)
    )

    loader = Flux.DataLoader((a, target), batchsize=20, shuffle=true)
    opt = Adam()
    opt_state = Flux.setup(opt, network)
    for i ∈ 1:niters
        for (x, y) in loader
            # train on the current mini-batch from the DataLoader
            Flux.train!(network, ((x, y),), opt_state) do m, xb, yb
                y1 = m(xb)
                loss = Flux.mse(y1, yb)
                @show loss
                return loss
            end
        end

    end
    return network
end

network = train_test_network()
test_data = fill(0.25f0, 40, 40, 1, 1)
test_data[4, 4, 1, 1] = 0.75f0
test_data[10, 10, 1, 1] = 0.75f0
prediction = network(test_data)
heatmap(prediction[:, :, 1, 1], aspectratio = 1)

Here’s the heatmap for the prediction:
[prediction heatmap]

I believe the padding of the CNN is not ideal, since all the data consist of 0.25s and 0.75s. You should probably pad with one of these values rather than the default 0.
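Flux’s Conv only supports zero padding directly, but you can pad the input yourself before unpadded convolutions. A rough, untested sketch assuming NNlib’s pad_constant (pad_background is just a helper name here; check the pad/dims arguments against your NNlib version):

using Flux, NNlib

# pad the two spatial dims by 1 on each side with the background value 0.25
# (NNlib also has pad_reflect / pad_repeat for mirror-style padding)
pad_background(x) = NNlib.pad_constant(x, (1, 1, 1, 1), 0.25f0; dims = (1, 2))

network = Chain(
    pad_background,
    Conv((3, 3), 1 => 48, relu),      # no pad keyword: padding is done above
    pad_background,
    Conv((3, 3), 48 => 48, relu),
    Conv((1, 1), 48 => 1, σ)
)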


Vanishing means the gradients should get extremely small, right?

That doesn’t seem to be what I’m seeing. Starting from the exact same network weights and biases (Random.seed!(113) before creating the network), the gradients calculated for the version with sigmoid are larger by several orders of magnitude, and often have a different sign.
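Roughly the comparison I’m doing, as a sketch (same toy arrays as in my first post):

using Flux, Random, LinearAlgebra

# toy data from the first post
a = fill(0.25f0, 40, 40, 1, 1)
a[4, 4, 1, 1] = 0.75f0
a[10, 10, 1, 1] = 0.75f0
target = zeros(Float32, size(a)...)
target[4, 4, 1, 1] = 1f0
target[10, 10, 1, 1] = 1f0

# same architecture and initialization, differing only in the final activation
function toy_network(final_act)
    Random.seed!(113)
    Chain(
        Conv((3, 3), 1 => 48, relu; pad = (1, 1)),
        Conv((3, 3), 48 => 48, relu; pad = (1, 1)),
        Conv((1, 1), 48 => 1, final_act)
    )
end

# gradient of the MSE loss with respect to all parameters
grad(m) = Flux.gradient(nn -> Flux.mse(nn(a), target), m)[1]

g_id = grad(toy_network(identity))
g_σ  = grad(toy_network(σ))

# compare e.g. the gradients of the final layer's weights
norm(g_id.layers[3].weight), norm(g_σ.layers[3].weight)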

And FWIW, your suggestion didn’t really work that well. For about 900 iterations it produces nonsense, and then it finally produces something that is close, but the high values are much too low.

My goal here is to understand why it’s behaving the way it is, and how design choices in NNs impact training. At the moment I’m wondering why anyone uses a sigmoid, instead of just training with identity and then clamping the outputs in production.

I have been wondering about the padding myself. Apparently padding modes aren’t built into Flux yet, though, so when I just let the convolutions crop, I end up with a paradoxical inverted result, which then suddenly flips to the right orientation after ~3000 iterations and slowly adjusts to not quite the right values after 10,000 iterations.

I’m guessing the latter behavior is because the gradients are so small near 0 and 1? I don’t think I understand the earlier behavior though.


Probably someone with expertise in image segmentation can offer more insight. From what little I know, image segmentation requires classifying every pixel (e.g. object vs. background), so it is similar to the models discussed here.

Yeah, but the issue is also that your upper target was 1.

For example, doing

     target = zeros(Float32, size(a)...) .+ 0.1f0
     target[4, 4, 1, 1] = 1f0 - 0.1f0
     target[10, 10, 1, 1] = 1f0 - 0.1f0

     ....
     opt = Adam(0.003)

takes the boundary detour first but then eventually converges quite well.

If you are close to the limits of the sigmoid, the gradient should be relatively small. And since most target values are small, I believe Adam has trouble converging on your single spikes.

What could also help is normalizing the target values.

If you want to map to [0, 1], then a clipped ReLU should be totally fine too, and would probably work much better here.
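For example, something along these lines (a sketch; clipped_relu is just a hand-written clamp, not a built-in Flux activation):

using Flux

clipped_relu(x) = clamp(x, 0f0, 1f0)   # identity on [0, 1], flat outside

network = Chain(
    Conv((3, 3), 1 => 48, relu; pad = (1, 1)),
    Conv((3, 3), 48 => 48, relu; pad = (1, 1)),
    Conv((1, 1), 48 => 1, clipped_relu)   # instead of σ
)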

Built-in Layers · Flux? Or were you looking for something else?

Thanks!

I have to think about this some more. I’m adapting a PyTorch model that has a sigmoid in the final layer for an output between 0 and 1, where ideally most values should be 0 and a few should be 1, so I’m very curious why it worked for them.

Yeah, padding values (e.g. mirroring), not size. Thanks though!


PyTorch’s Conv2d doesn’t do any padding by default, so it would help if you could post the equivalent Python code you’re trying to adapt. Otherwise it’s hard to make an apples-to-apples comparison.


Sure, I’m just trying to improve my understanding here generally. FWIW they just used the default zero padding mode, so that apparently should work. I only wondered about the padding because the model I’m troubleshooting lights up all the edges and nothing else at the moment.