Problems relating to "gradient is nothing"

Hi, I am doing machine learning using gradient descent.

# I have a set of data, called samples
# I have two models: model1 and model2
global par = Flux.params(model1, model2)  # wrap them in params
gs = gradient(par) do
    global l1 = mean(Flux.mse(model1(sample), true1) for sample in samples)
    global l2 = 0
    global count = 0
    global allLoss = 0
    for sample in samples
        nbSec = count ÷ 6
        true2 = label[nbSec+1, :, 1]
        global l2 += Flux.mse(model2(sample), true2)
        count += 1
    end
    l2 = l2 / length(samples)
    allLoss = l1 + l2
    println("l1: ", l1)            # prints 1.6994402448439135e15
    println("l2: ", l2)            # prints 3.004242981094107e17
    println("allLoss: ", allLoss)  # prints 3.021237383542546e17
end
Flux.update!(opt, par, gs)

As you can see, the outputs of l1 and l2 are Float64 numbers; they are scalars. However, when I call Flux.update!, it gives me this error:

ERROR: LoadError: Output should be scalar; gradients are not defined for output nothing

I wonder why I get this error.

In your do block, the returned value is the value of the last expression. In this case, you actually return the output of a call to println, which is nothing (and it has no gradient).
The last line of the do block needs to be return allLoss (or simply allLoss), and then it should work.
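
For example, a minimal sketch reusing the names from your snippet (only the l1 part, to keep it short; mean assumes `using Statistics`):

gs = gradient(par) do
    l1 = mean(Flux.mse(model1(s), true1) for s in samples)
    println("l1: ", l1)   # printing is fine ...
    return l1             # ... as long as the scalar loss is the value the block returns
end
Flux.update!(opt, par, gs)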

1 Like

Tangentially, I don’t think there is any need for all these globals in your snippet. You should consider removing them, since non-constant globals are disastrous for performance. See the related performance tip:
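
For example, a rough sketch with the loss wrapped in a (hypothetical) total_loss function, so nothing needs to be global; the label indexing copies your count ÷ 6 logic, and mean needs `using Statistics`:

function total_loss(model1, model2, samples, label, true1)
    # average loss of model1 against the single target true1
    l1 = mean(Flux.mse(model1(s), true1) for s in samples)
    # average loss of model2; every 6 samples share one row of label, as in your loop
    l2 = mean(Flux.mse(model2(s), label[(i - 1) ÷ 6 + 1, :, 1])
              for (i, s) in enumerate(samples))
    return l1 + l2
end

gs = gradient(() -> total_loss(model1, model2, samples, label, true1), par)
Flux.update!(opt, par, gs)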

1 Like

Note also that using Flux.params is discouraged, and it will be removed soon. The recommended style is something like this:

samples, label, true1, opt, model1, model2 = ...

opt_stats = Flux.setup(opt, (model1, model2))  # necessary for explicit mode

gs = gradient(model1, model2) do m1, m2  # explicit gradient w.r.t. the models
    l1 = mean(Flux.mse(m1(sample), true1) for sample in samples)
    l2 = 0.0
    cnt = 0
    for sample in samples
        nbSec = cnt ÷ 6
        true2 = label[nbSec+1,:,1]
        l2 += Flux.mse(m2(sample), true2)
        cnt += 1
    end
    l2 = l2 / length(samples)
    @show l1
    @show l2
    allLoss = l1 + l2
    @show allLoss  # unlike println, this also returns the value
end

Flux.update!(opt_stats, (model1, model2), gs)  # update the parameters within the models
1 Like

Yes. The problem is just the “println”. Thanks for helping!

1 Like

Yes. You are right. I deleted them and it works well. Thanks for helping!

Thanks for helping! I tried your method and it succeeded. But I do have another problem. If I execute code like this:

using DelimitedFiles
using InferOpt
using ProgressMeter
using Flux
using Gurobi
using JuMP
using Distributions
using Statistics
using JSON
using Dates
using Parameters

model1 = Chain(
    Dense(ones(3,3),true,relu)
)
model2 = Chain(
    x -> model1(x),
    Dense(ones(1,3),true,relu)
)
model3 = Chain(
    x -> model1(x),
    Dense(ones(1,3),true,relu)
)
A = [1,2,3]
B = [1,1,1]
label = [10]
opt = Adam(0.1)
opt_stats = Flux.setup(opt, (model2, model3))
for i in 1:5
    gs = gradient(model2, model3) do m2, m3
        l1 = Flux.mse(m2(A),label)
        l2 = Flux.mse(m3(A),label)
        allLoss = l1 + l2
        @show l1
        @show l2
        @show allLoss
    end
    Flux.update!(opt_stats, (model1,model2), gs)
end

I get this output:

l1 = 64.0
l2 = 64.0
allLoss = 128.0
l1 = 37.21000000305003
l2 = 64.0
allLoss = 101.21000000305003
l1 = 17.64000000420013
l2 = 64.0
allLoss = 81.64000000420013
l1 = 5.290000003450065
l2 = 64.0
allLoss = 69.29000000345006
l1 = 0.16000000080001742
l2 = 64.0
allLoss = 64.16000000080001

Why does my l2 never decrease?

You might be in a region of the parameter space where all relu activation functions are on the left side of zero, which means the gradients vanish. I assume this effect would disappear with another activation function like tanh (although this is not necessarily a wise switch). More generally, you’re only seeing this because your test case is very small and you only do a few iterations.
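
As a tiny illustration (my own example, not your model) of why a unit that ends up on relu's flat side stops learning:

using Flux

gradient(relu, -1.0)   # (0.0,)  zero gradient: updates cannot pull the unit back
gradient(relu,  1.0)   # (1.0,)  ordinary gradient on the positive side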

Two issues:

  1. You’re again using Chain(x -> model1(x), ...) instead of Chain(model1, ...) in your definitions. Therefore the parameters of model1 are not visible in model2/model3.
  2. You take the gradient gradient(model2, model3), i.e. w.r.t. model2 and model3, but then update model1 and model2 with Flux.update!(opt_stats, (model1,model2), gs). Thus, model3 is never updated.

To fix this, either change the definitions of model2 and model3 (i.e. fix 1.) and have gradient and update! work on model2 and model3 only (model1 is then implicitly part of the Chain in model2 and model3), or leave the model definitions as they are and have gradient and update! take care of all three models, i.e. model1, model2, model3.
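
For instance, a rough sketch of the first option, reusing A, label, opt and model1 from your snippet:

model2 = Chain(model1, Dense(ones(1, 3), true, relu))  # model1 is now a real layer, so its parameters are visible
model3 = Chain(model1, Dense(ones(1, 3), true, relu))

opt_stats = Flux.setup(opt, (model2, model3))
for i in 1:5
    gs = gradient(model2, model3) do m2, m3
        Flux.mse(m2(A), label) + Flux.mse(m3(A), label)
    end
    Flux.update!(opt_stats, (model2, model3), gs)  # update the same models the gradient was taken for
end

With this, model3 receives its own gradient, so l2 should start decreasing as well.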

2 Likes

Oh yes! I made a basic mistake.
Thanks for helping!!
But the framework seems to have some problems when using the GPU?

using CUDA
using Flux
using Random
Random.seed!(333)
model1 = Chain(
    Dense(3, 3) |> gpu
)

model2 = Chain(
    model1,
    Dense(ones(3,3),true,relu) |> gpu
)
model3 = Chain(
    model1,
    Dense(ones(3,3),true,relu) |> gpu
)
A = [1,2,3]
B = [1,1,1]
label = [10,10,10]
trainLoader1 = Flux.DataLoader((A, label), batchsize=64, shuffle=true) |> gpu
trainLoader2 = Flux.DataLoader((B, label), batchsize=64, shuffle=true) |> gpu
opt = Adam(0.1)
opt_stats = Flux.setup(opt, (model2, model3))
for i in 1:5
    global l1,l2
    gs = gradient(model2, model3) do m2, m3
        for (x, y) in trainLoader1
            l1 = Flux.mse(m2(x),y)
        end
        for (x, y) in trainLoader2
            l2 = Flux.mse(m3(x),y)
        end
        allLoss = l1 + l2
        @show l1
        @show l2
        @show allLoss
    end
    Flux.update!(opt_stats, (model2,model3), gs)
end

It gives this error:

ERROR: LoadError: MethodError: no method matching +(::@NamedTuple{contents::@NamedTuple{data::@NamedTuple{data::@NamedTuple{f::Nothing, data::Tuple{Vector{Float64}, Vector{Float64}}}, indices::Nothing}, batchsize::Nothing, count::Nothing, partial::Nothing}}, ::Base.RefValue{Any})

Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...)
   @ Base operators.jl:587
  +(::ChainRulesCore.ZeroTangent, ::Any)
   @ ChainRulesCore ~/.julia/packages/ChainRulesCore/6Pucz/src/tangent_arithmetic.jl:99
  +(::Any, ::ChainRulesCore.NotImplemented)
   @ ChainRulesCore ~/.julia/packages/ChainRulesCore/6Pucz/src/tangent_arithmetic.jl:25

It seems that I cannot use a tuple to wrap model2 and model3?

See also
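
In case it helps, here is a rough sketch of one common pattern (my own illustration, not necessarily what the linked post does): draw the batches from the loaders before calling gradient, so that Zygote only differentiates the loss itself and never sees the DataLoader or any globals:

for ((x1, y1), (x2, y2)) in zip(trainLoader1, trainLoader2)
    gs = gradient(model2, model3) do m2, m3
        Flux.mse(m2(x1), y1) + Flux.mse(m3(x2), y2)
    end
    Flux.update!(opt_stats, (model2, model3), gs)
end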

Did this solve your issue?

1 Like

Yes. That's what I realised before. I forgot to state it here.

1 Like