Problems relating to "gradient is nothing"

Hi, I am doing machine learning using gradient descent.

# I have a set of data, called samples
# I have two models: model1 and model2
global par = Flux.params(model1, model2)  # wrap them in params
gs = gradient(par) do
    global l1 = mean(Flux.mse(model1(sample), true1) for sample in samples)
    global l2 = 0
    global count = 0
    global allLoss = 0
    for sample in samples
        nbSec = count ÷ 6
        true2 = label[nbSec+1, :, 1]
        global l2 += Flux.mse(model2(sample), true2)
        count += 1
    end
    l2 = l2 / length(samples)
    allLoss = l1 + l2
    println("l1: ", l1)            # prints 1.6994402448439135e15
    println("l2: ", l2)            # prints 3.004242981094107e17
    println("allLoss: ", allLoss)  # prints 3.021237383542546e17
end
Flux.update!(opt, par, gs)

As you can see, the outputs of l1 and l2 are Float64 numbers; they are scalars. However, when I call Flux.update!, it gives me this error:

ERROR: LoadError: Output should be scalar; gradients are not defined for output nothing

I wonder why I get this error.

In your do block, the returned value is the value of the last expression. In this case, you actually return the output of a call to println, which is nothing (and it has no gradient).
The last line of the do block needs to be return allLoss (or simply allLoss), and then it should work.
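
For example, a minimal sketch reusing the names from your snippet (only the l1 part, to keep it short; mean assumes `using Statistics`):

gs = gradient(par) do
    l1 = mean(Flux.mse(model1(s), true1) for s in samples)
    println("l1: ", l1)   # printing is fine ...
    return l1             # ... as long as the scalar loss is the value the block returns
end
Flux.update!(opt, par, gs)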

1 Like

Tangentially, I don’t think there is any need for all these globals in your snippet. You should consider removing them, since non-constant globals are disastrous for performance. See the related performance tip:
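
For example, a rough sketch with the loss wrapped in a (hypothetical) total_loss function, so nothing needs to be global; the label indexing copies your count ÷ 6 logic, and mean needs `using Statistics`:

function total_loss(model1, model2, samples, label, true1)
    # average loss of model1 against the single target true1
    l1 = mean(Flux.mse(model1(s), true1) for s in samples)
    # average loss of model2; every 6 samples share one row of label, as in your loop
    l2 = mean(Flux.mse(model2(s), label[(i - 1) ÷ 6 + 1, :, 1])
              for (i, s) in enumerate(samples))
    return l1 + l2
end

gs = gradient(() -> total_loss(model1, model2, samples, label, true1), par)
Flux.update!(opt, par, gs)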

1 Like

Note also that using Flux.params is discouraged, and it will be removed soon. The recommended style is something like this:

samples, label, true1, opt, model1, model2 = ...

opt_stats = Flux.setup(opt, (model1, model2))  # necessary for explicit mode

gs = gradient(model1, model2) do m1, m2  # explicit gradient w.r.t. the models
    l1 = mean(Flux.mse(m1(sample), true1) for sample in samples)
    l2 = 0.0
    cnt = 0
    for sample in samples
        nbSec = cnt ÷ 6
        true2 = label[nbSec+1,:,1]
        l2 += Flux.mse(m2(sample), true2)
        cnt += 1
    end
    l2 = l2 / length(samples)
    @show l1
    @show l2
    allLoss = l1 + l2
    @show allLoss  # unlike println, this also returns the value
end

Flux.update!(opt_stats, (model1, model2), gs)  # update the parameters within the models
1 Like

Yes. The problem is just the “println”. Thanks for helping!

1 Like

Yes. You are right. I deleted them and it works well. Thanks for helping!

Thanks for helping! I tried your method and it succeeded. But I do have another problem. If I execute code like this:

using DelimitedFiles
using InferOpt
using ProgressMeter
using Flux
using Gurobi
using JuMP
using Distributions
using Statistics
using JSON
using Dates
using Parameters

model1 = Chain(
    Dense(ones(3,3),true,relu)
)
model2 = Chain(
    x -> model1(x),
    Dense(ones(1,3),true,relu)
)
model3 = Chain(
    x -> model1(x),
    Dense(ones(1,3),true,relu)
)
A = [1,2,3]
B = [1,1,1]
label = [10]
opt = Adam(0.1)
opt_stats = Flux.setup(opt, (model2, model3))
for i in 1:5
    gs = gradient(model2, model3) do m2, m3
        l1 = Flux.mse(m2(A),label)
        l2 = Flux.mse(m3(A),label)
        allLoss = l1 + l2
        @show l1
        @show l2
        @show allLoss
    end
    Flux.update!(opt_stats, (model1,model2), gs)
end

I get this output:

l1 = 64.0
l2 = 64.0
allLoss = 128.0
l1 = 37.21000000305003
l2 = 64.0
allLoss = 101.21000000305003
l1 = 17.64000000420013
l2 = 64.0
allLoss = 81.64000000420013
l1 = 5.290000003450065
l2 = 64.0
allLoss = 69.29000000345006
l1 = 0.16000000080001742
l2 = 64.0
allLoss = 64.16000000080001

Why does my l2 never decrease?

You might be in a region of the parameter space where all relu activation functions are on the left side of zero, which means the gradients vanish. I assume this effect would disappear with another activation function like tanh (although this is not necessarily a wise switch). More generally, you’re only seeing this because your test case is very small and you only do a few iterations.
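
As a tiny illustration (my own example, not your model) of why a unit that ends up on relu's flat side stops learning:

using Flux

gradient(relu, -1.0)   # (0.0,)  zero gradient: updates cannot pull the unit back
gradient(relu,  1.0)   # (1.0,)  ordinary gradient on the positive side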

Two issues:

  1. You’re again using Chain(x -> model1(x), ...) instead of Chain(model1, ...) in your definitions. Therefore the parameters of model1 are not visible in model2/model3.
  2. You take the gradient gradient(model2, model3), i.e. w.r.t. model2 and model3, but then update model1 and model2 with Flux.update!(opt_stats, (model1,model2), gs). Thus, model3 is never updated.

To fix this, either change the definitions of model2 and model3 (i.e. fix 1.) and have gradient and update! work on model2 and model3 only (model1 is then implicitly part of the Chain in model2 and model3), or leave the model definitions as they are and have gradient and update! take care of all three models, i.e. model1, model2, model3.
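
For instance, a rough sketch of the first option, reusing A, label, opt and model1 from your snippet:

model2 = Chain(model1, Dense(ones(1, 3), true, relu))  # model1 is now a real layer, so its parameters are visible
model3 = Chain(model1, Dense(ones(1, 3), true, relu))

opt_stats = Flux.setup(opt, (model2, model3))
for i in 1:5
    gs = gradient(model2, model3) do m2, m3
        Flux.mse(m2(A), label) + Flux.mse(m3(A), label)
    end
    Flux.update!(opt_stats, (model2, model3), gs)  # update the same models the gradient was taken for
end

With this, model3 receives its own gradient, so l2 should start decreasing as well.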

2 Likes

Oh yes! I made a basic mistake.
Thanks for helping!!
But the framework seems to have some problems when using the GPU?

using CUDA
using Flux
using Random
Random.seed!(333)
model1 = Chain(
    Dense(3, 3) |> gpu
)

model2 = Chain(
    model1,
    Dense(ones(3,3),true,relu) |> gpu
)
model3 = Chain(
    model1,
    Dense(ones(3,3),true,relu) |> gpu
)
A = [1,2,3]
B = [1,1,1]
label = [10,10,10]
trainLoader1 = Flux.DataLoader((A, label), batchsize=64, shuffle=true) |> gpu
trainLoader2 = Flux.DataLoader((B, label), batchsize=64, shuffle=true) |> gpu
opt = Adam(0.1)
opt_stats = Flux.setup(opt, (model2, model3))
for i in 1:5
    global l1,l2
    gs = gradient(model2, model3) do m2, m3
        for (x, y) in trainLoader1
            l1 = Flux.mse(m2(x),y)
        end
        for (x, y) in trainLoader2
            l2 = Flux.mse(m3(x),y)
        end
        allLoss = l1 + l2
        @show l1
        @show l2
        @show allLoss
    end
    Flux.update!(opt_stats, (model2,model3), gs)
end

It gives this error:

ERROR: LoadError: MethodError: no method matching +(::@NamedTuple{contents::@NamedTuple{data::@NamedTuple{data::@NamedTuple{f::Nothing, data::Tuple{Vector{Float64}, Vector{Float64}}}, indices::Nothing}, batchsize::Nothing, count::Nothing, partial::Nothing}}, ::Base.RefValue{Any})

Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...)
   @ Base operators.jl:587
  +(::ChainRulesCore.ZeroTangent, ::Any)
   @ ChainRulesCore ~/.julia/packages/ChainRulesCore/6Pucz/src/tangent_arithmetic.jl:99
  +(::Any, ::ChainRulesCore.NotImplemented)
   @ ChainRulesCore ~/.julia/packages/ChainRulesCore/6Pucz/src/tangent_arithmetic.jl:25

It seems that I cannot use a tuple to wrap model2 and model3?

See also
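
In case it helps, here is a rough sketch of one common pattern (my own illustration, not necessarily what the linked post does): draw the batches from the loaders before calling gradient, so that Zygote only differentiates the loss itself and never sees the DataLoader or any globals:

for ((x1, y1), (x2, y2)) in zip(trainLoader1, trainLoader2)
    gs = gradient(model2, model3) do m2, m3
        Flux.mse(m2(x1), y1) + Flux.mse(m3(x2), y2)
    end
    Flux.update!(opt_stats, (model2, model3), gs)
end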

Did this solve your issue?

1 Like

Yes. That's what I realised before. I forgot to state it here.

1 Like