Problem with model and gradient descent in Flux

Hi. I have a problem with gradient descent in Flux.
Suppose I have a model like this:

model1 = Chain(x->Dense(5,5),relu)
model2 = Chain(x->model1(x), Dense(5,5), vec)
model3 = Chain(x->model1(x), Dense(5,5), vec)

And I wrap them in params:

parameters = Flux.params(model1,model2,model3)

And I use gradient descent:

gs = gradient(parameters) do
    loss1 = model2(x) - trueX
    loss2 = model3(y) - trueY
    loss = loss1 + loss2
end

My goal is to learn two things. Both are learned through model1, but I use model2 to output one of them and model3 to output the other. I am curious whether the code above can achieve this.
Thank you!

I don’t think that will work as expected:

  1. While you can pass functions to Chain, they will be opaque, i.e., Flux cannot see inside them to collect parameters. Further, your function x -> Dense(5, 5) only constructs a dense layer and never calls it!
    Simply use model1 = Dense(5 => 5, relu) or Chain(Dense(5 => 5), relu) instead.
  2. You can combine all your model parts into a single model:
    model = Chain(Dense(4 => 5, relu), # your model1
                  Parallel(tuple, # combine both model outputs into tuple
                           Dense(5 => 6),   # model2
                           Dense(5 => 7)))  # model3
    
    # Use as follows; note that I have changed the dimensions to make it clearer where each value comes from
    batch = rand(4, 8)
    size.(model(batch))  # will be ((6, 8), (7, 8))
    
    gradient(model) do m
        m2, m3 = m(batch)
        loss1 = m2 .- trueX
        loss2 = m3 .- trueY
        sum(vcat(loss1, loss2))
    end
    
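For completeness, here is a minimal sketch of how this could become a full update step with the explicit-gradient API (Flux.setup / Flux.update!); trueX and trueY are just placeholder targets with matching shapes, and Adam is only an example optimiser:

using Flux

model = Chain(Dense(4 => 5, relu),
              Parallel(tuple, Dense(5 => 6), Dense(5 => 7)))

batch = rand(Float32, 4, 8)
trueX = rand(Float32, 6, 8)   # placeholder target for the first head
trueY = rand(Float32, 7, 8)   # placeholder target for the second head

opt_state = Flux.setup(Adam(), model)

g = gradient(model) do m
    m2, m3 = m(batch)
    Flux.mse(m2, trueX) + Flux.mse(m3, trueY)   # scalar loss
end

Flux.update!(opt_state, model, g[1])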

Thank you!
I understand now. Your explanation is quite clear!

May I ask you another related question?
Suppose I have two separate models. I want to concatenate their outputs and feed the result into a new model. Code like this:

model1 = Chain(Dense(4 => 5),vec)
model2 = Chain(Dense(4 => 6),vec)
model3 = Chain(
    x->cat(model1(x),model2(x),dims=1),
    Dense(11 => 11),
    vec
)

Will this code work?
Thanks for your kind and patient reply!

Not if you want to train model1 and model2 as well. Again, their parameters will not be seen inside the function:

julia> model1 = Chain(Dense(4 => 5),vec)
Chain(
  Dense(4 => 5),                        # 25 parameters
  vec,
) 

julia> model2 = Chain(Dense(4 => 6),vec)
Chain(
  Dense(4 => 6),                        # 30 parameters
  vec,
) 

julia> model3 = Chain(
           x->cat(model1(x),model2(x),dims=1),
           Dense(11 => 11),
           vec
       )
Chain(
  var"#7#8"(),
  Dense(11 => 11),                      # 132 parameters
  vec,
) 

# model3 only has the parameters of the last Dense layer
julia> Dense(11 => 11)
Dense(11 => 11)     # 132 parameters

# Use Parallel again to combine the sub-models -- now the parameters are all visible
julia> model3 = Chain(
           Parallel((x,y)->cat(x,y,dims=1), model1, model2),
           Dense(11 => 11),
           vec
       )
Chain(
  Parallel(
    var"#11#12"(),
    Chain(
      Dense(4 => 5),                    # 25 parameters
      vec,
    ),
    Chain(
      Dense(4 => 6),                    # 30 parameters
      vec,
    ),
  ),
  Dense(11 => 11),                      # 132 parameters
  vec,
)                   # Total: 6 arrays, 187 parameters, 1.035 KiB.
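As a quick sanity check (just a sketch with a single random input, since the vec layers collapse the batch dimension), the gradient now contains entries for the sub-models as well:

x = rand(Float32, 4)

g = gradient(m -> sum(abs2, m(x)), model3)[1]

# the gradient mirrors the model structure: Parallel -> model1 -> Dense(4 => 5)
size(g.layers[1].layers[1].layers[1].weight)   # (5, 4)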

PS: I’m also not sure about the vec at the end of your model. In Flux, models usually work on batches of inputs, i.e.,

julia> m = Dense(4 => 5)
Dense(4 => 5)       # 25 parameters

julia> size(m(rand(4)))  # single input vector
(5,)

julia> size(m(rand(4, 8)))  # batch of 8 inputs
(5, 8)

# model with vec eliminates batch dimension
julia> size(model1(rand(4, 8)))
(40,)

Let me add that it is often more convenient to wrap the entire model inside a custom struct and define a forward pass, instead of using Chain and Parallel:

using Flux

struct Model{D1, D2}
    dense1::D1
    dense2::D2
end

Flux.@layer Model

function Model()
    return Model(
        Dense(4 => 5),
        Dense(4 => 6))
end

function (m::Model)(x)
    x1 = m.dense1(x)
    x2 = m.dense2(x)
    return cat(x1, x2, dims=1)
end

# x = rand(Float32, 4) # with no batch dimension
# y = rand(Float32, 11)
x = rand(Float32, 4, 5) # 5 examples in a batch
y = rand(Float32, 11, 5)
loss(model, x, y) = Flux.mse(model(x), y)

model = Model()
opt_state = Flux.setup(AdamW(), model)
g = gradient(model -> loss(model, x, y), model)[1]
Flux.update!(opt_state, model, g)
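To iterate this, a minimal sketch of a training loop built from the same pieces (the epoch count and the single (x, y) pair are only for illustration):

for epoch in 1:100
    g = gradient(m -> loss(m, x, y), model)[1]
    Flux.update!(opt_state, model, g)
end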

Thanks for the detailed instructions!
I think I’ve learnt how to do it correctly now.
Many thanks for your answer!

Thanks for your kind instructions!
I tried your code and it worked. It’s much more logical than my previous version.
They are very helpful!
Many thanks for your help!

Hi.
May I ask about a problem related to this?
Code like this:

#In one file, I defined a model like above
function m(parameters)
  model = Chain(Dense(4 => 5, relu), # your model1
                Parallel(tuple, # combine both model outputs into tuple
                         Dense(5 => 6),   # model2
                         Dense(5 => 7)))  # model3
  return f64(model) #return the model
end

I understand that, given an input, we can get the output of model2 by using m2, _ = model(input).
But I want to get model2 itself rather than its output,
because I want to pass that model to a loss function rather than a concrete number.
If I use the code you provided, I encounter errors.

#In another file, I include that model
loss1 = input -> loss(m2(input)) #loss is a function defined before. Input is a data type and not initialized.
m2,m3 = m(parameters) #Here I want to receive two models, I don't know whether it works for my target.
par = Flux.params(m2,m3)
gs = gradient(par)
    l1 = 0
    l2 = 0
    for data in datas
        l1 += loss1(m2(data))
        l2 += mse(m3(data), trueM3)
    allLoss = sum(vcat(l1,l2))
end
Flux.update!(opt, par, gs)

I cannot use your code for tasks like this.
Can you kindly give me some tips?
Thank you!

OK, as m is just a function, just change it to return what you need:

# From what I understand you want something like this
function m()  # the parameters argument was not used anyway
    model1 = Dense(4 => 5)  # shared between model2 and model3
    model2 = Chain(model1, Dense(5 => 6))
    model3 = Chain(model1, Dense(5 => 7))
    return model2, model3
end
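A quick check (just a sketch) that the first layer really is shared between the two returned chains:

model2, model3 = m()
model2[1] === model3[1]   # true: both chains contain the very same Dense(4 => 5) layer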

Maybe some further remarks:

  1. Are you sure you need to convert your models to Float64, i.e., f64(model)?
  2. As your two loss components l1, l2 are scalar, you can just add them, i.e., no need for sum(vcat(l1, l2)). Just do: l1 + l2.

Thanks for replying!
I got it. So my mistake in the first code:

model1 = Chain(x->Dense(5,5),relu)
model2 = Chain(x->model1(x), Dense(5,5), vec)
model3 = Chain(x->model1(x), Dense(5,5), vec)
parameters = Flux.params(model1,model2,model3)
gs = gradient(parameters) do
    loss1 = model2(x) - trueX
    loss2 = model3(y) - trueY
    loss = loss1 + loss2
end

The fix is just to change

model2 = Chain(x->model1(x), Dense(5,5), vec)

to

model2 = Chain(model1, Dense(5,5), vec)

And for

parameters = Flux.params(model2,model3)
gs = gradient(parameters) do
    loss1 = model2(x) - trueX
    loss2 = model3(y) - trueY
    loss = loss1 + loss2
end

It will backpropagate the loss and update parameters in model2, then model1, then model3, then model1?

Is my understanding correct?

For the model, yes, you got it; for the last part, not quite. What you have there just calculates the gradients, it doesn’t update the models. For that you need to use an optimizer.


Everything with Flux.params is deprecated and about to be removed. The answer above uses the new style, with Flux.setup. Your code fragment should become something like this:

opt23 = Flux.setup(Adam(), (model2, model3))

gs23 = gradient(model2, model3) do m2, m3
    loss1 = m2(x) - trueX
    loss2 = m3(y) - trueY
    loss = sum(vcat(loss1, loss2))  # gradient needs a scalar loss
end

Flux.update!(opt23, (model2, model3), gs23)

But how this fits into the whole thread above I’m not sure.


So, if I add the code

Flux.update!(opt, par, gs)

will it follow the order I mentioned, i.e., backpropagate the loss and update parameters in model2, then model1, then model3, then model1?

Thanks for helping!

I encountered a problem in my previous ML framework, which is based on Flux.params(): the loss did not change at all. No matter what learning rate I used, the loss stayed exactly the same. Could the deprecation of Flux.params() cause that?

And for the code:

Flux.update!(opt23, (model2, model3), gs23)

Will it follow the order of backpropagating the loss and updating parameters in model2, then model1, then model3, then model1?

Everything should be in one model object, for example a Flux.Chain or Flux.Parallel.

This earlier answer comes closest to what you seem to want: Problem with model and gradient descent in Flux - #2 by bertschi


Yes, I understand that answer. But I wonder about the order of the gradient computation.
For

Chain(Dense(4 => 5, relu), # your model1
              Parallel(tuple, # combine both model outputs into tuple
                       Dense(5 => 6),   # model2
                       Dense(5 => 7)))  # model3

Will we first backpropagate through the layer for model2, then the layer for model1, then the layer for model3, then the layer for model1 again?
Or
Will we first backpropagate through the layer for model2, then the layer for model3, then the layer for model1?

model2 and model3 first, then model1, because model1’s gradient depends on the gradients flowing back through model2 and model3.
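To make that concrete, a small sketch (with random data) showing that the shared layer’s gradient is the sum of the contributions coming back through the two heads:

using Flux

model = Chain(Dense(4 => 5, relu),
              Parallel(tuple, Dense(5 => 6), Dense(5 => 7)))
x = rand(Float32, 4, 8)

g2 = gradient(m -> sum(abs2, m(x)[1]), model)[1]   # only the model2 head
g3 = gradient(m -> sum(abs2, m(x)[2]), model)[1]   # only the model3 head
g  = gradient(m -> sum(abs2, m(x)[1]) + sum(abs2, m(x)[2]), model)[1]

# the shared Dense(4 => 5) receives the accumulated gradient from both branches
g.layers[1].weight ≈ g2.layers[1].weight + g3.layers[1].weight   # true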


Got it! Thanks!