Why does calculating gradients from Params give different results than doing it directly?

I’ve encountered a case where two ways of calculating gradients yield different results.

Here is an excerpt from my code, excluding the actual loss function.


julia> x, y=get_batch(4)
([("1c944ea1a3ae18a8df9f411d3c66827c8f26a0c76835deefe5c27acaf37f510d", "092fe98b5be67c2a426f8dc8b6242f339f4701099ec6cf44257b2828804d5854"), ("69ec244c61b4edc0118f6c5dc115dc0911729774d9848e81f4e903b824321d29", "2d19a989f1097ffe35458c49a4c48d3257148e84860961cec4482cdde87681a9"), ("7c79e8c4f90532b8754b7cb6499a070fa70aad5211266043847a0ec593061753", "bbf8173740f461e201356d30d1fe0b0a2fb07ff67c829124bc2702dd189bd855"), ("2b4971006d4b0b546b923503e5b7962e9f7b2531f9580ce9062b38b034b10c80", "5afaddd9c7c2671bdf549dcda7123921da83e0c0e68fbfd7f0cd6c7402a2c7f6")], [2, 2, 1, 1])

julia> θ=Params([weight_pars, sigma_pars])  # Our model
Params([[0.11504418023074425, 0.28379663653151455], [10.88, 54.4]])

julia> loss(x, y, weight_pars, sigma_pars) # the loss
0.4343795875304357

julia> gs = gradient(() -> loss(x, y, weight_pars, sigma_pars), θ)
Grads(...)

julia> gs[weight_pars],gs[sigma_pars] # One version of gradients
([-0.1021944543969813, 0.4054565903538764], [0.028360913853822473, 0.020073262552095503])


julia> gradient((w, s) -> loss(x, y, w, s), weight_pars, sigma_pars) # The other version of gradients
([-0.04534501818695343, 0.2169181270035139], [0.014180456926911236, 0.010036631276047751])

The gradients w.r.t. sigma_pars differ by a factor of 2. The gradients w.r.t. weight_pars also differ, but not by a clean factor.

Shouldn’t I get the same gradients regardless of the syntax? Why do I get different values?

I could post my loss function if that would help.

I’m using [e88e6eb3] Zygote v0.4.17 and [587475ba] Flux v0.10.4 on Julia 1.4.1.

The most likely answer is that the loss function is “double dipping”: it accesses the parameters both through the arguments passed to loss and directly from global scope. Params accounts for the parameters wherever they are used, whereas the explicit-argument gradient only treats the weight_pars and sigma_pars passed as arguments as contributing to the gradient, and any use of those arrays via global scope is ignored.
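Here is a minimal sketch of that pattern (loss_dd and the numbers are made up for illustration; it assumes Zygote is loaded):

using Zygote

weight_pars = [0.1, 0.3]

loss_dd(w) = sum(w .* weight_pars)  # uses the argument w AND the global weight_pars

# Params tracks every use of the array, so this differentiates sum(w .* w):
gs = gradient(() -> loss_dd(weight_pars), Params([weight_pars]))
gs[weight_pars]                     # 2 .* weight_pars == [0.2, 0.6]

# The explicit-argument form only differentiates the argument slot; the
# global use is treated as a constant, so the gradient is just weight_pars:
gradient(loss_dd, weight_pars)      # ([0.1, 0.3],)

Note the factor of 2 between the two, just like your sigma_pars gradients.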


Can you please elaborate a bit more? Which version do you think is correct?

If I define a fake loss function like

loss(x, y, weight_pars, sigma_pars) = log(sum(weight_pars .* weight_pars) + sum(sigma_pars .* sigma_pars))

I find that both ways of computing the gradient agree.

For example, I think you’re doing something like

x = 2
f(y) = x*y
gradient(f, x)

“Correct” is a definitional issue here, since there are two different ways to think about the function’s input. If we tweak x itself, to x = 2 + ϵ, the function is effectively x*x and the gradient is 2x = 4. If we instead tweak only the unnamed input to f, we get f(x + ϵ) = x*(x + ϵ) and the gradient is x = 2. Params asks for the former and plain gradient asks for the latter. Params is probably correct in the sense that it’s what you intended here.

This obviously only comes up because x is both a global variable and a function argument. So the easiest fix, assuming this is the issue, is to make sure your code consistently uses weight_pars and co either via global scope or via an explicit function argument, but not both ways.
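As a sketch of the consistent version (loss2 is a placeholder, modelled on the fake loss you posted): once the body touches only its own arguments, the two styles agree:

using Zygote

weight_pars = [0.11504418023074425, 0.28379663653151455]
sigma_pars  = [10.88, 54.4]

loss2(w, s) = log(sum(w .* w) + sum(s .* s))  # no global access inside the body

# Implicit (Params) gradient:
gs = gradient(() -> loss2(weight_pars, sigma_pars), Params([weight_pars, sigma_pars]))
gs[weight_pars], gs[sigma_pars]

# Explicit-argument gradient gives the same numbers:
gradient(loss2, weight_pars, sigma_pars)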


After a lot of poking around in my code, I think I found the problem. My loss function uses

function loss(x, y, pars)
# ...
   return sum([el_prediction(el, param) for el in x])
end

instead of

function loss(x, y, pars)
# ...
   mysum::Float64 = 0
   for el in x
      mysum += el_prediction(el, param)
   end
   return mysum
end

I wish such behaviour were documented, or at least threw an error…

If it’s not what I suggested, and you can fix this just by changing a sum over a generator to a loop, then you should open an issue – that clearly shouldn’t make a difference, and it suggests a bug in Zygote’s adjoints somewhere (probably not passing Params through when differentiating a generator, or something similar).
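For reference, a minimal check along those lines might look like this (a sketch only: pred, w and xs are made-up stand-ins for el_prediction and your data). Both calls should return the same gradient; if they don’t in your setup, that’s the reproduction to include in the issue:

using Zygote

w  = [2.0, 3.0]
xs = [1.0, 2.0, 3.0]

pred(el, w) = w[1] * el + w[2]  # stand-in for el_prediction

loss_comprehension(w) = sum([pred(el, w) for el in xs])

function loss_loop(w)
    s = 0.0
    for el in xs
        s += pred(el, w)
    end
    return s
end

gradient(loss_comprehension, w)  # expected ([6.0, 3.0],)
gradient(loss_loop, w)           # expected ([6.0, 3.0],)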