Why does calculating gradients from Params give different results than doing it directly?

I’ve encountered a case where two ways of calculating gradients yield different results.

Here is an excerpt from my code, excluding the actual loss function.


julia> x, y=get_batch(4)
([("1c944ea1a3ae18a8df9f411d3c66827c8f26a0c76835deefe5c27acaf37f510d", "092fe98b5be67c2a426f8dc8b6242f339f4701099ec6cf44257b2828804d5854"), ("69ec244c61b4edc0118f6c5dc115dc0911729774d9848e81f4e903b824321d29", "2d19a989f1097ffe35458c49a4c48d3257148e84860961cec4482cdde87681a9"), ("7c79e8c4f90532b8754b7cb6499a070fa70aad5211266043847a0ec593061753", "bbf8173740f461e201356d30d1fe0b0a2fb07ff67c829124bc2702dd189bd855"), ("2b4971006d4b0b546b923503e5b7962e9f7b2531f9580ce9062b38b034b10c80", "5afaddd9c7c2671bdf549dcda7123921da83e0c0e68fbfd7f0cd6c7402a2c7f6")], [2, 2, 1, 1])

julia> θ=Params([weight_pars, sigma_pars])  # Our model
Params([[0.11504418023074425, 0.28379663653151455], [10.88, 54.4]])

julia> loss(x, y, weight_pars, sigma_pars) # the loss
0.4343795875304357

julia> gs = gradient(() -> loss(x, y, weight_pars, sigma_pars), θ)
Grads(...)

julia> gs[weight_pars],gs[sigma_pars] # One version of gradients
([-0.1021944543969813, 0.4054565903538764], [0.028360913853822473, 0.020073262552095503])


julia> gradient((w, s) -> loss(x, y, w, s), weight_pars, sigma_pars) # The other version of gradients
([-0.04534501818695343, 0.2169181270035139], [0.014180456926911236, 0.010036631276047751])

The gradients w.r.t. sigma_pars differ by a factor of 2. The gradients w.r.t. weight_pars also differ, but not by a clean factor.

Shouldn’t I get the same gradients regardless of the syntax? Why do I get different values?

I could post my loss function if that would help.

I’m using [e88e6eb3] Zygote v0.4.17 and [587475ba] Flux v0.10.4 on Julia 1.4.1.

The most likely answer is that the loss function is “double dipping”: it accesses the parameters both through the arguments passed to loss and directly from global scope. Params accounts for the parameters wherever they are used, whereas the explicit-argument gradient only treats the weight_pars and sigma_pars passed as arguments as contributing to the gradient, and any use of those arrays via global scope is ignored.
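Here is a minimal sketch of that pattern (loss_dd and the numbers are made up for illustration; it assumes Zygote is loaded):

using Zygote

weight_pars = [0.1, 0.3]

loss_dd(w) = sum(w .* weight_pars)  # uses the argument w AND the global weight_pars

# Params tracks every use of the array, so this differentiates sum(w .* w):
gs = gradient(() -> loss_dd(weight_pars), Params([weight_pars]))
gs[weight_pars]                     # 2 .* weight_pars == [0.2, 0.6]

# The explicit-argument form only differentiates the argument slot; the
# global use is treated as a constant, so the gradient is just weight_pars:
gradient(loss_dd, weight_pars)      # ([0.1, 0.3],)

Note the factor of 2 between the two, just like your sigma_pars gradients.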


Can you please elaborate a bit more? Which version do you think is correct?

If I define a fake loss function like

loss(x, y, weight_pars, sigma_pars) = log(sum(weight_pars .* weight_pars) + sum(sigma_pars .* sigma_pars))

I find that both ways of computing the gradient agree.

For example, I think you’re doing something like

x = 2
f(y) = x*y
gradient(f, x)

“Correct” is a definitional issue here, since there are two different ways to think about the function’s input. If we tweak x itself, to x = 2 + ϵ, the function is effectively x*x and the gradient is 2x = 4. If we instead tweak only the unnamed input to f, we get f(x + ϵ) = x*(x + ϵ) and the gradient is x = 2. Params asks for the former and plain gradient asks for the latter. Params is probably correct in the sense that it’s what you intended here.

This obviously only comes up because x is both a global variable and a function argument. So the easiest fix, assuming this is the issue, is to make sure your code consistently uses weight_pars and co either via global scope or via an explicit function argument, but not both ways.
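As a sketch of the consistent version (loss2 is a placeholder, modelled on the fake loss you posted): once the body touches only its own arguments, the two styles agree:

using Zygote

weight_pars = [0.11504418023074425, 0.28379663653151455]
sigma_pars  = [10.88, 54.4]

loss2(w, s) = log(sum(w .* w) + sum(s .* s))  # no global access inside the body

# Implicit (Params) gradient:
gs = gradient(() -> loss2(weight_pars, sigma_pars), Params([weight_pars, sigma_pars]))
gs[weight_pars], gs[sigma_pars]

# Explicit-argument gradient gives the same numbers:
gradient(loss2, weight_pars, sigma_pars)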


After a lot of poking around in my code, I think I found the problem. My loss function uses

function loss(x, y, pars)
# ...
   return sum([el_prediction(el, param) for el in x])
end

instead of

function loss(x, y, pars)
# ...
   mysum::Float64 = 0
   for el in x
      mysum += el_prediction(el, param)
   end
   return mysum
end

I wish such behaviour were documented, or at least threw an error…

If it’s not what I suggested, and you can fix this just by changing a sum over a generator to a loop, then you should open an issue – that clearly shouldn’t make a difference, and it suggests a bug in Zygote’s adjoints somewhere (probably not passing Params through when differentiating a generator, or something similar).
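For reference, a minimal check along those lines might look like this (a sketch only: pred, w and xs are made-up stand-ins for el_prediction and your data). Both calls should return the same gradient; if they don’t in your setup, that’s the reproduction to include in the issue:

using Zygote

w  = [2.0, 3.0]
xs = [1.0, 2.0, 3.0]

pred(el, w) = w[1] * el + w[2]  # stand-in for el_prediction

loss_comprehension(w) = sum([pred(el, w) for el in xs])

function loss_loop(w)
    s = 0.0
    for el in xs
        s += pred(el, w)
    end
    return s
end

gradient(loss_comprehension, w)  # expected ([6.0, 3.0],)
gradient(loss_loop, w)           # expected ([6.0, 3.0],)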