Best way to prevent AD from differentiating through a useless (zero-gradient) function

I want to minimize a function of the form:
x = \text{argmin}_x \min_a \| f(x)\, a - y \|^2

Since this is quadratic in a, the optimum over a is known in closed form:
a(x) = (f(x)^T f(x))^{-1} f(x)^T y
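To make this step explicit, setting the derivative with respect to a to zero gives the normal equations

\frac{\partial}{\partial a} \| f(x)\, a - y \|^2 = 2\, f(x)^T \left( f(x)\, a - y \right) = 0 \;\Longrightarrow\; f(x)^T f(x)\, a = f(x)^T y,

from which the closed form above follows.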

so we minimize
x = \text{argmin}_x \| f(x)\, a(x) - y \|^2
To compute the gradient of this objective, we do not need to propagate derivatives through a(x), because that contribution is zero.
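Spelling out the reasoning with L(x, a) = \| f(x)\, a - y \|^2, the chain rule gives

\frac{\mathrm{d}}{\mathrm{d}x} L(x, a(x)) = \frac{\partial L}{\partial x}\bigg|_{a = a(x)} + \frac{\partial L}{\partial a}\bigg|_{a = a(x)} \, \frac{\partial a(x)}{\partial x},

and the second term vanishes because a(x) minimizes L(x, \cdot), so \frac{\partial L}{\partial a} = 0 there. Only the explicit dependence on x through f(x) contributes to the gradient.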

What is the best way to encode this knowledge, which the AD framework cannot guess on its own?

For now, I use the @nograd macro of Zygote, but then I’m stuck with this AD framework. Here is my MWE:

using LinearAlgebra, Zygote, ForwardDiff, Mooncake
using DifferentiationInterface, BenchmarkTools  # needed for the gradient calls and @btime timings below

const n = 1000
const k = range(0, 10, length=n) .* 2π
y = 0.1 .* sin.(k .+ 2.5) .+ 0.1 * randn(n) .+ randn(1)
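# least-squares coefficients: solves the normal equations (m' * m) * a = m' * y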
function lin_solve(m, y)
	return (m' * m) \ m' * y
end

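# same solve, but declared non-differentiable so Zygote skips its pullback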
Zygote.@nograd function lin_solve_nograd(m, y)
	return (m' * m) \ m' * y
end

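# objective norm(f(x) * a(x) - y) with design matrix f(x) = [sin.(k .+ x) ones(n)]
# (the norm is not squared here, but the minimizer is the same)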
function objective(x)
	m = hcat(sin.(k .+ x), ones(n))
	α = lin_solve(m, y)
	return norm(m * α .- y)
end

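# same objective, but using the @nograd version of the solve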
function objective_nograd(x)
	m = hcat(sin.(k .+ x), ones(n))
	α = lin_solve_nograd(m, y)
	return norm(m * α .- y)
end

backendZ = AutoZygote()
backendFD = AutoForwardDiff()
backendM = AutoMooncake(; config=nothing)

x = [1.0]
prep = DifferentiationInterface.prepare_gradient(objective, backendM, x)  # preparation object used in the Mooncake timings below

Here are the timings:

julia> @btime DifferentiationInterface.gradient(objective, backendZ, x)
  65.315 μs (145 allocations: 283.08 KiB)
1-element Vector{Float64}:
 -0.06677430082653298

julia> @btime DifferentiationInterface.gradient(objective_nograd, backendZ, x)
  39.641 μs (100 allocations: 139.55 KiB)
1-element Vector{Float64}:
 -0.06677430082653381

julia> @btime DifferentiationInterface.gradient(objective, backendFD, x)
  57.628 μs (33 allocations: 118.28 KiB)
1-element Vector{Float64}:
 -0.06677430082653303

julia> @btime DifferentiationInterface.gradient(objective_nograd, backendFD, x)
  57.569 μs (33 allocations: 118.28 KiB)
1-element Vector{Float64}:
 -0.06677430082653303


julia> @btime DifferentiationInterface.gradient(objective, prep,backendM, x)
  232.240 μs (273 allocations: 251.92 KiB)
1-element Vector{Float64}:
 -0.06677430082653296

julia> @btime DifferentiationInterface.gradient(objective_nograd, prep,backendM, x)
  235.357 μs (273 allocations: 251.92 KiB)
1-element Vector{Float64}:
 -0.06677430082653296

The @nograd macro of Zygote is quite useful, but unfortunately it is not seen by the other AD frameworks.
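A more backend-agnostic option might be to declare the solve non-differentiable at the ChainRules level rather than with the Zygote-specific macro. This is only a sketch I have not benchmarked: it assumes ChainRulesCore.jl is loaded, and it only helps backends that consult ChainRules rules (Zygote does; ForwardDiff does not, and as far as I understand Mooncake only uses ChainRules rules if they are imported explicitly, e.g. via Mooncake.@from_rrule).

using ChainRulesCore

# Tell ChainRules-aware reverse-mode backends that lin_solve contributes
# nothing to the gradient: its pullback returns no tangents, so the backend
# does not differentiate through the linear solve.
ChainRulesCore.@non_differentiable lin_solve(m, y)

With this rule in place, the plain objective should behave like objective_nograd under Zygote, without tying the code to a single AD framework.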

I’m not sure I understand what you mean by that, and I’m a bit puzzled by the two versions of the gradient returning the same value. a(x) does depend on x, doesn’t it?

a(x) does depend on x, but since \frac{\partial L}{\partial a}\big|_{a = a(x)} = 0 with L(x, a) = \| f(x)\, a - y \|^2, whatever \frac{\partial a(x)}{\partial x} is, that part of the gradient cancels, leaving only the part involving \frac{\partial f(x)}{\partial x}.
This kind of situation comes up quite often when one wants to “marginalize” these linear coefficients a.
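As a quick numerical sanity check (a sketch reusing k, n, y and objective from the MWE above; x0, m0, α0 and frozen are names I introduce only for this check): freeze a at its value for the current x and differentiate only through f(x). The resulting gradient should match the full one at that point.

using LinearAlgebra, ForwardDiff

x0 = [1.0]
m0 = hcat(sin.(k .+ x0), ones(n))
α0 = (m0' * m0) \ (m0' * y)   # a(x0), held fixed below

# same objective, but with the linear coefficients frozen at a(x0)
frozen(x) = norm(hcat(sin.(k .+ x), ones(n)) * α0 .- y)

ForwardDiff.gradient(frozen, x0)   # agrees with ForwardDiff.gradient(objective, x0)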