Best way to prevent AD from differentiating through a useless (zero-gradient) function

I want to minimize a function of the form:
x = \text{argmin}_x \min_a \| f(x)\, a - y \|^2

Since this is quadratic in a, the optimum over a is known in closed form:
a(x) = (f(x)^T f(x))^{-1} f(x)^T y
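To make this step explicit, setting the derivative with respect to a to zero gives the normal equations

\frac{\partial}{\partial a} \| f(x)\, a - y \|^2 = 2\, f(x)^T \left( f(x)\, a - y \right) = 0 \;\Longrightarrow\; f(x)^T f(x)\, a = f(x)^T y,

from which the closed form above follows.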

so we minimize
x = \text{argmin}_x \| f(x)\, a(x) - y \|^2
To compute the gradient of this objective, we do not need to propagate derivatives through a(x), because that contribution is zero.
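Spelling out the reasoning with L(x, a) = \| f(x)\, a - y \|^2, the chain rule gives

\frac{\mathrm{d}}{\mathrm{d}x} L(x, a(x)) = \frac{\partial L}{\partial x}\bigg|_{a = a(x)} + \frac{\partial L}{\partial a}\bigg|_{a = a(x)} \, \frac{\partial a(x)}{\partial x},

and the second term vanishes because a(x) minimizes L(x, \cdot), so \frac{\partial L}{\partial a} = 0 there. Only the explicit dependence on x through f(x) contributes to the gradient.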

What is the best way to encode this knowledge, which the AD framework cannot guess on its own?

For now, I use the @nograd macro of Zygote, but then I’m stuck with this AD framework. Here is my MWE:

using LinearAlgebra, Zygote, ForwardDiff, Mooncake
using DifferentiationInterface, BenchmarkTools  # needed for the gradient calls and @btime timings below

const n = 1000
const k = range(0, 10, length=n) .* 2π
y = 0.1 .* sin.(k .+ 2.5) .+ 0.1 * randn(n) .+ randn(1)
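# least-squares coefficients: solves the normal equations (m' * m) * a = m' * y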
function lin_solve(m, y)
	return (m' * m) \ m' * y
end

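# same solve, but declared non-differentiable so Zygote skips its pullback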
Zygote.@nograd function lin_solve_nograd(m, y)
	return (m' * m) \ m' * y
end

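# objective norm(f(x) * a(x) - y) with design matrix f(x) = [sin.(k .+ x) ones(n)]
# (the norm is not squared here, but the minimizer is the same)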
function objective(x)
	m = hcat(sin.(k .+ x), ones(n))
	α = lin_solve(m, y)
	return norm(m * α .- y)
end

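# same objective, but using the @nograd version of the solve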
function objective_nograd(x)
	m = hcat(sin.(k .+ x), ones(n))
	α = lin_solve_nograd(m, y)
	return norm(m * α .- y)
end

backendZ = AutoZygote()
backendFD = AutoForwardDiff()
backendM = AutoMooncake(; config=nothing)

x = [1.0]
prep = DifferentiationInterface.prepare_gradient(objective, backendM, x)  # preparation object used in the Mooncake timings below

Here are the timings:

julia> @btime DifferentiationInterface.gradient(objective, backendZ, x)
  65.315 μs (145 allocations: 283.08 KiB)
1-element Vector{Float64}:
 -0.06677430082653298

julia> @btime DifferentiationInterface.gradient(objective_nograd, backendZ, x)
  39.641 μs (100 allocations: 139.55 KiB)
1-element Vector{Float64}:
 -0.06677430082653381

julia> @btime DifferentiationInterface.gradient(objective, backendFD, x)
  57.628 μs (33 allocations: 118.28 KiB)
1-element Vector{Float64}:
 -0.06677430082653303

julia> @btime DifferentiationInterface.gradient(objective_nograd, backendFD, x)
  57.569 μs (33 allocations: 118.28 KiB)
1-element Vector{Float64}:
 -0.06677430082653303


julia> @btime DifferentiationInterface.gradient(objective, prep,backendM, x)
  232.240 μs (273 allocations: 251.92 KiB)
1-element Vector{Float64}:
 -0.06677430082653296

julia> @btime DifferentiationInterface.gradient(objective_nograd, prep,backendM, x)
  235.357 μs (273 allocations: 251.92 KiB)
1-element Vector{Float64}:
 -0.06677430082653296

The @nograd macro of Zygote is quite useful, but unfortunately it is not seen by the other AD frameworks.
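A more backend-agnostic option might be to declare the solve non-differentiable at the ChainRules level rather than with the Zygote-specific macro. This is only a sketch I have not benchmarked: it assumes ChainRulesCore.jl is loaded, and it only helps backends that consult ChainRules rules (Zygote does; ForwardDiff does not, and as far as I understand Mooncake only uses ChainRules rules if they are imported explicitly, e.g. via Mooncake.@from_rrule).

using ChainRulesCore

# Tell ChainRules-aware reverse-mode backends that lin_solve contributes
# nothing to the gradient: its pullback returns no tangents, so the backend
# does not differentiate through the linear solve.
ChainRulesCore.@non_differentiable lin_solve(m, y)

With this rule in place, the plain objective should behave like objective_nograd under Zygote, without tying the code to a single AD framework.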

I’m not sure I understand what you mean by that, and I’m a bit puzzled by the two versions of the gradient returning the same value. a(x) does depend on x, doesn’t it?

a(x) does depend on x, but since \frac{\partial L}{\partial a}\big|_{a = a(x)} = 0 with L(x, a) = \| f(x)\, a - y \|^2, whatever \frac{\partial a(x)}{\partial x} is, that part of the gradient cancels, leaving only the part involving \frac{\partial f(x)}{\partial x}.
This kind of situation comes up quite often when one wants to “marginalize” these linear coefficients a.
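As a quick numerical sanity check (a sketch reusing k, n, y and objective from the MWE above; x0, m0, α0 and frozen are names I introduce only for this check): freeze a at its value for the current x and differentiate only through f(x). The resulting gradient should match the full one at that point.

using LinearAlgebra, ForwardDiff

x0 = [1.0]
m0 = hcat(sin.(k .+ x0), ones(n))
α0 = (m0' * m0) \ (m0' * y)   # a(x0), held fixed below

# same objective, but with the linear coefficients frozen at a(x0)
frozen(x) = norm(hcat(sin.(k .+ x), ones(n)) * α0 .- y)

ForwardDiff.gradient(frozen, x0)   # agrees with ForwardDiff.gradient(objective, x0)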