Compute gradient of gradient norm using Zygote

brentian · April 19, 2022, 11:04pm

I’m new to Zygote.jl and I’d like to compute the gradient of the squared norm of a gradient, i.e.,

g = gradient(f, x),

q = gradient(g'*g, x)

I successfully did this using forward mode:

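using ForwardDiff

# Gradient of 0.5 * ||∇f(x)||^2 via nested forward-mode AD.
# Note: the argument `d` is not used in this snippet.
function grad_g_sqr(f, d, x)
    function gs(x)
        a = ForwardDiff.gradient(f, x)   # inner gradient ∇f(x)
        return a' * a / 2
    end
    gg = ForwardDiff.gradient(gs, x)     # outer gradient of the scalar gs
    return gg
end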

I heard backward mode is better, so I tried to use Zygote:

# change to backward AD.
function hessz(f, d, x)
    gs(x) = sum(Zygote.gradient(f, x)[1] .^ 2)
    gg = Zygote.gradient(gs, x)[1]
    return gg
end

Then I caught the exception,

ERROR: Can't differentiate foreigncall expression

Is there anything I missed?

gdalle · April 20, 2022, 5:22am

Hi and welcome to the community!
Could you post a minimal working example so that it is easier to help you?

At first glance, it seems the problem comes from nested derivatives with Zygote: maybe this link can help

github.com/FluxML/Zygote.jl

nested derivative does not work

opened 02:23AM - 30 Oct 18 UTC

afternone

```julia
julia> derivative(x->x*derivative(y->x+y,1),1)
ERROR: MethodError: no method matching exprtype(::Core.Compiler.IRCode, ::String)
Closest candidates are:
  exprtype(::Core.Compiler.IRCode, ::Expr) at C:\Users\han\.julia\packages\Zygote\5mNII\src\tools\ir.jl:54
  exprtype(::Core.Compiler.IRCode, ::QuoteNode) at C:\Users\han\.julia\packages\Zygote\5mNII\src\tools\ir.jl:51
  exprtype(::Core.Compiler.IRCode, ::GlobalRef) at C:\Users\han\.julia\packages\Zygote\5mNII\src\tools\ir.jl:50
  ...
Stacktrace:
 [1] _broadcast_getindex_evalf at .\broadcast.jl:574 [inlined]
 [2] _broadcast_getindex at .\broadcast.jl:547 [inlined]
 [3] getindex at .\broadcast.jl:507 [inlined]
 [4] copyto_nonleaf!(::Array{DataType,1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(Zygote.exprtype),Tuple{Base.RefValue{Core.Compiler.IRCode},Base.Broadcast.Extruded{Array{Any,1},Tuple{Bool},Tuple{Int64}}}}, ::Base.OneTo{Int64}, ::Int64, ::Int64) at .\broadcast.jl:923
 [5] copy(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(Zygote.exprtype),Tuple{Base.RefValue{Core.Compiler.IRCode},Array{Any,1}}}) at .\broadcast.jl:786
 [6] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(Zygote.exprtype),Tuple{Base.RefValue{Core.Compiler.IRCode},Array{Any,1}}}) at .\broadcast.jl:748
 [7] record!(::Core.Compiler.IRCode) at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\reverse.jl:144
 [8] #Primal#39(::Int64, ::Type, ::Core.Compiler.IRCode) at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\reverse.jl:189
 [9] Type at .\none:0 [inlined]
 [10] #Adjoint#65 at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\reverse.jl:387 [inlined]
 [11] (::getfield(Core, Symbol("#kw#Type")))(::NamedTuple{(:varargs,),Tuple{Int64}}, ::Type{Zygote.Adjoint}, ::Core.Compiler.IRCode) at .\none:0
 [12] _lookup_grad(::Type) at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\emit.jl:121
 [13] #s18#633 at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\interface2.jl:17 [inlined]
 [14] #s18#633(::Any, ::Any, ::Any) at .\none:0
 [15] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any,N} where N) at .\boot.jl:506
 [16] derivative at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\interface.jl:37 [inlined]
 [17] (::Zygote.J{Tuple{typeof(derivative),getfield(Main, Symbol("##32#34")){Int64},Int64},Tuple{typeof(derivative),getfield(Main, Symbol("##32#34")){Int64},Int64,getfield(Zygote, Symbol("##148#back2#115")){getfield(Zygote, Symbol("##111#113")){1,Int64}},Zygote.J{Tuple{typeof(gradient),getfield(Main, Symbol("##32#34")){Int64},Int64},Tuple{typeof(gradient)}}}})(::Int64) at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\interface2.jl:0
 [18] #31 at .\REPL[11]:1 [inlined]
 [19] (::Zygote.J{Tuple{getfield(Main, Symbol("##31#33")),Int64},Tuple{getfield(Main, Symbol("##31#33")),Int64,getfield(Zygote, Symbol("##796#back2#450")){getfield(Zygote, Symbol("##448#449")){Int64,Int64}},Zygote.J{Tuple{typeof(derivative),getfield(Main, Symbol("##32#34")){Int64},Int64},Tuple{typeof(derivative),getfield(Main, Symbol("##32#34")){Int64},Int64,getfield(Zygote, Symbol("##148#back2#115")){getfield(Zygote, Symbol("##111#113")){1,Int64}},Zygote.J{Tuple{typeof(gradient),getfield(Main, Symbol("##32#34")){Int64},Int64},Tuple{typeof(gradient)}}}},getfield(Zygote, Symbol("##194#back2#147")){Zygote.Jnew{getfield(Main, Symbol("##32#34")){Int64},Nothing}}}})(::Int64) at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\interface2.jl:0
 [20] (::getfield(Zygote, Symbol("##66#67")){Zygote.J{Tuple{getfield(Main, Symbol("##31#33")),Int64},Tuple{getfield(Main, Symbol("##31#33")),Int64,getfield(Zygote, Symbol("##796#back2#450")){getfield(Zygote, Symbol("##448#449")){Int64,Int64}},Zygote.J{Tuple{typeof(derivative),getfield(Main, Symbol("##32#34")){Int64},Int64},Tuple{typeof(derivative),getfield(Main, Symbol("##32#34")){Int64},Int64,getfield(Zygote, Symbol("##148#back2#115")){getfield(Zygote, Symbol("##111#113")){1,Int64}},Zygote.J{Tuple{typeof(gradient),getfield(Main, Symbol("##32#34")){Int64},Int64},Tuple{typeof(gradient)}}}},getfield(Zygote, Symbol("##194#back2#147")){Zygote.Jnew{getfield(Main, Symbol("##32#34")){Int64},Nothing}}}}})(::Int64) at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\interface.jl:28
 [21] gradient(::Function, ::Int64) at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\interface.jl:34
 [22] derivative(::Function, ::Int64) at C:\Users\han\.julia\packages\Zygote\5mNII\src\compiler\interface.jl:37
 [23] top-level scope at none:0
```

ToucheSir · April 20, 2022, 7:52pm

Nested reverse AD with Zygote is very limited in terms of what it supports and (as I understand it) is almost never what you want. If you want to compute a Hessian, check out the Utilities page of the Zygote documentation.
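
For reference, here is a minimal sketch of what the Hessian route could look like for this problem. It relies on the identity ∇(g'g/2) = H g, where g = ∇f(x) and H is the Hessian of f. The toy objective `f` below is an assumption for illustration and does not come from the thread. This builds the full n×n Hessian, so it is only sensible for small problems.

```julia
using Zygote

# Toy objective (an assumption for illustration): f(x) = sum((x_i - 1)^2)
f(x) = sum((x .- 1) .^ 2)

x = zeros(3)
g = Zygote.gradient(f, x)[1]   # reverse-mode gradient of f at x
H = Zygote.hessian(f, x)       # dense Hessian of f at x

# Gradient of g'g/2 with respect to x: d/dx (g'g/2) = (dg/dx)' * g = H * g
gg = H * g
```

Here `f` is quadratic, so `H` is constant; for a general `f` both `g` and `H` have to be evaluated at the same point `x`.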

cortner · April 20, 2022, 9:01pm

I don’t think the OP wants a Hessian? (Correct me if I’m wrong!)

I actually have the same situation quite often and fix it by writing wrappers for the chain rules. And then chainrules for the chainrules.

Sorry this is a really short and confusing reply. I’ll try to put together a script later to demonstrate what I mean.

goerch · April 20, 2022, 9:10pm

That would be beautiful, because we receive quite a number of complaints regarding AD and it would simplify things to have a common line of defense;)

ToucheSir · April 20, 2022, 10:28pm

I was going off of function hessz(...), but mixing forward + reverse modes is just as applicable to any kind of nested AD.

brentian · April 20, 2022, 11:09pm

Thanks gdalle, I finally wrapped up an example that is working. What I did:

julia> ff(x) = sum((x .- 1).^2)
ff (generic function with 1 method)
julia> using Zygote
julia> function hessz(f, x)
                  gs(x) = sum(Zygote.gradient(f, x)[1] .^ 2 / 2)
                  gg = Zygote.gradient(gs, x)[1]
                  return gg
              end
hessz (generic function with 1 method)

julia> hessz(ff, zeros(3))
3-element Vector{Float64}:
 -4.0
 -4.0
 -4.0

In a more complex example, I realized that the problem arises when using Flux.logitcrossentropy, something like:

using Flux
using MLDatasets
using Flux: logitcrossentropy, normalise, onecold, onehotbatch
using Statistics: mean
using Zygote
using Parameters: @with_kw

@with_kw mutable struct Args
    lr::Float64 = 0.5
    repeat::Int = 110
end

function get_processed_data(args)
    labels = MLDatasets.Iris.labels()
    features = MLDatasets.Iris.features()

    # Subtract mean, divide by std dev for normed mean of 0 and std dev of 1.
    normed_features = normalise(features, dims=2)

    klasses = sort(unique(labels))
    onehot_labels = onehotbatch(labels, klasses)

    # Split into training and test sets, 2/3 for training, 1/3 for test.
    train_indices = [1:3:150; 2:3:150]

    X_train = normed_features[:, train_indices]
    y_train = onehot_labels[:, train_indices]

    X_test = normed_features[:, 3:3:150]
    y_test = onehot_labels[:, 3:3:150]

    #repeat the data `args.repeat` times

    train_data_iter = Iterators.repeated((X_train, y_train), args.repeat)
    train_data = (X_train, y_train)
    test_data = (X_test, y_test)

    return train_data, train_data_iter, test_data
end



# Initialize hyperparameter arguments
args = Args(; lr=0.1)

#Loading processed data
train_data, train_data_iter, test_data = get_processed_data(args)
x_train, yc_train = train_data
x_test, yc_test = test_data
function logit_model(wbv, x)
    wb = reshape(wbv, 3, :)
    return wb[:, 1:end-1] * x .+ wb[:, end]
end

loss_train(wb) = Flux.logitcrossentropy(logit_model(wb, x_train), yc_train)

w0 = ones(15)

Then proceed,

julia> function hessz(f, x)
                         gs(x) = sum(Zygote.gradient(f, x)[1] .^ 2 / 2)
                         gg = Zygote.gradient(gs, x)[1]
                         return gg
                     end
hessz (generic function with 1 method)

julia> hessz(loss_train, w0)
ERROR: Can't differentiate foreigncall expression
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] Pullback
    @ ./iddict.jl:102 [inlined]
  [3] (::typeof(∂(get)))(Δ::Nothing)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
  [4] Pullback
    @ ~/.julia/packages/Zygote/H6vD3/src/lib/lib.jl:68 [inlined]
  [5] (::typeof(∂(accum_global)))(Δ::Nothing)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
  [6] Pullback
    @ ~/.julia/packages/Zygote/H6vD3/src/lib/lib.jl:79 [inlined]
  [7] (::typeof(∂(λ)))(Δ::Nothing)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
  [8] Pullback
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:67 [inlined]
  [9] (::typeof(∂(λ)))(Δ::Nothing)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
 [10] getindex
    @ ./tuple.jl:29 [inlined]
 [11] map
    @ ./tuple.jl:222 [inlined]
 [12] unthunk_tangent
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:36 [inlined]
 [13] #1630#back
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:67 [inlined]
 [14] (::typeof(∂(λ)))(Δ::Tuple{Nothing, Vector{Float64}})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
 [15] Pullback
    @ ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:41 [inlined]
 [16] (::typeof(∂(λ)))(Δ::Tuple{Vector{Float64}})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
 [17] Pullback
    @ ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:76 [inlined]
 [18] (::typeof(∂(gradient)))(Δ::Tuple{Vector{Float64}})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
 [19] Pullback
    @ ./REPL[1]:2 [inlined]
 [20] (::typeof(∂(gs)))(Δ::Float64)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
 [21] (::Zygote.var"#56#57"{typeof(∂(gs))})(Δ::Float64)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:41
 [22] gradient(f::Function, args::Vector{Float64})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:76
 [23] hessz(f::Function, x::Vector{Float64})
    @ Main ./REPL[1]:3
 [24] top-level scope
    @ REPL[2]:1

brentian · April 20, 2022, 11:10pm

Thanks for the comment. I was actually trying to avoid computing the full Hessian. Do you have an example of how to mix the two modes?

brentian · April 20, 2022, 11:12pm

Looking forward!

ToucheSir · April 20, 2022, 11:48pm

Have a gander at Gradient of Gradient in Zygote - #3 by ChrisRackauckas.
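
As a rough sketch of what mixing modes could look like here (an illustration, not code from the linked post): the gradient of g'g/2 is the Hessian-vector product H g, and that product is the directional derivative of ∇f along g. One forward-mode derivative over one Zygote gradient is therefore enough, without ever forming H. The names `grad_half_sqnorm` and the toy objective `f` are hypothetical. Whether this works for a given `f` depends on everything inside `f` accepting ForwardDiff dual numbers.

```julia
using Zygote, ForwardDiff

# Toy objective (an assumption for illustration)
f(x) = sum((x .- 1) .^ 2)

# Gradient of 0.5 * ||∇f(x)||^2, computed as the Hessian-vector product H(x) * ∇f(x)
function grad_half_sqnorm(f, x)
    g = Zygote.gradient(f, x)[1]   # reverse mode: g = ∇f(x)
    # forward mode over reverse mode: d/dt ∇f(x + t*g) at t = 0 equals H(x) * g
    return ForwardDiff.derivative(t -> Zygote.gradient(f, x .+ t .* g)[1], 0.0)
end

result = grad_half_sqnorm(f, zeros(3))
```

The cost is roughly that of a small constant number of gradient evaluations, independent of the length of `x`.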

cortner · April 21, 2022, 4:40pm

I really think the whole discussion of gradient of gradient vs. Jacobian and mixing backward and forward modes doesn’t apply here. To compute the gradient of f we of course use backward mode. Then g'*g is again a scalar, so we should of course use backward mode again. The canonical situation, which is why I’m interested in it and which I assume might be the case here too, is when you train a model on gradients of the output.

So I looked back into my own code and I have to admit I misremembered a few details. The reason we can do this is that we implement our own rrules for all first derivatives. If you do that, then you can take two Zygote gradients. But, so far at least, I haven’t managed to produce a simple example where I take two derivatives without any intervention of this kind. I’ll keep trying and report back here if I find something.

Here is my toy example. I fully appreciate this is not what the OP asked. The reason this works really well for us is that we are more than happy implementing the gradients ourselves (in fact we have to for performance reasons) and then just let Zygote differentiate the loss.

using Zygote, ChainRules
import ChainRules: rrule, NoTangent 

f(x) = sum(x[i]*x[i+1] for i = 1:length(x)-1)

function rrule(::typeof(f), x) 
  _pb_f(x, w::Number) = w * [ [x[2]]; [ x[i-1]+x[i+1] for i = 2:length(x)-1]; [x[end-1]] ]
  _pb_f(x, w) = (@show w; error("no pb for this")) 
  return f(x), w -> (NoTangent(), _pb_f(x, w))
end

grad_f(x) = Zygote.gradient(f, x)[1] 
L(x) = sum( grad_f(x).^2 )

x = rand(10)
@show L(x)
@show Zygote.gradient(L, x)[1]
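
For readers checking the hand-written pullback above, the gradient it encodes can be written out as follows, with n = length(x). This is a worked derivation added for clarity, under the assumption that the rrule is meant to return the exact gradient of f scaled by w.

```latex
f(x) = \sum_{i=1}^{n-1} x_i x_{i+1},
\qquad
\frac{\partial f}{\partial x_j} =
\begin{cases}
x_2, & j = 1,\\
x_{j-1} + x_{j+1}, & 1 < j < n,\\
x_{n-1}, & j = n,
\end{cases}
\qquad
L(x) = \sum_{j=1}^{n} \left( \frac{\partial f}{\partial x_j} \right)^{2}.
```

The three cases correspond to the three concatenated pieces `[x[2]]`, `[x[i-1]+x[i+1] for i = 2:length(x)-1]` and `[x[end-1]]` in `_pb_f`.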

brentian · April 22, 2022, 12:07am

You are right about my motivations here. Our interest in the gradient of g'g is basically that we are trying to find an alternative to ADAM. Thanks for your example and efforts here; I will keep you informed if I find anything useful, but I doubt it, since you obviously have more expertise in this : )

cortner · April 22, 2022, 2:41am

not necessarily - I usually try until it works and then move on :). I’d be grateful to hear if you learn more about this.

F-YF · August 22, 2022, 2:52am

What does Zygote.hessian(f, x) do? Is it implemented by mixing forward and reverse mode? Does it take the reverse-mode derivative of f to get g, and then the forward-mode derivative of g to get h?

marius311 · August 22, 2022, 5:56am

Yea. Think of it like this: Zygote generates some code which computes the gradient, then you push an array of ForwardDiff Duals through that code to get the Jacobian (the Jacobian of the gradient being the Hessian).

Should also be pretty much equivalent to this:

using AbstractDifferentiation, LinearAlgebra, Zygote, ForwardDiff
AD.jacobian(AD.ForwardDiffBackend(), x -> AD.gradient(AD.ZygoteBackend(), x -> norm(x), x)[1], [1,2,3])[1]

which can also be written

AD.hessian(AD.HigherOrderBackend((AD.ForwardDiffBackend(), AD.ZygoteBackend())), norm, [1,2,3])[1]

(I’ve recently been playing more with AbstractDifferentiation.jl which IMO is coming out really nice, and an easy way to quickly swap out these different backends or try different combinations and see what works / is fast)
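
The "push Duals through the gradient code" picture can also be sketched without AbstractDifferentiation, using ForwardDiff and Zygote directly. This is only an illustration of the idea, not a claim about how Zygote.hessian is implemented line by line; the test function `h` is an assumption chosen because its Hessian is easy to check by hand.

```julia
using Zygote, ForwardDiff

# Illustrative test function: the Hessian of sum(x.^3) is diagonal with entries 6x
h(x) = sum(x .^ 3)

x = [1.0, 2.0, 3.0]

# Forward-mode Jacobian of the reverse-mode gradient
H_manual = ForwardDiff.jacobian(y -> Zygote.gradient(h, y)[1], x)

# Built-in utility for comparison
H_zygote = Zygote.hessian(h, x)
```

Both matrices should agree up to floating-point error.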

ToucheSir · August 23, 2022, 12:40am

I wonder if we should have documentation somewhere showing how to mix AD libraries. e.g. for the MWE in this thread, maybe Zygote over Tracker could work. That would require people to test out some combinations (i.e. non-existent extra maintainer time), so if anyone is interested let me know and I can help you get started.

F-YF · August 26, 2022, 9:54am

Why doesn’t this work? What’s the interface for taking the second derivative of a function of one variable?

AD.hessian(AD.HigherOrderBackend((AD.ForwardDiffBackend(), AD.ZygoteBackend())), x->x^3, 1)[1]

marius311 · August 26, 2022, 5:42pm

There is none, you’ll need to make it a length-1 vector and wrap/unwrap it:

AD.hessian(AD.HigherOrderBackend((AD.ForwardDiffBackend(), AD.ZygoteBackend())), x->x[1]^3, [1])[1][1]

I actually asked a related (yet-unanswered) question here: Whats the reason for the derivative / gradient difference? · Issue #61 · JuliaDiff/AbstractDifferentiation.jl · GitHub
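
If only a scalar second derivative is needed, one alternative that avoids the wrap/unwrap step is to nest ForwardDiff.derivative directly. This is a sketch using plain ForwardDiff rather than the AbstractDifferentiation interface discussed above; the helper names `cube` and `second_derivative` are hypothetical.

```julia
using ForwardDiff

cube(x) = x^3

# Second derivative of a scalar function via nested forward-mode AD
second_derivative(f, x) = ForwardDiff.derivative(y -> ForwardDiff.derivative(f, y), x)

d2 = second_derivative(cube, 1.0)   # analytically 6x, so 6.0 at x = 1
```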
