Why does Zygote produce a wrong derivative?

Trying to get my head around AD in Julia, and inspired by Jarrett Revels' example of computing the Jacobian of a `cumprod()` function with `ForwardDiff.jl`, I tried to see whether I could get the same done with `Zygote.jl`. I got an unexpected result with a wrong answer. Maybe someone can help explain where I went wrong, or what the limitations of `Zygote` are.

First I realised that naive implementations of `cumprod()`, including `Base.cumprod()`, are not suitable for `Zygote` because they mutate arrays. So I came up with my own function, which avoids mutation and constructs the cumulative product with an array comprehension and an accumulating function.

``````julia
## accumulating function
mutable struct Accu{T} <: Function
    a::T
end

Accu(T::Type) = Accu(one(T))

function (c::Accu)(arg)
    c.a *= arg
end

function ac_cumprod(x)
    acc = Accu(eltype(x))
    return [acc(el) for el in x]
end
``````

It seems to do what it should,

``````
julia> ac_cumprod([1,2,3])
3-element Array{Int64,1}:
 1
 2
 6
``````

and `ForwardDiff` computes the Jacobian correctly (and so does `ReverseDiff`)

``````
julia> using ForwardDiff, BenchmarkTools

julia> @btime ForwardDiff.jacobian(ac_cumprod, [1,2,3])
  442.434 ns (7 allocations: 864 bytes)
3×3 Array{Int64,2}:
 1  0  0
 2  1  0
 6  3  2
``````

In order to compute the Jacobian with `Zygote`, I used code I found in pull request #747 (slightly modified):

``````julia
using Zygote

"""
    zy_jacobian(f, x)

Construct the Jacobian of `f` where `x` is a real-valued array
and `f(x)` is also a real-valued array.
"""
function zy_jacobian(f, x)
    y, back = Zygote.pullback(f, x)
    k = length(y)
    n = length(x)
    J = Matrix{eltype(y)}(undef, k, n)
    e_i = fill!(similar(y), 0)  # the seed lives in the output space
    @inbounds for i = 1:k
        e_i[i] = oneunit(eltype(y))
        J[i, :] = back(e_i)[1]
        e_i[i] = zero(eltype(y))
    end
    J
end
``````

Using this function to compute the Jacobian of my function, however, yields a wrong result:

``````
julia> @btime zy_jacobian(ac_cumprod, [1,2,3])
  76.221 μs (333 allocations: 12.16 KiB)
3×3 Array{Int64,2}:
  1   0   0
  8   4   2
 54  27  18
``````

Moreover, execution is orders of magnitude slower than with `ForwardDiff`. So where did things go wrong?

I suspect the culprit of the error is the mutating structure `Accu`. This implementation (using a naive `cumprod` approach) returns correct values with `zy_jacobian`:

``````julia
function mycumprod(x)
    return [prod(@view x[begin:i]) for i in eachindex(x)]
end
``````

For comparison, `ac_cumprod`:

``````
julia> @btime zy_jacobian(ac_cumprod, [1,2,3])
  100.801 μs (333 allocations: 12.16 KiB)
3×3 Array{Int64,2}:
  1   0   0
  8   4   2
 54  27  18
``````

The naive `cumprod`, `mycumprod`:

``````
julia> @btime zy_jacobian(mycumprod, [1,2,3])
  26.401 μs (229 allocations: 11.42 KiB)
3×3 Array{Int64,2}:
 1  0  0
 2  1  0
 6  3  2
``````

The definitive solution for this, though, is adding a derivative rule for `cumprod` to ChainRules.jl. I opened an issue.
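For reference, here is a rough sketch of what such a rule could look like via `ChainRulesCore.rrule`. This is a hypothetical illustration, not the actual ChainRules.jl implementation: it assumes `x` has no zero entries (a real rule must handle zeros separately), and the names inside are my own.

``````julia
using ChainRulesCore

# Hypothetical sketch: for y = cumprod(x), ∂y_j/∂x_i = y_j / x_i for j ≥ i
# (valid only when x_i ≠ 0), so the pullback can contract ȳ against that
# triangular structure without building the full Jacobian.
function ChainRulesCore.rrule(::typeof(cumprod), x::AbstractVector{<:Real})
    y = cumprod(x)
    function cumprod_pullback(ȳ)
        x̄ = similar(x, float(eltype(x)))
        for i in eachindex(x)
            # accumulate ȳ_j * ∂y_j/∂x_i over all outputs j ≥ i
            x̄[i] = sum(ȳ[j] * y[j] for j in i:lastindex(x)) / x[i]
        end
        return (NoTangent(), x̄)
    end
    return y, cumprod_pullback
end
``````

With a rule like this in place, Zygote should hit the rule for `cumprod` instead of trying to differentiate through array mutation.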


Agree that mutation is likely the problem, surprised it doesn’t give an error. Here’s a simple adaptation without mutation. Still very slow compared to ForwardDiff here:

``````
julia> function cumprod_2(xs)
           acc = one(eltype(xs))
           map(xs) do x
               acc = acc * x
           end
       end
cumprod_2 (generic function with 1 method)

julia> cumprod_2([1,2,3])
3-element Array{Int64,1}:
 1
 2
 6

julia> zy_jacobian(cumprod_2, [1,2,3])
3×3 Array{Int64,2}:
 1  0  0
 2  1  0
 6  3  2

julia> @btime ForwardDiff.jacobian(ac_cumprod, xs) setup=(xs=randn(100));
  44.594 μs (22 allocations: 183.45 KiB)

julia> @btime ForwardDiff.jacobian(cumprod_2, xs) setup=(xs=randn(100));
  74.680 μs (1840 allocations: 380.05 KiB)

julia> @btime zy_jacobian(cumprod_2, xs) setup=(xs=randn(100)); # my scalar accumulation
  80.275 ms (273341 allocations: 5.95 MiB)

julia> @btime zy_jacobian(mycumprod, xs) setup=(xs=randn(100)); # from longemen3000, views
  41.428 ms (215628 allocations: 26.61 MiB)

julia> @btime zy_jacobian(xs -> Zygote.forwarddiff(cumprod, xs), xs) setup=(xs=randn(100)); # calling ForwardDiff within the Zygote jacobian, wasteful
  277.565 μs (858 allocations: 541.48 KiB)
``````

Reverse mode is faster when the dimension of the input is much greater than the dimension of the output. That isn't the case here, so forward mode is algorithmically faster.
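To make that concrete, here is a small sketch using the non-mutating `mycumprod` from above (assuming `Zygote` and `ForwardDiff` are installed): for a function with many inputs and one output, reverse mode recovers the whole gradient from a single backward pass, while forward mode needs roughly one pass per input (up to chunking).

``````julia
using Zygote, ForwardDiff

# Non-mutating cumprod from earlier in the thread
mycumprod(x) = [prod(@view x[begin:i]) for i in eachindex(x)]

# 100 inputs → 1 output: the regime where reverse mode wins
f(x) = sum(mycumprod(x))

x = randn(100)
g_rev = Zygote.gradient(f, x)[1]    # one backward pass
g_fwd = ForwardDiff.gradient(f, x)  # many forward passes (chunked)
g_rev ≈ g_fwd                       # true
``````

For the N→N Jacobian in this thread, neither direction has that structural advantage, so ForwardDiff's lower per-pass overhead dominates.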

Many thanks for the answers including the improved code examples!

I had this suspicion as well, although the documentation reads as though mutating data structures should be safe to use.

That may well be the case, but why is the `Zygote.forwarddiff` variant then so particularly slow or wasteful? Shouldn't injecting forward differentiation into Zygote be the solution to avoiding the algorithmically slower reverse-mode AD?

I thought the complexity of forward and reverse mode for a Jacobian was the same (for an N→N function like this), but the overhead of ForwardDiff is much lower.

I said “wasteful” because `Zygote.forwarddiff` computes the full Jacobian (before contracting to get the gradient), and this is called N times by `zy_jacobian`. But then it should be 100x slower, in my example, and it’s not. It turns out that `J` is cached: https://github.com/FluxML/Zygote.jl/blob/master/src/lib/forward.jl#L100 computes it once when you call `Zygote.pullback(f, x)`, and re-uses that on every call of `back`.
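A quick way to see the caching in action (a sketch, assuming the behaviour described above): the ForwardDiff Jacobian is built during `Zygote.pullback`, so each subsequent `back` call is only a cheap contraction of the cached `J` with the seed vector.

``````julia
using Zygote

xs = randn(100)
# The full ForwardDiff Jacobian J is computed once, right here:
y, back = Zygote.pullback(x -> Zygote.forwarddiff(cumprod, x), xs)

seed = zeros(length(y)); seed[1] = 1.0
# Each call of `back` contracts the cached J with the seed
# (no new ForwardDiff pass); for y₁ = x₁ this returns e₁.
row1 = back(seed)[1]
``````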
