Why does Zygote produce a wrong derivative?

Trying to get my head around AD in Julia, and inspired by Jarrett Revels' example of computing the Jacobian of a `cumprod()` function with `ForwardDiff.jl`, I tried to see whether I could get the same done with `Zygote.jl`. I got an unexpected result with a wrong answer. Maybe someone can help explain where I went wrong, or what the limitations of `Zygote` are.

First I realised that naive implementations of `cumprod()`, including `Base.cumprod()`, are not suitable for `Zygote` because they mutate arrays. So I came up with my own function, which avoids mutation and constructs the cumulative product with an array comprehension and an accumulating function.

``````julia
## accumulating function
mutable struct Accu{T} <: Function
    a::T
end

Accu(T::Type) = Accu(one(T))

function (c::Accu)(arg)
    c.a *= arg
end

function ac_cumprod(x)
    acc = Accu(eltype(x))
    return [acc(el) for el in x]
end
``````

It seems to do what it should,

``````
julia> ac_cumprod([1,2,3])
3-element Array{Int64,1}:
 1
 2
 6
``````

and `ForwardDiff` computes the Jacobian correctly (and so does `ReverseDiff`)

``````
julia> using ForwardDiff, BenchmarkTools

julia> @btime ForwardDiff.jacobian(ac_cumprod, [1,2,3])
  442.434 ns (7 allocations: 864 bytes)
3×3 Array{Int64,2}:
 1  0  0
 2  1  0
 6  3  2
``````

In order to compute the Jacobian with `Zygote`, I used code I found in pull request #747 (slightly modified):

``````julia
using Zygote

"""
    zy_jacobian(f, x)

Construct the Jacobian of `f` where `x` is a real-valued array
and `f(x)` is also a real-valued array.
"""
function zy_jacobian(f, x)
    y, back = Zygote.pullback(f, x)
    k = length(y)
    n = length(x)
    J = Matrix{eltype(y)}(undef, k, n)
    e_i = fill!(similar(y), 0)  # the seed lives in the output space
    @inbounds for i = 1:k
        e_i[i] = oneunit(eltype(y))
        J[i, :] = back(e_i)[1]
        e_i[i] = zero(eltype(y))
    end
    J
end
``````

Using this function to compute the Jacobian of my function, however, yields a wrong result:

``````
julia> @btime zy_jacobian(ac_cumprod, [1,2,3])
  76.221 μs (333 allocations: 12.16 KiB)
3×3 Array{Int64,2}:
  1   0   0
  8   4   2
 54  27  18
``````

Moreover, execution is orders of magnitude slower than with `ForwardDiff`. So where did things go wrong?

I suspect the culprit of the error is the mutating structure `Accu`. This implementation (using a naive `cumprod` approach) returns correct values with `zy_jacobian`:

``````julia
function mycumprod(x)
    return [prod(@view x[begin:i]) for i in eachindex(x)]
end
``````

For comparison, `ac_cumprod`:

``````
julia> @btime zy_jacobian(ac_cumprod, [1,2,3])
  100.801 μs (333 allocations: 12.16 KiB)
3×3 Array{Int64,2}:
  1   0   0
  8   4   2
 54  27  18
``````

The naive `cumprod`, `mycumprod`:

``````
julia> @btime zy_jacobian(mycumprod, [1,2,3])
  26.401 μs (229 allocations: 11.42 KiB)
3×3 Array{Int64,2}:
 1  0  0
 2  1  0
 6  3  2
``````

The definitive solution for this, though, is adding a derivative rule for `cumprod` to ChainRules.jl. I opened an issue.
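For reference, here is a rough sketch of what such a rule could look like via `ChainRulesCore.rrule`. This is a hypothetical illustration, not the actual ChainRules.jl implementation: it assumes `x` has no zero entries (a real rule must handle zeros separately), and the names inside are my own.

``````julia
using ChainRulesCore

# Hypothetical sketch: for y = cumprod(x), ∂y_j/∂x_i = y_j / x_i for j ≥ i
# (valid only when x_i ≠ 0), so the pullback can contract ȳ against that
# triangular structure without building the full Jacobian.
function ChainRulesCore.rrule(::typeof(cumprod), x::AbstractVector{<:Real})
    y = cumprod(x)
    function cumprod_pullback(ȳ)
        x̄ = similar(x, float(eltype(x)))
        for i in eachindex(x)
            # accumulate ȳ_j * ∂y_j/∂x_i over all outputs j ≥ i
            x̄[i] = sum(ȳ[j] * y[j] for j in i:lastindex(x)) / x[i]
        end
        return (NoTangent(), x̄)
    end
    return y, cumprod_pullback
end
``````

With a rule like this in place, Zygote should hit the rule for `cumprod` instead of trying to differentiate through array mutation.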


Agree that mutation is likely the problem, surprised it doesn’t give an error. Here’s a simple adaptation without mutation. Still very slow compared to ForwardDiff here:

``````
julia> function cumprod_2(xs)
           acc = one(eltype(xs))
           map(xs) do x
               acc = acc * x
           end
       end
cumprod_2 (generic function with 1 method)

julia> cumprod_2([1,2,3])
3-element Array{Int64,1}:
 1
 2
 6

julia> zy_jacobian(cumprod_2, [1,2,3])
3×3 Array{Int64,2}:
 1  0  0
 2  1  0
 6  3  2

julia> @btime ForwardDiff.jacobian(ac_cumprod, xs) setup=(xs=randn(100));
  44.594 μs (22 allocations: 183.45 KiB)

julia> @btime ForwardDiff.jacobian(cumprod_2, xs) setup=(xs=randn(100));
  74.680 μs (1840 allocations: 380.05 KiB)

julia> @btime zy_jacobian(cumprod_2, xs) setup=(xs=randn(100)); # my scalar accumulation
  80.275 ms (273341 allocations: 5.95 MiB)

julia> @btime zy_jacobian(mycumprod, xs) setup=(xs=randn(100)); # from longemen3000, views
  41.428 ms (215628 allocations: 26.61 MiB)

julia> @btime zy_jacobian(xs -> Zygote.forwarddiff(cumprod, xs), xs) setup=(xs=randn(100)); # calling ForwardDiff within the Zygote jacobian, wasteful
  277.565 μs (858 allocations: 541.48 KiB)
``````

Reverse mode is faster when the dimension of the input is much greater than the dimension of the output. That isn't the case here, so forward mode is algorithmically faster.
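To make that concrete, here is a small sketch using the non-mutating `mycumprod` from above (assuming `Zygote` and `ForwardDiff` are installed): for a function with many inputs and one output, reverse mode recovers the whole gradient from a single backward pass, while forward mode needs roughly one pass per input (up to chunking).

``````julia
using Zygote, ForwardDiff

# Non-mutating cumprod from earlier in the thread
mycumprod(x) = [prod(@view x[begin:i]) for i in eachindex(x)]

# 100 inputs → 1 output: the regime where reverse mode wins
f(x) = sum(mycumprod(x))

x = randn(100)
g_rev = Zygote.gradient(f, x)[1]    # one backward pass
g_fwd = ForwardDiff.gradient(f, x)  # many forward passes (chunked)
g_rev ≈ g_fwd                       # true
``````

For the N→N Jacobian in this thread, neither direction has that structural advantage, so ForwardDiff's lower per-pass overhead dominates.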

Many thanks for the answers including the improved code examples!

I had this suspicion as well, although the documentation reads as though mutating data structures should be safe to use.

That may well be the case, but why is the `Zygote.forwarddiff` variant then so particularly slow or wasteful? Shouldn't injecting forward differentiation into Zygote be the solution to avoiding the algorithmically slower reverse-mode AD?

I thought the complexity of forward and reverse mode for a Jacobian was the same (for an N→N function like this), but the overhead of ForwardDiff is much lower.

I said “wasteful” because `Zygote.forwarddiff` computes the full Jacobian (before contracting to get the gradient), and this is called N times by `zy_jacobian`. But then it should be 100x slower, in my example, and it’s not. It turns out that `J` is cached: https://github.com/FluxML/Zygote.jl/blob/master/src/lib/forward.jl#L100 computes it once when you call `Zygote.pullback(f, x)`, and re-uses that on every call of `back`.
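A quick way to see the caching in action (a sketch, assuming the behaviour described above): the ForwardDiff Jacobian is built during `Zygote.pullback`, so each subsequent `back` call is only a cheap contraction of the cached `J` with the seed vector.

``````julia
using Zygote

xs = randn(100)
# The full ForwardDiff Jacobian J is computed once, right here:
y, back = Zygote.pullback(x -> Zygote.forwarddiff(cumprod, x), xs)

seed = zeros(length(y)); seed[1] = 1.0
# Each call of `back` contracts the cached J with the seed
# (no new ForwardDiff pass); for y₁ = x₁ this returns e₁.
row1 = back(seed)[1]
``````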
