Flux/Zygote gradient: understanding “Mutating arrays is not supported”

I am trying to implement CTC loss in Flux/Zygote and am running into `ERROR: Mutating arrays is not supported`. So I cooked up the toy example below to gain a better understanding.

Given an input x, I perform a sort of “triangulation” operation on it, where another matrix a is built from x in a complicated way. The loss is the very last element of a. When I try to differentiate the loss, the first two implementations throw an error.

triangulate1 errors because (I am guessing) of the setindex! operations?
triangulate3 works because it uses matrix operations and no in-place editing of a (the vector)?
Why triangulate2 fails is beyond me… Can you please explain?
What is the best practice when it comes to complicated losses like this that depend on a complex function of the input?
Also, is it explained somewhere what ‘mutating arrays’ means and why it does not work?
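
For concreteness, here is the smallest repro of the error I could construct (my own toy, with made-up names f_mut and f_pure):

```julia
using Zygote

# Mutating version: writes into `y` via setindex! (the y[1] = ... line).
f_mut(x) = (y = zero(x); y[1] = 2 * x[1]; sum(y))

# Equivalent non-mutating version: same value, no array writes.
f_pure(x) = 2 * x[1]

f_mut([3.0, 4.0])                       # 6.0
Zygote.gradient(f_pure, [3.0, 4.0])[1]  # [2.0, 0.0]
# Zygote.gradient(f_mut, [3.0, 4.0])    # ERROR: Mutating arrays is not supported
```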

FYI: I wrote a Python implementation of CTC in Theano, and AD just worked; I did not have to calculate the gradients manually. I am hoping I can pull that off in Julia too. :slightly_smiling_face:

Thank you.

using Zygote
using LinearAlgebra

function triangulate1(x::Matrix{T}) where T
    a = 0*x
    a[1, 1] = x[1, 1]
    for i in 2:size(x, 2)
        for j in 1:i
            a[j, i] = sum(a[1:j, i-1]) * x[j, i]
        end
    end
    a
end

function triangulate2(x::Matrix{T}) where T
    m, n = size(x)
    a = x[:,1] .* [1; zeros(T, m-1)]
    for i in 2:n
        a = accumulate(+, a) .* [x[1:i, i]; zeros(T, m-i)]
    end
    a
end

function triangulate3(x::Matrix{T}) where T
    m, n = size(x)
    a = x[:,1] .* [1; zeros(T, m-1)]
    A = LowerTriangular(fill(1, m, m))
    for i in 2:n
        a = A*a .* [x[1:i, i]; zeros(T, m-i)]
    end
    a
end

x1 = (1:4).*(1:4)'
# 4×4 Array{Int64,2}:
#  1  2   3   4
#  2  4   6   8
#  3  6   9  12
#  4  8  12  16

triangulate1(x1)
# 4×4 Array{Int64,2}:
#  1  2   6    24
#  0  4  36   336
#  0  0  54  1152
#  0  0   0  1536

triangulate2(x1)
# 4-element Array{Int64,1}:
#    24
#   336
#  1152
#  1536

triangulate3(x1)
# 4-element Array{Int64,1}:
#    24
#   336
#  1152
#  1536

gradient(x->triangulate1(x)[end], x1)[1]
# ERROR: Mutating arrays is not supported

gradient(x->triangulate2(x)[end], x1)[1]    
# ERROR: Mutating arrays is not supported
# (This takes much longer time to error out)

gradient(x->triangulate3(x)[end], x1)[1]
# 4×4 Array{Int64,2}:
#  1536  288  32   0
#     0  240  96   0
#     0    0  96   0
#     0    0   0  96

Ideally I want triangulate2 to work, as it seems to be way faster than the other two!
triangulate3, while differentiable, is way too slow :frowning_face:

# Compile for floats
> xf3 = randn(3, 3);  triangulate1(xf3); triangulate2(xf3); triangulate3(xf3);

> xf1k = randn(1000, 1000);
> @time triangulate1(xf1k);
  0.338104 seconds (500.50 k allocations: 1.319 GiB, 16.27% gc time)
> @time triangulate2(xf1k);
  0.004638 seconds (5.02 k allocations: 31.168 MiB)
> @time triangulate3(xf1k);
  8.077139 seconds (7.02 k allocations: 7.481 GiB, 0.80% gc time)

You can do this, but it’s still pretty slow!

function triangulate4(x::AbstractMatrix)
    a = x[1,1]
    for t in 2:size(x,2)
        a = cumsum(vcat(a,0)) .* x[1:t,t]
    end
    a
end

# Zygote.gradient(x->triangulate1(x)[end], x1)
# Zygote.gradient(x->triangulate2(x)[end], x1)
Zygote.gradient(x->triangulate3(x)[end], x1)
Zygote.gradient(x->triangulate4(x)[end], x1)

#===== times =====#

julia> @btime triangulate1(x) setup=(x=randn(1000,1000));
  312.223 ms (500501 allocations: 1.32 GiB)

julia> @btime triangulate2(x) setup=(x=randn(1000,1000));
  4.782 ms (5016 allocations: 31.17 MiB)

julia> @btime triangulate3(x) setup=(x=randn(1000,1000));
  3.202 s (7016 allocations: 7.48 GiB)

julia> @btime triangulate4(x) setup=(x=randn(1000,1000));
  3.086 ms (21898 allocations: 16.48 MiB)

julia> @btime Zygote.gradient(x->triangulate3(x)[end], x) setup=(x=randn(1000,1000));
  15.796 s (48150 allocations: 44.73 GiB)

julia> @btime Zygote.gradient(x->triangulate4(x)[end], x) setup=(x=randn(1000,1000));
  4.376 s (61003 allocations: 14.93 GiB)
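
Incidentally, the accumulate call looks like triangulate2’s specific problem: cumsum has an adjoint in Zygote, while accumulate (as far as I can tell, so treat this as a guess about the internals) falls back to a generic implementation that mutates. Swapping one for the other keeps triangulate2’s structure and speed and makes it differentiable:

```julia
using Zygote

# Same recurrence as triangulate2, with accumulate(+, a) replaced
# by the equivalent cumsum(a); the zero padding is unchanged.
function triangulate2b(x::Matrix{T}) where T
    m, n = size(x)
    a = x[:, 1] .* [one(T); zeros(T, m - 1)]
    for i in 2:n
        a = cumsum(a) .* [x[1:i, i]; zeros(T, m - i)]
    end
    a
end

x1 = float.((1:4) .* (1:4)')
triangulate2b(x1)                                   # [24.0, 336.0, 1152.0, 1536.0]
Zygote.gradient(x -> triangulate2b(x)[end], x1)[1]  # matches the triangulate3 gradient
```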

Thanks. So the execution of triangulate4 is as fast as triangulate2, and it is differentiable.
But the gradient is still slow, so there is a lot of room for improvement here: given the structure of this function, a manually written gradient could be as fast as the original function.
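
Sketching what I mean (my own attempt, with a made-up name; it uses the fact that the pullback of cumsum is reverse ∘ cumsum ∘ reverse, so please double-check before relying on it):

```julia
# Hand-written forward + reverse pass for the triangulate4 recurrence,
# for loss = last element of the final vector. Sketch only.
function triangulate_withgrad(x::AbstractMatrix{T}) where T
    n = size(x, 2)
    as = Vector{Vector{T}}(undef, n)              # saved forward states
    as[1] = [x[1, 1]]
    for t in 2:n
        as[t] = cumsum(vcat(as[t-1], zero(T))) .* x[1:t, t]
    end
    loss = as[n][end]

    x̄ = zero(x)
    ā = [i == n ? one(T) : zero(T) for i in 1:n]  # d(loss)/d(as[n])
    for t in n:-1:2
        b = cumsum(vcat(as[t-1], zero(T)))        # recompute pre-product state
        x̄[1:t, t] .= b .* ā                       # product rule, x side
        b̄ = ā .* x[1:t, t]                        # product rule, cumsum side
        v̄ = reverse(cumsum(reverse(b̄)))           # pullback of cumsum
        ā = v̄[1:t-1]                              # drop the padded zero’s slot
    end
    x̄[1, 1] = ā[1]
    loss, x̄
end

triangulate_withgrad((1:4) .* (1:4)')  # (1536, gradient matching the one above)
```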

I was also looking for general pointers regarding such functions, as this is a toy problem that only mimics my more complex problem.

You may want to have a look at this PR: https://github.com/FluxML/Flux.jl/pull/1287


I am trying to keep it simple and let Zygote do the differentiating instead of writing my own gradient code. I am also trying to understand how RNN-like loopy things are handled by Flux/Zygote.

My Python implementation in Theano is really simple, with no manual gradients. I am trying to get that in Julia too…


Do you happen to have the Julia equivalent of the linked Theano CTC losses, and some benchmark numbers for both? I can’t really see the equivalence between the triangulate* functions above (mutation- and loop-heavy, lots of single-element indexing) and plain_ctc/log_ctc (linalg-heavy, scan instead of loops, mostly slice indexing). My hunch is that a more direct translation of the latter would also be more Zygote-friendly.

I found that the best way to use Zygote is to let it figure out the adjoints for 90% of the operations and manually define the rest. In particular, Zygote fits best with a functional, non-mutating programming style, and adjoints are sometimes much simpler (or just different) than the underlying operation.

Thank you. Good to know about the 90% guideline.
Do you have a simple example where writing the adjoint ourselves makes more sense than relying on Zygote’s capabilities?

E.g. your example: the adjoint is rather simple and has a very regular structure.
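
For instance (a sketch, with made-up names): a hand-rolled mutating cumulative sum that Zygote rejects on its own becomes differentiable with a one-line Zygote.@adjoint, because the pullback of cumsum is just reverse ∘ cumsum ∘ reverse:

```julia
using Zygote

# A mutating cumulative sum: Zygote cannot differentiate the writes to `out`.
function mycumsum(v::AbstractVector)
    out = similar(v)
    s = zero(eltype(v))
    for i in eachindex(v)
        s += v[i]
        out[i] = s
    end
    out
end

# Hand-written adjoint: the pullback of cumsum is reverse ∘ cumsum ∘ reverse.
Zygote.@adjoint mycumsum(v) = mycumsum(v), ȳ -> (reverse(cumsum(reverse(ȳ))),)

Zygote.gradient(v -> mycumsum(v)[end], [1.0, 2.0, 3.0])[1]  # [1.0, 1.0, 1.0]
```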