Flux Zygote Gradient: Understanding Mutating arrays is not supported

I am trying to implement CTC loss in Flux/Zygote. And running into Error: Mutating arrays is not supported. So I cooked up the toy example below to gain better understanding.

Given an input x, I perform some sort of “triangualtion" operation on it, where another matrix a is built based on x in a complicated way. The loss is the very last element of a. When I try to differentiate the loss, the first two implementations throw an error.

triangulate1 errors because (I am guessing) of setindex! operations?
triangulate3 works because of matrix operations, and no in-place editing of a (the vector)?
Why triangulate2 fails is beyond me… Can you please explain.
What is the best practice when it comes to complicated losses like this that depend on a complex function of the input?
Also, is it explained somewhere what ‘mutating arrays’ means and why it does not work?

FYI: I wrote a python implementation of CTC in Theano, and AD just worked, I did not have to manually calculate the gradients. I am hoping I can pull that off in Julia too. :slightly_smiling_face:

Thank you.

using Zygote
using LinearAlgebra

function triangulate1(x::Matrix{T}) where T
    a = 0*x
    a[1, 1] = x[1, 1]
    for i in 2:size(x, 2)
        for j in 1:i
            a[j, i] = sum(a[1:j, i-1]) * x[j, i]

function triangulate2(x::Matrix{T}) where T
    m, n = size(x)
    a = x[:,1] .* [1; zeros(T, m-1)]
    for i in 2:n
        a = accumulate(+, a) .* [x[1:i, i]; zeros(T, m-i)]

function triangulate3(x::Matrix{T}) where T
    m, n = size(x)
    a = x[:,1] .* [1; zeros(T, m-1)]
    A = LowerTriangular(fill(1, m, m))
    for i in 2:n
        a = A*a .* [x[1:i, i]; zeros(T, m-i)]

x1 = (1:4).*(1:4)'
# 4×4 Array{Int64,2}:
#  1  2   3   4
#  2  4   6   8
#  3  6   9  12
#  4  8  12  16

# 4×4 Array{Int64,2}:
#  1  2   6    24
#  0  4  36   336
#  0  0  54  1152
#  0  0   0  1536

# 4-element Array{Int64,1}:
#    24
#   336
#  1152
#  1536

# 4-element Array{Int64,1}:
#    24
#   336
#  1152
#  1536

gradient(x->triangulate1(x)[end], x1)[1]
# ERROR: Mutating arrays is not supported

gradient(x->triangulate2(x)[end], x1)[1]    
# ERROR: Mutating arrays is not supported
# (This takes much longer time to error out)

gradient(x->triangulate3(x)[end], x1)[1]
# 4×4 Array{Int64,2}:
#  1536  288  32   0
#     0  240  96   0
#     0    0  96   0
#     0    0   0  96

Ideally I want triangulate2 to work as it seems to be way faster than the other two!
triangulate3 while being differentiable, is way too slow :frowning_face: :

# Compile for floats
> x3 = randn(3, 3);  triangulate1(xf3); triangulate2(xf3); triangulate3(xf3);

> xf1k = randn(1000, 1000);
> @time triangulate1(xf1k);
  0.338104 seconds (500.50 k allocations: 1.319 GiB, 16.27% gc time)
> @time triangulate2(xf1k);
  0.004638 seconds (5.02 k allocations: 31.168 MiB)
> @time triangulate3(xf1k);
  8.077139 seconds (7.02 k allocations: 7.481 GiB, 0.80% gc time)

You can do this, but it’s still pretty slow!

function triangulate4(x::AbstractMatrix)
    a = x[1,1]
    for t in 2:size(x,2)
        a = cumsum(vcat(a,0)) .* x[1:t,t]

# Zygote.gradient(x->triangulate1(x)[end], x1)
# Zygote.gradient(x->triangulate2(x)[end], x1)
Zygote.gradient(x->triangulate3(x)[end], x1)
Zygote.gradient(x->triangulate4(x)[end], x1)

#===== times =====#

julia> @btime triangulate1(x) setup=(x=randn(1000,1000));
  312.223 ms (500501 allocations: 1.32 GiB)

julia> @btime triangulate2(x) setup=(x=randn(1000,1000));
  4.782 ms (5016 allocations: 31.17 MiB)

julia> @btime triangulate3(x) setup=(x=randn(1000,1000));
  3.202 s (7016 allocations: 7.48 GiB)

julia> @btime triangulate4(x) setup=(x=randn(1000,1000));
  3.086 ms (21898 allocations: 16.48 MiB)

julia> @btime Zygote.gradient(x->triangulate3(x)[end], x) setup=(x=randn(1000,1000));
  15.796 s (48150 allocations: 44.73 GiB)

julia> @btime Zygote.gradient(x->triangulate4(x)[end], x) setup=(x=randn(1000,1000));
  4.376 s (61003 allocations: 14.93 GiB)

Thanks. So the execution of triangulate4 is as fast as triangulate2 and it is differentiable.
But the gradient is still slow. So there is a lot of scope here as a manual differentiation would be as fast as the original function, given the nature of the function here.

I was also looking for general pointers regarding such functions, as this is a toy problem that only mimics my more complex problem.

You may want to have a look at this PR: https://github.com/FluxML/Flux.jl/pull/1287

1 Like

I am trying to make it simpler, make Zygote do the differentiating, instead of writing own gradient code. Also trying to understand how RNN like loopy things are handled by Flux/Zygote.

My python implementaion in theano is really simple, no manual gradients. Trying to get that in Julia too…

1 Like

Do you happen to have that Julia equivalent of the linked theano CTC losses and some benchmark numbers for both? I can’t really see the equivalence between the triangulate* functions above (mutation + loop heavy, lots of single-element indexing) and plain_ctc/log_ctc (linalg heavy, scan instead of loops, mostly slice indexing). My hunch is a more direct translation of the latter would also be more Zygote-friendly.

I found that the best way to use Zygote is let it figure out the adjoints for 90% the operations, and manually define the rest. In particular, Zygote fits best into a function/non-mutating programming style, and adjoints are sometimes much simpler (or just different) than the underlying operation.

Thank you. Good to know about the 90% guideline.
Do you have a simple example where writing adjoint by our-selves makes more sense than using Zygote’s capabilities?

Eg your example: the adjoint is rather simple and has a very regular structure.

Hi. I have been trying to navigate Flux/Zygote and have similar questions and issues. Do you have a very basic example of how to supply an adjoint?
My thinking is that I would like to be able to supply high-level functional derivatives and let Zygote do the rest.

Also, I don’t know how to ‘unlink’ variables in my code from the Zygote differentiation.
‘Copy’ doesn’t work, in the sense that differentiation dependencies seem to be retained by copied variables. It also seems to be the case that the ‘no mutation of arrays’ applies to expressions even if they contain no dependency on variables or parameters, which (presumably) should not affect derivatives at all. I wonder why this needs to be the case for Zygote to work?

Thanks for any clarity.

I am making a short tutorial on how to get around ‘mutating arrays’, in the mean-while I think you are looking for the @ignore macro from Zygote.


@rakeshvar I’m glad to see you back here!
Is there a chance you can provide a simple example of how to use Anyboost.jl?
Or are you no longer maintaining it?

I would recommend the docs,

Thanks for that Tamas, I had seen those but I couldn’t follow. Do you know where in the docs it explains what an adjoint is in the is context? I looked but could not find it clearly defined (maybe I just missed it).

Thanks so much, R, that @ignore macro looks like it will solve lots of my problems.

For an introduction, I would recommend the ChainRules docs:

OK - thanks.

Thank you Albert for your interest. It is very encouraging. The project was supposed to be academic, I am not sure how production stable it will be. But I will upgrade it to Julia 1.x, and write better docs and an example. But it might only be around April 2021.
(I am still at Sadhguru’s ashram and have to sneak away at times to code in Julia, mostly to escape the intense yoga routine that is on 24x7.) :slightly_smiling_face:

1 Like

The docs of both Zygote and ChainRulesCore are a bit confusing and I had to go through them multiple times and filter what is relevant to me (an intermediate user). So I made up an example to learn better. Also, the Zygote api is changing a bit with @ignore, @nograd, etc. being a bit confusing and what not.