Flux Zygote Gradient: Understanding Mutating arrays is not supported

rakeshvar · September 9, 2020, 11:36am

Namaste,
I am trying to implement CTC loss in Flux/Zygote and am running into the error "Mutating arrays is not supported". So I cooked up the toy example below to gain a better understanding.

Given an input x, I perform some sort of "triangulation" operation on it, where another matrix a is built from x in a complicated way. The loss is the very last element of a. When I try to differentiate the loss, the first two implementations throw an error.

triangulate1 errors because (I am guessing) of setindex! operations?
triangulate3 works because of matrix operations, and no in-place editing of a (the vector)?
Why triangulate2 fails is beyond me… Can you please explain.
What is the best practice when it comes to complicated losses like this that depend on a complex function of the input?
Also, is it explained somewhere what ‘mutating arrays’ means and why it does not work?

FYI: I wrote a python implementation of CTC in Theano, and AD just worked, I did not have to manually calculate the gradients. I am hoping I can pull that off in Julia too.

Thank you.
Cash

using Zygote
using LinearAlgebra

function triangulate1(x::Matrix{T}) where T
    a = 0*x
    a[1, 1] = x[1, 1]
    for i in 2:size(x, 2)
        for j in 1:i
            a[j, i] = sum(a[1:j, i-1]) * x[j, i]
        end
    end
    a
end

function triangulate2(x::Matrix{T}) where T
    m, n = size(x)
    a = x[:,1] .* [1; zeros(T, m-1)]
    for i in 2:n
        a = accumulate(+, a) .* [x[1:i, i]; zeros(T, m-i)]
    end
    a
end

function triangulate3(x::Matrix{T}) where T
    m, n = size(x)
    a = x[:,1] .* [1; zeros(T, m-1)]
    A = LowerTriangular(fill(1, m, m))
    for i in 2:n
        a = A*a .* [x[1:i, i]; zeros(T, m-i)]
    end
    a
end

x1 = (1:4).*(1:4)'
# 4×4 Array{Int64,2}:
#  1  2   3   4
#  2  4   6   8
#  3  6   9  12
#  4  8  12  16

triangulate1(x1)
# 4×4 Array{Int64,2}:
#  1  2   6    24
#  0  4  36   336
#  0  0  54  1152
#  0  0   0  1536

triangulate2(x1)
# 4-element Array{Int64,1}:
#    24
#   336
#  1152
#  1536


triangulate3(x1)
# 4-element Array{Int64,1}:
#    24
#   336
#  1152
#  1536


gradient(x->triangulate1(x)[end], x1)[1]
# ERROR: Mutating arrays is not supported

gradient(x->triangulate2(x)[end], x1)[1]    
# ERROR: Mutating arrays is not supported
# (This takes much longer time to error out)

gradient(x->triangulate3(x)[end], x1)[1]
# 4×4 Array{Int64,2}:
#  1536  288  32   0
#     0  240  96   0
#     0    0  96   0
#     0    0   0  96

rakeshvar · September 9, 2020, 11:55am

Ideally I want triangulate2 to work, as it seems to be way faster than the other two!
triangulate3, while differentiable, is way too slow:

# Compile for floats
> xf3 = randn(3, 3);  triangulate1(xf3); triangulate2(xf3); triangulate3(xf3);

> xf1k = randn(1000, 1000);
> @time triangulate1(xf1k);
  0.338104 seconds (500.50 k allocations: 1.319 GiB, 16.27% gc time)
> @time triangulate2(xf1k);
  0.004638 seconds (5.02 k allocations: 31.168 MiB)
> @time triangulate3(xf1k);
  8.077139 seconds (7.02 k allocations: 7.481 GiB, 0.80% gc time)

mcabbott · September 9, 2020, 1:22pm

You can do this, but it’s still pretty slow!

function triangulate4(x::AbstractMatrix)
    a = x[1,1]
    for t in 2:size(x,2)
        a = cumsum(vcat(a,0)) .* x[1:t,t]
    end
    a
end

# Zygote.gradient(x->triangulate1(x)[end], x1)
# Zygote.gradient(x->triangulate2(x)[end], x1)
Zygote.gradient(x->triangulate3(x)[end], x1)
Zygote.gradient(x->triangulate4(x)[end], x1)

#===== times =====#

julia> @btime triangulate1(x) setup=(x=randn(1000,1000));
  312.223 ms (500501 allocations: 1.32 GiB)

julia> @btime triangulate2(x) setup=(x=randn(1000,1000));
  4.782 ms (5016 allocations: 31.17 MiB)

julia> @btime triangulate3(x) setup=(x=randn(1000,1000));
  3.202 s (7016 allocations: 7.48 GiB)

julia> @btime triangulate4(x) setup=(x=randn(1000,1000));
  3.086 ms (21898 allocations: 16.48 MiB)

julia> @btime Zygote.gradient(x->triangulate3(x)[end], x) setup=(x=randn(1000,1000));
  15.796 s (48150 allocations: 44.73 GiB)

julia> @btime Zygote.gradient(x->triangulate4(x)[end], x) setup=(x=randn(1000,1000));
  4.376 s (61003 allocations: 14.93 GiB)

rakeshvar · September 9, 2020, 2:50pm

Thanks. So the execution of triangulate4 is as fast as triangulate2 and it is differentiable.
But the gradient is still slow. So there is a lot of scope here as a manual differentiation would be as fast as the original function, given the nature of the function here.

I was also looking for general pointers regarding such functions, as this is a toy problem that only mimics my more complex problem.
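One general pattern for loops like this, as a minimal sketch: instead of writing into a preallocated matrix with setindex!, rebuild the current column from the previous one and grow the result with hcat. The name `triangulate_full` is made up for illustration, it assumes the matrix has at least as many rows as columns, and it has not been benchmarked or run under Zygote here. Repeated hcat copies the result on every step, so it trades speed for the absence of in-place writes.

```julia
# Hypothetical non-mutating variant of triangulate1 that returns the full matrix.
# Assumes size(x, 1) >= size(x, 2), like the toy example in this thread.
function triangulate_full(x::AbstractMatrix)
    m, n = size(x)
    T = eltype(x)
    col = vcat(x[1, 1], zeros(T, m - 1))      # first column: only a[1, 1] is nonzero
    out = reshape(col, m, 1)
    for t in 2:n
        # running sums of the previous column, masked to the first t rows
        col = cumsum(col) .* vcat(x[1:t, t], zeros(T, m - t))
        out = hcat(out, col)                  # rebinds `out`; no existing array is overwritten
    end
    out
end

x1 = (1:4) .* (1:4)'
triangulate_full(x1)    # same matrix that triangulate1(x1) prints above
```

The loss from the original post is then `triangulate_full(x)[end]`.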

ToucheSir · September 9, 2020, 3:05pm

You may want to have a look at this PR: https://github.com/FluxML/Flux.jl/pull/1287

rakeshvar · September 9, 2020, 3:30pm

I am trying to keep it simple and let Zygote do the differentiating, instead of writing my own gradient code. I am also trying to understand how RNN-like loopy computations are handled by Flux/Zygote.

My Python implementation in Theano is really simple, with no manual gradients. I am trying to get that in Julia too…

ToucheSir · September 9, 2020, 4:20pm

Do you happen to have that Julia equivalent of the linked theano CTC losses and some benchmark numbers for both? I can’t really see the equivalence between the triangulate* functions above (mutation + loop heavy, lots of single-element indexing) and plain_ctc/log_ctc (linalg heavy, scan instead of loops, mostly slice indexing). My hunch is a more direct translation of the latter would also be more Zygote-friendly.

Tamas_Papp · September 10, 2020, 8:31am

I found that the best way to use Zygote is to let it figure out the adjoints for 90% of the operations, and manually define the rest. In particular, Zygote fits best into a functional/non-mutating programming style, and adjoints are sometimes much simpler (or just different) than the underlying operation.

rakeshvar · September 10, 2020, 11:24am

Thank you. Good to know about the 90% guideline.
Do you have a simple example where writing the adjoint ourselves makes more sense than using Zygote’s capabilities?

Tamas_Papp · September 10, 2020, 11:25am

Eg your example: the adjoint is rather simple and has a very regular structure.
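To make the "regular structure" concrete, here is a sketch of a hand-written reverse pass for the loss `triangulate4(x)[end]`. The function name `triangulate_loss_and_grad` is hypothetical, and it assumes at least as many rows as columns. The forward pass stores the running sums for each column; the backward pass walks the columns in reverse, using the fact that the adjoint of cumsum is a reversed cumsum. The in-place writes into `dx` are harmless because this code is never differentiated by Zygote.

```julia
# Hypothetical helper: value and hand-written gradient of triangulate4(x)[end].
# Assumes size(x, 1) >= size(x, 2).
function triangulate_loss_and_grad(x::AbstractMatrix)
    n = size(x, 2)
    a = [x[1, 1]]
    sums = Vector{typeof(a)}(undef, n)     # sums[t] = cumsum(vcat(a, 0)) at step t (sums[1] unused)
    for t in 2:n
        sums[t] = cumsum(vcat(a, 0))
        a = sums[t] .* x[1:t, t]
    end
    loss = a[end]

    dx = zero(x)
    da = zero(a)
    da[end] = 1                            # d loss / d a at the last step
    for t in n:-1:2
        dx[1:t, t] = da .* sums[t]         # from a = sums[t] .* x[1:t, t]
        ds = da .* x[1:t, t]
        # adjoint of cumsum is a reversed cumsum; drop the slot of the appended 0
        da = reverse(cumsum(reverse(ds)))[1:t-1]
    end
    dx[1, 1] = da[1]
    loss, dx
end

loss, dx = triangulate_loss_and_grad((1:4) .* (1:4)')
```

For the 4×4 example, `dx` reproduces the gradient matrix that Zygote returns for triangulate3 in the first post. A function like this could in principle be registered with Zygote as a custom adjoint so that the rest of a model is still differentiated automatically.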

compleat · November 15, 2020, 5:31pm

Hi. I have been trying to navigate Flux/Zygote and have similar questions and issues. Do you have a very basic example of how to supply an adjoint?
My thinking is that I would like to be able to supply high-level functional derivatives and let Zygote do the rest.

Also, I don’t know how to ‘unlink’ variables in my code from the Zygote differentiation.
‘Copy’ doesn’t work, in the sense that differentiation dependencies seem to be retained by copied variables. It also seems to be the case that the ‘no mutation of arrays’ applies to expressions even if they contain no dependency on variables or parameters, which (presumably) should not affect derivatives at all. I wonder why this needs to be the case for Zygote to work?

Thanks for any clarity.

rakeshvar · November 16, 2020, 5:37am

I am making a short tutorial on how to get around ‘mutating arrays’. In the meanwhile, I think you are looking for the @ignore macro from Zygote.
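A minimal sketch of how @ignore can be used, assuming a Zygote version that still exports it (newer code often uses `ChainRulesCore.@ignore_derivatives` for the same purpose). The function `masked_sum` is made up for illustration. The mask is built with in-place writes, but because it sits inside `Zygote.@ignore`, Zygote treats it as a constant and does not try to differentiate through the mutation.

```julia
using Zygote

# Hypothetical example: a constant mask built by mutation inside @ignore.
function masked_sum(x)
    mask = Zygote.@ignore begin
        m = zeros(length(x))
        m[1] = 1.0          # in-place write, hidden from differentiation
        m
    end
    sum(mask .* x)
end

g = Zygote.gradient(masked_sum, [1.0, 2.0, 3.0])[1]
```

Only use this for values whose gradient really should be dropped: anything computed inside the ignored block contributes nothing to the result of `gradient`.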

Albert_Zevelev · November 16, 2020, 5:46am

@rakeshvar I’m glad to see you back here!
Is there a chance you can provide a simple example of how to use Anyboost.jl?
Or are you no longer maintaining it?

Tamas_Papp · November 16, 2020, 7:12am

I would recommend the docs,

https://fluxml.ai/Zygote.jl/dev/adjoints/
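As a very basic illustration of the pattern those docs describe, here is a sketch using `Zygote.@adjoint`. The function `mysquare` is a made-up stand-in. The right-hand side returns the function value together with a pullback: a closure that takes the sensitivity of the output and returns a tuple with one gradient per argument.

```julia
using Zygote

# Hypothetical function that we want to give a hand-written adjoint.
mysquare(x) = x^2

# Forward value plus pullback; ȳ is the sensitivity of the output.
Zygote.@adjoint mysquare(x) = mysquare(x), ȳ -> (2x * ȳ,)

# Zygote differentiates the surrounding code and calls the pullback for mysquare.
g = Zygote.gradient(x -> 3 * mysquare(x), 2.0)   # (12.0,)
```

The same idea scales up: only the awkward part of a loss gets a manual pullback, and Zygote handles everything around it.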

compleat · November 16, 2020, 10:11am

Thanks for that Tamas, I had seen those but I couldn’t follow. Do you know where in the docs it explains what an adjoint is in this context? I looked but could not find it clearly defined (maybe I just missed it).

compleat · November 16, 2020, 10:13am

Thanks so much, R, that @ignore macro looks like it will solve lots of my problems.

Tamas_Papp · November 16, 2020, 10:21am

For an introduction, I would recommend the ChainRules docs:

https://www.juliadiff.org/ChainRulesCore.jl/stable/

compleat · November 16, 2020, 10:38am

OK - thanks.

rakeshvar · November 27, 2020, 8:21am

Thank you Albert for your interest. It is very encouraging. The project was supposed to be academic, I am not sure how production stable it will be. But I will upgrade it to Julia 1.x, and write better docs and an example. But it might only be around April 2021.
(I am still at Sadhguru’s ashram and have to sneak away at times to code in Julia, mostly to escape the intense yoga routine that is on 24x7.)

rakeshvar · November 27, 2020, 8:24am

The docs of both Zygote and ChainRulesCore are a bit confusing, and I had to go through them multiple times and filter out what is relevant to me (an intermediate user). So I made up an example to learn better. Also, the Zygote API is changing a bit, and macros such as @ignore and @nograd add to the confusion.
