Differentiating the Custom Layer and Regularizer

I don’t understand the error “Mutating arrays is not supported”, and I couldn’t make sense of the previous posts on this topic either.

I’m trying to add a very standard weight decay regulariser R(model) to my loss, defined simply as
sum(sqnorm, Flux.params(model))
where sqnorm(x) = sum(abs2, x) just as the tutorial webpages define it.

And when I call gradient(() -> R(model), Flux.params(model)) I get the annoying message above.

But it doesn’t end there!

I also created a custom layer as a struct with positive and negative weights and biases. When I call
gradient(() -> Flux.Losses.logitcrossentropy(mapslices(model, x_train, dims = 1), y_train), Flux.params(model))
I get the same error message. This also happens with a vanilla chain of Dense layers, so while my custom layer may contain its own issues, the problem seems to be even more basic.

I would very much appreciate it if somebody could explain what simple, dumb thing I’m doing here. (It is probably also affecting performance: is it normal for a full-batch loss evaluation on MNIST to take about 5 seconds for a network with layer sizes 28×28, 600, 200, 10?)

This ought to work:

julia> using Flux

julia> sqnorm(x) = sum(abs2, x);

julia> R(model) = sum(sqnorm, Flux.params(model));

julia> model = Chain(Dense(2 => 2, tanh), Dense(2 => 2));

julia> gradient(R, model)
((layers = ((weight = Float32[-1.114341 -0.9922521; 1.7604426 1.4203818], bias = Float32[0.0, 0.0], σ = nothing), (weight = Float32[-2.3257902 -0.59381634; -0.35862422 0.3657079], bias = Float32[0.0, 0.0], σ = nothing)),),)

julia> gr = gradient(() -> R(model), Flux.params(model))
Grads(...)

julia> gr[model[1].weight]  # the same as explicit version
2×2 Matrix{Float32}:
 -1.11434  -0.992252
  1.76044   1.42038

Differentiating through mapslices won’t work right now, sadly. You can try SliceMap.jl; something like that could be built in by now, but nobody has gotten around to it.
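For example, something along these lines (a rough sketch; SliceMap.mapcols applies a function column-by-column and comes with a gradient rule):

using SliceMap   # ] add SliceMap

y = SliceMap.mapcols(model, x_train)   # column-wise apply, differentiable by Zygote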

Thank you very much.

The regularizer worked, but only after updating Julia from 1.5.something to 1.7.3.

And thank you also for pointing me to the SliceMap.mapcols functionality; that works.

But now I have a follow-up question. What I’m doing isn’t crazy; it should be standard. Do people really need extra packages outside of Flux in order to calculate loss functions and their gradients? What is the standard way of applying the model to the data points, which are given as columns of a matrix?

I wouldn’t have asked if calculating the gradient of my medium-sized neural network on full-batch MNIST didn’t take about 50 seconds, which seems far too high. It was similar when I used a vanilla three-layer Dense network instead of my custom one.

Also, may I ask what the problem behind the “Mutating arrays is not supported” error message was? Maybe it will be too hard for me to understand, but for posterity’s sake, could you explain? What in mapslices made that error appear?

If you’re referring to including a regularization penalty in the loss, it is standard and @mcabbott’s snippet above shows that it works. Note also WeightDecay which works much as it does in other frameworks: a more efficient way to add L2 regularization to a network.
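For example, something like this (a sketch; the 1e-4 coefficient is just a placeholder, and the composed optimiser is then passed to train!/update! as usual):

using Flux

# chain WeightDecay with the base optimiser instead of adding a penalty term to the loss
opt = Flux.Optimise.Optimiser(WeightDecay(1e-4), ADAM())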

No. I think the confusion comes from applying the model column by column with mapslices:

I presume you’re familiar with Python frameworks, so this is basically equivalent to:

criterion = nn.CrossEntropyLoss()
outputs = []
for xs in x_train:
  outputs.append(model(xs))
# criterion expects a stacked tensor, not a Python list of per-sample outputs
loss = criterion(outputs, y_train)
loss.backward()

Which likewise does not run. Much like you’d call .backward once per batch in PyTorch, gradient should be called once per batch and not across all batches simultaneously. We try to cover this in Fitting a Line · Flux, so please let us know on the issue tracker if any part of that is unclear.
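In Flux, the per-batch pattern looks roughly like this (a sketch in the implicit-parameters style used earlier in the thread; the batch size and optimiser choice are arbitrary, and depending on your Flux version DataLoader may live under Flux.Data):

using Flux

loader = Flux.DataLoader((x_train, y_train), batchsize = 128, shuffle = true)
ps = Flux.params(model)
opt = ADAM()

for (x, y) in loader                     # one gradient call per batch
    gs = gradient(() -> Flux.Losses.logitcrossentropy(model(x), y), ps)
    Flux.Optimise.update!(opt, ps, gs)
end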

This also likely explains the slow timings you mentioned:

The other part is that the Julia JIT needs some warm-up time to compile everything on the first run of a model. Assuming the loop above is tweaked to work properly with batches as I described earlier, the first-run time should be lower, and any subsequent runs should be pretty much what you’d expect from a Python framework (sometimes slightly faster or slower).
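One generic way to see the compilation overhead (an illustration, not code from the thread) is simply to time the same call twice:

loss(x, y) = Flux.Losses.logitcrossentropy(model(x), y)

@time loss(x_train, y_train)   # first call: includes JIT compilation
@time loss(x_train, y_train)   # later calls: the actual runtime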

For posterity, the ChainRules docs have a nice section on this topic.

If you look at the part of the error message after the “not supported”, it should tell you which function Zygote is complaining about. Most of the time, that is setindex! (i.e. array[i] = ...), push!/pop! or copyto! (which comes up in .= broadcasting).
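As a minimal reproduction (generic, not the code from this thread), writing into a pre-allocated array inside the differentiated function triggers exactly this error:

using Zygote

function double_sum(x)
    y = similar(x)
    y .= 2 .* x          # in-place broadcast (copyto!): Zygote cannot differentiate this mutation
    return sum(y)
end

Zygote.gradient(double_sum, rand(3))   # ERROR: Mutating arrays is not supported ...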

If you then look at the stacktrace, you can roughly see where in the call stack a mutating function is being called. In this case, we know it’s somewhere in the implementation of mapslices. Now, mapslices does not mutate any of its inputs, but since there isn’t an explicit AD rule for Zygote to use on it, Zygote will try to drill down as far as it can until it finds functions that do have AD rules. This usually works fine for most functions (it’s the “automatic” part of automatic differentiation), but if things bottom out in one of the mutating functions mentioned previously, then you’ll receive a mutating-arrays error.


Just to remove any ambiguity about how to run a network over a batch (full or mini) of data in Flux (and most other Julia DL libraries):

using Flux
using Flux.Losses: logitcrossentropy

m = Chain(...)
# dummy input, note that the dimensions are *reversed*
# compared to what you'd use for Python libraries
x = ones(features, batch_size)
outputs = m(x)
# both outputs and targets have shape (output_features = labels) x batch_size
loss = logitcrossentropy(outputs, targets)

So other than the order of dimensions, everything runs fully vectorized and looks very similar to the equivalent Python.
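Concretely, the failing call from the original post can then be written without mapslices, e.g. for one batch (x, y):

gs = gradient(() -> Flux.Losses.logitcrossentropy(model(x), y), Flux.params(model))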


Thank you very much.

Let me also mention why I bothered with mapslices and the like, so that other people don’t make the same mistake.

The problem was with my custom layer. When I called model(x_train) with x_train of size 784 × 60000, it didn’t work: the dimensions didn’t match.

Then I checked the definition of the Dense layer in the Flux source and saw the error of my ways.

I had written my layer so that it could only eat a vector, because it added the bias as a plain vector with +. I should have used .+ instead, broadcasting the bias when necessary and thereby allowing matrices x as inputs.
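As a sketch of what that fix looks like (hypothetical names, not the original layer):

using Flux

struct MyDense{M, V, F}
    W::M
    b::V
    σ::F
end
Flux.@functor MyDense

# (m::MyDense)(x) = m.σ.(m.W * x + m.b)    # broken: + only works when x is a single vector
(m::MyDense)(x) = m.σ.(m.W * x .+ m.b)     # .+ broadcasts the bias across the columns of x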

It also had the added benefit of naturally giving the performance gain once I moved everything to the GPU (which was unexpectedly easy to do, though one hiccup was that DataLoader seemed to output normal arrays even when fed CuArrays).

Once again, thank you.

Glad to hear things worked out!

This should not be happening; do you have an MWE?


You’re right, it didn’t happen after I restarted Julia and ran the code again. My mistake; I must have still had a CPU version of the DataLoader referenced in memory.