Error in Flux when calculating the gradient due to an in-place operation

I’m diving into Flux.jl for my research. My objective is to define a loss function that would measure, so to speak, the convergence of the model towards a uniform data distribution.

I generate K random fictitious observations and compare how many of them are smaller than the true data in the training set. In other words, the model generates K simulations, and we determine the number of simulations where the generated data is smaller than the actual data. If the model is well trained, this distribution should converge to a uniform distribution.

However, since the resulting histogram is not differentiable, I needed to approximate this idea using compositions of continuous differentiable functions.
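The core trick in miniature: replace the hard comparison yᵢ < y with a steep logistic, so the count becomes a smooth function of the samples (a toy sketch; `softcount` and the steepness `s` are illustrative names, not part of the actual code):

```julia
# Toy sketch: a steep logistic stands in for the hard comparison
# yᵢ < y, making the count of smaller samples differentiable.
σ(z) = 1 / (1 + exp(-z))
softcount(yₖ, y; s = 50) = sum(σ((y - yᵢ) * s) for yᵢ in yₖ)

yₖ = [0.1, 0.4, 0.9]
c = softcount(yₖ, 0.6)   # two of the three samples lie below 0.6, so c ≈ 2
```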

η = 0.1; num_epochs = 100; n_samples = 1000
optim = Flux.setup(Flux.Adam(η), model)
losses = []
l = CustomLoss(2)
for epoch in 1:num_epochs
    loss, grads = Flux.withgradient(model) do m
        aₖ = zeros(l.K+1)
        for _ in 1:n_samples
            x = rand(Normal(μ, stddev), l.K)
            yₖ = m(x')
            y = target(rand(Normal(μ, stddev)))
            aₖ += generate_aₖ(l, yₖ, y)
        end
        scalar_diff(l, aₖ ./ sum(aₖ))
    end
    Flux.update!(optim, model, grads[1])
    push!(losses, loss)
end

So, generate_aₖ(l, yₖ, y) would return a vector of size K+1, where the element at index k is close to 1 if y is greater than k elements of the vector yₖ.

Finally, scalar_diff calculates the squared difference between the calculated aₖ and the uniform distribution.

However, this code does not work: I get the following error, apparently caused by the line aₖ += generate_aₖ(l, yₖ, y):

Mutating arrays is not supported -- called copyto!(Vector{Float64}, ...)
This error occurs when you ask Zygote to differentiate operations that change
the elements of arrays in place (e.g. setting values with x .= ...)

I’m not sure how to solve this problem. Does anyone have an idea for a workaround? I need to differentiate with respect to aₖ.
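For the mutation error specifically, the usual workaround is to build the accumulated vector functionally (e.g. with `sum` over a generator) instead of writing into it. A minimal sketch in plain Julia, with `softbin` as a hypothetical stand-in for the per-bin contribution:

```julia
# Hypothetical soft one-hot bin (stand-in for the real per-bin term).
softbin(x, k, K) = [j == k ? x : 0.0 for j in 0:K-1]

# Mutating accumulation: the `.+=` broadcast-assigns in place,
# which is exactly the copyto! that Zygote rejects.
function bins_mutating(x, K)
    a = zeros(K)
    for k in 0:K-1
        a .+= softbin(x, k, K)
    end
    return a
end

# Non-mutating accumulation: `sum` over a generator builds the
# vector in one shot, with no in-place writes for Zygote to see.
bins_functional(x, K) = sum(softbin(x, k, K) for k in 0:K-1)
```

Both versions produce the same vector; only the second is safe to differentiate through with Zygote.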

What do target() and CustomLoss() do? What is the objective of the model?

I think we mostly need to know what generate_aₖ does.

Hello! Target is nothing more than the model I want to learn. In this case, it’s the simple line

line(x; m, b) = m * x + b
m = 3; b = 5
target(x) = line(x; m = m, b = b)

This is obviously a toy example; later it will be replaced by real observations.

CustomLoss is nothing more than a struct holding the hyperparameter K, the number of fictitious samples generated at each iteration. It can be seen as a parameter controlling the required resolution.

I have implemented generate_ak in the following way.

struct CustomLoss
    K::Int
end

function generate_aₖ(loss::CustomLoss, ŷ, y)
    aₖ = zeros(loss.K+1)
    @inbounds for k in 0:loss.K
        aₖ .+= γ(ŷ, y, k, loss.K+1)  # in-place broadcast: the copyto! Zygote complains about
    end
    return aₖ
end

scalar_diff = (loss::CustomLoss, a_k) -> sum((a_k .- 1 / (loss.K + 1)) .^ 2)
jensen_shannon_∇ = (loss::CustomLoss, a_k) -> jensen_shannon_divergence(a_k, fill(1 / (loss.K + 1), 1, loss.K + 1))

function jensen_shannon_divergence(p, q)
    ϵ = 1e-3                  # smoothing to avoid log(0)
    p = p .+ ϵ; q = q .+ ϵ    # rebind instead of mutating the inputs
    return 0.5 * (kldivergence(p, q) + kldivergence(q, p))
end

function sigmoid(ŷ, y)
    return σ.((ŷ-y)*10)
end

function ψₘ(y, m)
    stddev = 0.1
    return exp.((-0.5 .* ((y .- m) ./ stddev) .^ 2))
end

function ϕ(yₖ, yₙ)
    return sum(sigmoid.(yₙ, yₖ))
end

function γ(yₖ, yₙ, m, K)
    eₘ = (n, m) -> [j == m ? 1.0 : 0.0 for j in 0:n-1]
    return eₘ(K, m) * ψₘ(ϕ(yₖ, yₙ), m)
end

As I mentioned, I need to turn the concept of counting (a histogram) into a differentiable operation. First, I use a sigmoid to check whether a fictitious observation (out of the K generated) is smaller than the real observation. This can be repeated for all K observations by simply summing the sigmoids (the ϕ function).

After that, I want differentiable histogram bins. For this it suffices to use a bump function (I tried mollifiers, but in the end, when running this example in PyTorch, I found that a normalized Gaussian worked better) that is close to 1 when the real observation is greater than exactly m of the K fictitious observations, and close to zero otherwise. We have K+1 bump functions, centered at 0, 1, …, K. With the gamma (γ) function, I simply associate that result with vector component m, which represents a histogram bin. Finally, generate_aₖ is nothing more than an iteration over all the vector components to build each of the bins.
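To make that concrete, here is a self-contained run of the functions above (with σ defined locally instead of taking Flux’s, and generate_aₖ taking K directly instead of a CustomLoss): when the real observation is larger than both of the K = 2 fictitious samples, essentially all of the mass lands in the last bin.

```julia
σ(z) = 1 / (1 + exp(-z))                      # logistic; Flux exports this as σ
sigmoid(ŷ, y) = σ.((ŷ .- y) .* 10)            # soft indicator of ŷ > y
ψₘ(y, m) = exp.(-0.5 .* ((y .- m) ./ 0.1) .^ 2)  # Gaussian bump centered at m
ϕ(yₖ, yₙ) = sum(sigmoid(yₙ, yₖ))              # soft count of yₖ entries below yₙ
eₘ(n, m) = [j == m ? 1.0 : 0.0 for j in 0:n-1]   # m-th standard basis vector
γ(yₖ, yₙ, m, K) = eₘ(K, m) * ψₘ(ϕ(yₖ, yₙ), m)
generate_aₖ(K, yₖ, y) = sum(γ(yₖ, y, k, K + 1) for k in 0:K)

yₖ = [1.0, 2.0]                # K = 2 fictitious samples
a = generate_aₖ(2, yₖ, 10.0)   # real y above both fakes → last bin ≈ 1
```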

It’s a bit convoluted, I’m sorry; explaining the idea is not trivial. If you have any questions, please let me know. I’m trying to make this work in Julia because I’ve seen that it can offer a substantial improvement in execution time, and because of the ease of parallelization it offers, which Python lacks.

From what I understand, there are two separate problems:

  • One is mathematical: differentiating through a random sample. Indeed, this requires either reparametrization or REINFORCE gradients.
  • One is practical: differentiating through a mutating function with Zygote.jl. The line I quoted above is presumably the one that breaks Flux.jl.

Thank you very much. Regarding the second issue, do you think it could be solved using Zygote.Buffer? I’m a bit of a beginner with Flux, so I’m not sure.

This should work, as long as the per-bin contributions are summed back into a single vector (γ already returns a (K+1)-vector that is zero outside component k, so a plain comprehension would give a vector of vectors):

generate_aₖ(loss, ŷ, y) = sum(γ(ŷ, y, k, loss.K+1) for k in 0:loss.K)

But it is still not clear to me: if we want to check that the distribution of model(x) is uniform, why not use something like moment matching?
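For reference, a moment-matching loss could look like this (an illustrative sketch, not from the thread): penalize the deviation of the first P raw moments of the samples from those of Uniform(0, 1), using E[Xᵖ] = 1/(p + 1).

```julia
using Statistics

# Illustrative moment-matching loss: squared deviation of the first
# P raw moments of `samples` from those of Uniform(0, 1).
moment_loss(samples, P = 4) =
    sum((mean(samples .^ p) - 1 / (p + 1))^2 for p in 1:P)

u = range(0, 1; length = 1001)   # near-uniform samples → loss ≈ 0
d = fill(0.5, 1001)              # degenerate samples → clearly larger loss
```

This loss is smooth in the samples, so it differentiates cleanly with Zygote, though it only matches moments and does not pin down the full distribution.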