This isn’t an answer to the question you directly posed, but I think a better solution to your underlying problem is to use Enzyme.jl.
Here’s how I’d do this with Enzyme:
using LinearAlgebra, Enzyme
function setup_loglikelihood(X, Y)
    Q, N = size(X)
    D = size(Y, 1)
    # Preallocated buffer for the predictions W * X, reused on every call.
    pred_storage = zeros(D, N)
    function loglikelihood(param)
        # Interpret the flat parameter array as a D×Q weight matrix.
        W = reshape(param, D, Q)
        mul!(pred_storage, W, X)
        -sum(abs2, Y - pred_storage)
    end
    return loglikelihood
end
and then
# Create mock data and a mock parameter array
X = randn(4, 10)  # mock inputs (Q = 4, N = 10)
Y = randn(5, 10)  # mock outputs (D = 5)
W = randn(4, 5)   # mock parameters; any array with D * Q = 20 elements works,
                  # since loglikelihood reshapes it to D×Q internally
logl = setup_loglikelihood(X, Y)
# Create a 'shadow' of W. After autodiff, this will have the property that
# dW[i] == ∂logl(W)/∂W[i]
dW = make_zero(W)
# Create a 'shadow' of logl. This is necessary because logl is a closure that
# stores data which is mutated during differentiation.
dlogl = make_zero(logl)
# Run the autodiff
Enzyme.autodiff(Reverse, Duplicated(logl, dlogl), Duplicated(W, dW))
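One caveat worth noting: as far as I understand Enzyme's semantics, reverse-mode derivatives are accumulated into the shadows rather than overwriting them, so if you call autodiff again you should reset the shadows first. A minimal sketch of that reset (re-running with the same W just reproduces the same gradient):
# Reset the shadows before a second autodiff call, otherwise the new gradient
# gets added on top of the old one.
fill!(dW, 0.0)
dlogl = make_zero(logl)  # rebuild the closure shadow, zeroing its internal buffer too
Enzyme.autodiff(Reverse, Duplicated(logl, dlogl), Duplicated(W, dW))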
Looking at the gradient accumulated into dW, we see:
julia> dW
4×5 Matrix{Float64}:
-20.4806 -1.63579 10.6586 6.69816 1.45061
-11.3634 16.5473 19.6695 -2.8299 -9.35841
26.4333 -0.970109 -19.8093 5.12396 23.1772
-4.74061 -14.9736 -8.79992 13.625 9.38448
This seems to agree with what I find using finite differences.
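For completeness, here's a minimal sketch of that finite-difference check, assuming FiniteDifferences.jl is installed (the package choice and the names below are mine, not part of the setup above):
using FiniteDifferences
# Gradient of logl at W via a 5th-order central finite-difference rule.
fd_dW = FiniteDifferences.grad(central_fdm(5, 1), logl, W)[1]
# Compare against Enzyme's result; this should be very close to zero.
maximum(abs, fd_dW - dW)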