# Constructing a diagonal 2D CuArray while preserving Zygote's gradient functionality on the diagonal

I am a beginner with `CUDA` and `Zygote`, and am currently porting my code from a CPU to a GPU version.

In my calculation, I need to place a 1D CuArray on the diagonal of a 2D CuArray and then do some matrix multiplications with it. I assume performance will be better if I avoid converting back to a CPU array. So far, the only solution I have come up with is to represent the 2D CuArray as a sparse CuArray (frankly, I don't actually need sparse arrays; I simply failed to find another approach, e.g. with `diagm`, that avoids both scalar-indexing errors and Zygote-related errors).

``````julia
using Zygote
using CUDA
using LinearAlgebra

function ADdiag()
    function cudiag(A)
        N = length(A)
        u = cu(collect(1:N))
        nzval = u
        colptr = u
        dims = (N, N)
        # build a sparse CuArray whose entries come from the input vector `A`
        A = CUSPARSE.CuSparseMatrixCSC{Float64,Int32}(nzval, colptr, A, dims)
        return norm(A)
    end
    f(x) = cudiag(x)
    g(x) = gradient(f, x)[1]
    initval = cu([1, 2, 3])
    println("value: $(f(initval))")
    println("gradient: $(g(initval))")
end

ADdiag()
``````

I can obtain the target value from this code, but computing the gradient with respect to the inputs fails:

Error message:

``````
ERROR: LoadError: `llvmcall` must be compiled to be called
Stacktrace:
[1] macro expansion
@ D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:101 [inlined]
[2] _pullback(::Zygote.Context{false}, ::Core.IntrinsicFunction, ::String, ::Type{Int64}, ::Type{Tuple{Ptr{Int64}}}, ::Ptr{Int64})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:101
[3] _pullback
@ .\atomics.jl:358 [inlined]
[4] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\lib\utils\threading.jl:27 [inlined]
[5] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\lib\utils\memoization.jl:69 [inlined]
[6] _pullback(ctx::Zygote.Context{false}, f::typeof(CUDA.stream_ordered), args::CuDevice)
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[7] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:432 [inlined]
[8] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:427 [inlined]
[9] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:417 [inlined]
[10] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:411 [inlined]
[11] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\array.jl:74 [inlined]
[12] _pullback(::Zygote.Context{false}, ::Type{CuArray{Int32, 1, CUDA.Mem.DeviceBuffer}}, ::UndefInitializer, ::Tuple{Int64})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[13] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\array.jl:415 [inlined]
[14] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\array.jl:423 [inlined]
[15] _pullback
@ D:\Julia\depot\packages\GPUArrays\dAUOE\src\host\construction.jl:4 [inlined]
[16] _pullback(::Zygote.Context{false}, ::typeof(convert), ::Type{CuArray{Int32, 1}}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[17] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\lib\cusparse\array.jl:45 [inlined]
[18] _pullback(::Zygote.Context{false}, ::Type{CUDA.CUSPARSE.CuSparseMatrixCSC{Float64, Int32}}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Tuple{Int64, Int64})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[19] _pullback
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:13 [inlined]
[20] _pullback(ctx::Zygote.Context{false}, f::var"#cudiag#3", args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[21] _pullback
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:16 [inlined]
[22] _pullback(ctx::Zygote.Context{false}, f::var"#f#4"{var"#cudiag#3"}, args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[23] pullback(f::Function, cx::Zygote.Context{false}, args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:90
[24] pullback
@ D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:88 [inlined]
[25] gradient(f::Function, args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:147
[26] g
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:17 [inlined]
[27] ADdiag()
@ Main D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:20
[28] top-level scope
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:23
in expression starting at D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:23
``````

It seems that I hit this `llvmcall` error whenever backpropagation has to pass through the construction of a CuArray.

In practice, this `[1,2,3]` does not come out of nowhere; it is the result of other calculations, so I have to keep track of the gradients on the diagonal entries / the 1D CuArray, and I guess marking the construction as `nograd` is therefore not a solution.

In the end, my question has two parts:

1. What is a way to construct a diagonal matrix while preserving the ability to differentiate through it?
2. Is it correct to say that `Zygote` fails on every construction of a CuArray inside the computation? If not, what is the correct way to construct one?

@amontoison this might be one for you


I suspect that we need a Zygote.jl extension in CUDA.jl to support that.
Enzyme.jl is probably better suited for AD on GPUs.
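That said, for the specific case of a diagonal matrix, the lazy `Diagonal` wrapper from the LinearAlgebra standard library may already sidestep the problem, since it allocates no new array for backpropagation to pass through. A minimal CPU sketch (an assumption on my part, not a tested GPU solution — the idea is that the same code should work with `cu` arrays, because `Diagonal`-matrix products lower to broadcasting):

``````julia
using LinearAlgebra

# `Diagonal(x)` is a lazy wrapper: it stores only the vector `x` and
# allocates no new (Cu)Array, so no array constructor appears on the
# backward pass. Multiplying by a Diagonal scales rows/columns, which
# is implemented via broadcasting and hence GPU-friendly.
function diagnorm(x, M)
    return norm(Diagonal(x) * M)
end

x = [1.0, 2.0, 3.0]
M = ones(3, 3)

# `Diagonal(x) * M` scales row i of M by x[i], the same as `x .* M`
@assert Diagonal(x) * M == x .* M
@assert diagnorm(x, M) ≈ norm(x .* M)
``````

Since `Diagonal(x)` merely views `x`, `Zygote.gradient(x -> diagnorm(x, M), x)` only needs to differentiate through `norm` and a broadcast, not through a CuArray constructor; whether Zygote's rules cover every such case on the GPU I have not verified.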


> do some matrix multiplication stuff with that

Update: I have figured out a workaround for the diagonal-matrix multiplication using element-wise products of CuArrays,

``````julia
result = Mat .* transpose(Diag_vector)
``````

or

``````julia
result = Diag_vector .* Mat
``````

which worked nicely with `Zygote`.
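For reference, the two broadcasts correspond exactly to right- and left-multiplication by the diagonal matrix, which is easy to check on the CPU (the same identities should hold for CuArrays, since both sides reduce to a broadcast):

``````julia
using LinearAlgebra

d = [1.0, 2.0, 3.0]          # the diagonal entries
Mat = [1.0 2.0 3.0;
       4.0 5.0 6.0;
       7.0 8.0 9.0]

# Mat * Diagonal(d) scales column j of Mat by d[j] ...
@assert Mat .* transpose(d) == Mat * Diagonal(d)
# ... while Diagonal(d) * Mat scales row i of Mat by d[i]
@assert d .* Mat == Diagonal(d) * Mat
``````

So the broadcast form computes the same product as the explicit diagonal matrix, without ever materializing an N×N array inside the differentiated code.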
