# Constructing a diagonal 2D CuArray while preserving Zygote's gradient functionality on the diagonal

I am a beginner with `CUDA` and `Zygote`, and am currently porting my code from a CPU to a GPU version.

In my calculation, I need to place a 1D CuArray on the diagonal of a 2D CuArray and then do some matrix multiplications with it. I assume performance will be better if I avoid converting back to a CPU array. So far, the only solution I have come up with is to represent the 2D CuArray as a sparse CuArray (frankly, I don't actually need sparse arrays; I simply failed to find another approach, e.g. with `diagm`, that avoids both scalar-indexing errors and Zygote-related errors).

``````julia
using Zygote
using CUDA
using LinearAlgebra

function ADdiag()
    function cudiag(A)
        N = length(A)
        u = cu(collect(1:N))
        nzval = u
        colptr = u
        dims = (N, N)
        # build a sparse CuArray whose entries come from the input vector `A`
        A = CUSPARSE.CuSparseMatrixCSC{Float64,Int32}(nzval, colptr, A, dims)
        return norm(A)
    end
    f(x) = cudiag(x)
    g(x) = gradient(f, x)[1]
    initval = cu([1, 2, 3])
    println("value: $(f(initval))")
    println("gradient: $(g(initval))")
end

ADdiag()
``````

I can obtain the target value from this code, but computing the gradient with respect to the inputs fails:

Error message:

``````
ERROR: LoadError: `llvmcall` must be compiled to be called
Stacktrace:
[1] macro expansion
@ D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:101 [inlined]
[2] _pullback(::Zygote.Context{false}, ::Core.IntrinsicFunction, ::String, ::Type{Int64}, ::Type{Tuple{Ptr{Int64}}}, ::Ptr{Int64})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:101
[3] _pullback
@ .\atomics.jl:358 [inlined]
[4] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\lib\utils\threading.jl:27 [inlined]
[5] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\lib\utils\memoization.jl:69 [inlined]
[6] _pullback(ctx::Zygote.Context{false}, f::typeof(CUDA.stream_ordered), args::CuDevice)
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[7] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:432 [inlined]
[8] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:427 [inlined]
[9] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:417 [inlined]
[10] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:411 [inlined]
[11] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\array.jl:74 [inlined]
[12] _pullback(::Zygote.Context{false}, ::Type{CuArray{Int32, 1, CUDA.Mem.DeviceBuffer}}, ::UndefInitializer, ::Tuple{Int64})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[13] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\array.jl:415 [inlined]
[14] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\array.jl:423 [inlined]
[15] _pullback
@ D:\Julia\depot\packages\GPUArrays\dAUOE\src\host\construction.jl:4 [inlined]
[16] _pullback(::Zygote.Context{false}, ::typeof(convert), ::Type{CuArray{Int32, 1}}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[17] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\lib\cusparse\array.jl:45 [inlined]
[18] _pullback(::Zygote.Context{false}, ::Type{CUDA.CUSPARSE.CuSparseMatrixCSC{Float64, Int32}}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Tuple{Int64, Int64})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[19] _pullback
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:13 [inlined]
[20] _pullback(ctx::Zygote.Context{false}, f::var"#cudiag#3", args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[21] _pullback
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:16 [inlined]
[22] _pullback(ctx::Zygote.Context{false}, f::var"#f#4"{var"#cudiag#3"}, args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[23] pullback(f::Function, cx::Zygote.Context{false}, args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:90
[24] pullback
@ D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:88 [inlined]
[25] gradient(f::Function, args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:147
[26] g
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:17 [inlined]
[27] ADdiag()
@ Main D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:20
[28] top-level scope
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:23
in expression starting at D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:23
``````

It seems that I hit this `llvmcall` error whenever backpropagation has to pass through the construction of a CuArray.

In practice, this `[1,2,3]` does not come out of nowhere; it is the result of other calculations, so I have to keep track of the gradients on the diagonal entries / the 1D CuArray, and I guess marking the construction as `nograd` is therefore not a solution.

In the end, my question has two parts:

1. What is a way to construct a diagonal matrix while preserving the ability to differentiate through it?
2. Is it correct to say that `Zygote` fails on every construction of a CuArray inside the computation? If not, what is the correct way to construct one?

@amontoison this might be one for you


I suspect that we need a Zygote.jl extension in CUDA.jl to support that.
Enzyme.jl is probably better suited for AD on GPUs.
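That said, for the specific case of a diagonal matrix, the lazy `Diagonal` wrapper from the LinearAlgebra standard library may already sidestep the problem, since it allocates no new array for backpropagation to pass through. A minimal CPU sketch (an assumption on my part, not a tested GPU solution — the idea is that the same code should work with `cu` arrays, because `Diagonal`-matrix products lower to broadcasting):

``````julia
using LinearAlgebra

# `Diagonal(x)` is a lazy wrapper: it stores only the vector `x` and
# allocates no new (Cu)Array, so no array constructor appears on the
# backward pass. Multiplying by a Diagonal scales rows/columns, which
# is implemented via broadcasting and hence GPU-friendly.
function diagnorm(x, M)
    return norm(Diagonal(x) * M)
end

x = [1.0, 2.0, 3.0]
M = ones(3, 3)

# `Diagonal(x) * M` scales row i of M by x[i], the same as `x .* M`
@assert Diagonal(x) * M == x .* M
@assert diagnorm(x, M) ≈ norm(x .* M)
``````

Since `Diagonal(x)` merely views `x`, `Zygote.gradient(x -> diagnorm(x, M), x)` only needs to differentiate through `norm` and a broadcast, not through a CuArray constructor; whether Zygote's rules cover every such case on the GPU I have not verified.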


> do some matrix multiplication stuff with that

Update: I have figured out a workaround for the diagonal-matrix multiplication using element-wise products of CuArrays,

``````julia
result = Mat .* transpose(Diag_vector)
``````

or

``````julia
result = Diag_vector .* Mat
``````

which worked nicely with `Zygote`.
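For reference, the two broadcasts correspond exactly to right- and left-multiplication by the diagonal matrix, which is easy to check on the CPU (the same identities should hold for CuArrays, since both sides reduce to a broadcast):

``````julia
using LinearAlgebra

d = [1.0, 2.0, 3.0]          # the diagonal entries
Mat = [1.0 2.0 3.0;
       4.0 5.0 6.0;
       7.0 8.0 9.0]

# Mat * Diagonal(d) scales column j of Mat by d[j] ...
@assert Mat .* transpose(d) == Mat * Diagonal(d)
# ... while Diagonal(d) * Mat scales row i of Mat by d[i]
@assert d .* Mat == Diagonal(d) * Mat
``````

So the broadcast form computes the same product as the explicit diagonal matrix, without ever materializing an N×N array inside the differentiated code.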
