# Constructing a diagonal 2D CuArray while preserving Zygote's gradient functionality on the diagonal

I am a beginner with `CUDA` and `Zygote`, and am currently porting my code from a CPU to a GPU version.

In my calculation, I need to place a 1D CuArray on the diagonal of a 2D CuArray and then do some matrix multiplications with it. I assume performance will be better if I avoid converting back to a CPU array. So far, the only solution I have come up with is to represent the 2D CuArray as a sparse CuArray (frankly, I don't actually need sparse arrays; I simply failed to find another approach, e.g. with `diagm`, that avoids both scalar-indexing errors and Zygote-related errors).

``````julia
using Zygote
using CUDA
using LinearAlgebra

function ADdiag()
    function cudiag(A)
        N = length(A)
        u = cu(collect(1:N))
        nzval = u
        colptr = u
        dims = (N, N)
        # build a sparse CuArray whose entries come from the input vector `A`
        A = CUSPARSE.CuSparseMatrixCSC{Float64,Int32}(nzval, colptr, A, dims)
        return norm(A)
    end
    f(x) = cudiag(x)
    g(x) = gradient(f, x)[1]
    initval = cu([1, 2, 3])
    println("value: $(f(initval))")
    println("gradient: $(g(initval))")
end

ADdiag()
``````

I can obtain the target value from this code, but computing the gradient with respect to the inputs fails:

Error message:

``````
ERROR: LoadError: `llvmcall` must be compiled to be called
Stacktrace:
[1] macro expansion
@ D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:101 [inlined]
[2] _pullback(::Zygote.Context{false}, ::Core.IntrinsicFunction, ::String, ::Type{Int64}, ::Type{Tuple{Ptr{Int64}}}, ::Ptr{Int64})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:101
[3] _pullback
@ .\atomics.jl:358 [inlined]
[4] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\lib\utils\threading.jl:27 [inlined]
[5] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\lib\utils\memoization.jl:69 [inlined]
[6] _pullback(ctx::Zygote.Context{false}, f::typeof(CUDA.stream_ordered), args::CuDevice)
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[7] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:432 [inlined]
[8] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:427 [inlined]
[9] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:417 [inlined]
[10] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\pool.jl:411 [inlined]
[11] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\array.jl:74 [inlined]
[12] _pullback(::Zygote.Context{false}, ::Type{CuArray{Int32, 1, CUDA.Mem.DeviceBuffer}}, ::UndefInitializer, ::Tuple{Int64})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[13] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\array.jl:415 [inlined]
[14] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\src\array.jl:423 [inlined]
[15] _pullback
@ D:\Julia\depot\packages\GPUArrays\dAUOE\src\host\construction.jl:4 [inlined]
[16] _pullback(::Zygote.Context{false}, ::typeof(convert), ::Type{CuArray{Int32, 1}}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[17] _pullback
@ D:\Julia\depot\packages\CUDA\rXson\lib\cusparse\array.jl:45 [inlined]
[18] _pullback(::Zygote.Context{false}, ::Type{CUDA.CUSPARSE.CuSparseMatrixCSC{Float64, Int32}}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Tuple{Int64, Int64})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[19] _pullback
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:13 [inlined]
[20] _pullback(ctx::Zygote.Context{false}, f::var"#cudiag#3", args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[21] _pullback
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:16 [inlined]
[22] _pullback(ctx::Zygote.Context{false}, f::var"#f#4"{var"#cudiag#3"}, args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface2.jl:0
[23] pullback(f::Function, cx::Zygote.Context{false}, args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:90
[24] pullback
@ D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:88 [inlined]
[25] gradient(f::Function, args::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ Zygote D:\Julia\depot\packages\Zygote\jxHJc\src\compiler\interface.jl:147
[26] g
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:17 [inlined]
[27] ADdiag()
@ Main D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:20
[28] top-level scope
@ D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:23
in expression starting at D:\MagBEC\juliatest\t_adjulia\cuTensorAD4.jl:23
``````

It seems that I hit this `llvmcall` error whenever backpropagation has to pass through the construction of a CuArray.

In practice, this `[1,2,3]` does not come out of nowhere; it is the result of other calculations, so I have to keep track of the gradients on the diagonal entries / the 1D CuArray, and I guess marking the construction as `nograd` is therefore not a solution.

In the end, my question has two parts:

1. What is a way to construct a diagonal matrix while preserving the ability to differentiate through it?
2. Is it correct to say that `Zygote` fails on every construction of a CuArray inside the computation? If not, what is the correct way to construct one?

@amontoison this might be one for you


I suspect that we need a Zygote.jl extension in CUDA.jl to support that.
Enzyme.jl is probably better suited for AD on GPUs.
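That said, for the specific case of a diagonal matrix, the lazy `Diagonal` wrapper from the LinearAlgebra standard library may already sidestep the problem, since it allocates no new array for backpropagation to pass through. A minimal CPU sketch (an assumption on my part, not a tested GPU solution — the idea is that the same code should work with `cu` arrays, because `Diagonal`-matrix products lower to broadcasting):

``````julia
using LinearAlgebra

# `Diagonal(x)` is a lazy wrapper: it stores only the vector `x` and
# allocates no new (Cu)Array, so no array constructor appears on the
# backward pass. Multiplying by a Diagonal scales rows/columns, which
# is implemented via broadcasting and hence GPU-friendly.
function diagnorm(x, M)
    return norm(Diagonal(x) * M)
end

x = [1.0, 2.0, 3.0]
M = ones(3, 3)

# `Diagonal(x) * M` scales row i of M by x[i], the same as `x .* M`
@assert Diagonal(x) * M == x .* M
@assert diagnorm(x, M) ≈ norm(x .* M)
``````

Since `Diagonal(x)` merely views `x`, `Zygote.gradient(x -> diagnorm(x, M), x)` only needs to differentiate through `norm` and a broadcast, not through a CuArray constructor; whether Zygote's rules cover every such case on the GPU I have not verified.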


> do some matrix multiplication stuff with that

Update: I have figured out a workaround for the diagonal-matrix multiplication using element-wise products of CuArrays,

``````julia
result = Mat .* transpose(Diag_vector)
``````

or

``````julia
result = Diag_vector .* Mat
``````

which worked nicely with `Zygote`.
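For reference, the two broadcasts correspond exactly to right- and left-multiplication by the diagonal matrix, which is easy to check on the CPU (the same identities should hold for CuArrays, since both sides reduce to a broadcast):

``````julia
using LinearAlgebra

d = [1.0, 2.0, 3.0]          # the diagonal entries
Mat = [1.0 2.0 3.0;
       4.0 5.0 6.0;
       7.0 8.0 9.0]

# Mat * Diagonal(d) scales column j of Mat by d[j] ...
@assert Mat .* transpose(d) == Mat * Diagonal(d)
# ... while Diagonal(d) * Mat scales row i of Mat by d[i]
@assert d .* Mat == Diagonal(d) * Mat
``````

So the broadcast form computes the same product as the explicit diagonal matrix, without ever materializing an N×N array inside the differentiated code.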
