Enzyme CUDA dynamic shared memory

Hello, I am trying to use Enzyme with dynamic shared memory, but I get an error that is probably related to the fact that I am instantiating the dynamic shared memory outside device code. However, if I do not do it like that, how do I tell Enzyme to allocate shadow memory for the dynamically allocated shared memory?

code:

using CUDA, Enzyme, Test

function mul_kernel(A, shared)
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
        A[i] = shared[i]
    end
    return nothing
end

function grad_mul_kernel(A, dA,sh,d_sh)
    Enzyme.autodiff_deferred(mul_kernel, Const, Duplicated(A, dA), Duplicated(sh, d_sh))
    return nothing
end

A = CUDA.ones(64,)
@cuda threads=length(A) shmem=64*4 mul_kernel(A,CuDynamicSharedArray(Float32, 64) )
A = CUDA.ones(64,)
dA = similar(A)
dA .= 1
@cuda threads=length(A) shmem=64*4 grad_mul_kernel(A, dA,CuDynamicSharedArray(Float32, 64),CuDynamicSharedArray(Float32, 64))
@test all(dA .== 2)

error

error: <inline asm>:1:16: invalid register name
        mov.u32 %edx, %dynamic_smem_size;

Why not create the dynamic shared memory for the derivative on the host as well, then pass them both into the grad kernel?


Thanks for the response! However, even simple code like the snippet below, which does not yet compute a derivative, does not work, so instantiating the shared memory outside the kernel definition seems to be the problem. Hence I do not completely follow your suggestion; could you elaborate on what you would modify in the example above?

shmem_a=CuDynamicSharedArray(Float32, 64)
@cuda threads=length(A) shmem=64*4 mul_kernel(A,shmem_a )
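For comparison, here is a minimal sketch of the same kernel with the shared array created inside device code. It assumes that CuDynamicSharedArray is only meaningful inside a kernel, since it reads the PTX register %dynamic_smem_size named in the error message, so the host only passes the byte count through shmem. The input value 3 is just an illustrative choice.

```julia
using CUDA, Test

# Same computation as mul_kernel above, but the dynamic shared memory
# is obtained inside the kernel instead of being passed from the host.
function mul_kernel_inner(A)
    shared = CuDynamicSharedArray(Float32, length(A))
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
        A[i] = shared[i]
    end
    return nothing
end

A = CUDA.fill(3f0, 64)
# shmem is the number of bytes of dynamic shared memory per block
@cuda threads=length(A) shmem=length(A)*sizeof(Float32) mul_kernel_inner(A)
@test all(Array(A) .== 9f0)
```

The name mul_kernel_inner is hypothetical and only used to keep it apart from the original mul_kernel.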

Not quite, I'm thinking something like this (on mobile, untested):

@cuda threads=length(A) shmem=64*4 gradkernel(A, dA, CuDynamicSharedArray(Float32, 64), CuDynamicSharedArray(Float32, 64))

I modified the code in the first post, here at the top, to include the suggestion. The same problem still exists even at the level of mul_kernel, which is a simple kernel execution before Enzyme is involved at all.
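One possible way around this, as an untested sketch: create both the primal and the shadow shared arrays inside the gradient kernel and hand them to Enzyme from there. This assumes the optional third argument of CuDynamicSharedArray is a byte offset into the dynamic shared memory, and that Enzyme accepts a shared-memory device array as Duplicated. The autodiff_deferred call is copied from the first post; newer Enzyme releases expect a mode argument such as Reverse (and possibly Const(mul_kernel)), so check the signature for your version.

```julia
using CUDA, Enzyme, Test

function mul_kernel(A, shared)
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
        A[i] = shared[i]
    end
    return nothing
end

function grad_mul_kernel(A, dA)
    n = length(A)
    # Primal buffer: first n Float32 values of the dynamic shared memory
    sh = CuDynamicSharedArray(Float32, n)
    # Shadow buffer: placed right after the primal one (offset in bytes)
    d_sh = CuDynamicSharedArray(Float32, n, n * sizeof(Float32))
    i = threadIdx().x
    if i <= n
        d_sh[i] = 0f0  # shared memory is uninitialized, so zero the shadow
    end
    # Call form taken from the first post; adapt to your Enzyme version
    Enzyme.autodiff_deferred(mul_kernel, Const, Duplicated(A, dA), Duplicated(sh, d_sh))
    return nothing
end

A = CUDA.ones(64)
dA = similar(A)
dA .= 1
# Request room for both buffers: 2 * 64 * sizeof(Float32) bytes
@cuda threads=length(A) shmem=2*length(A)*sizeof(Float32) grad_mul_kernel(A, dA)
@test all(Array(dA) .== 2)  # same check as in the first post
```

With this layout the host never touches CuDynamicSharedArray; it only has to request enough bytes for both the primal and the shadow buffer.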