Enzyme CUDA dynamic memory

Hello, I am trying to use Enzyme with dynamic shared memory, but I get an error that is probably related to the fact that I am instantiating the dynamic shared memory outside of device code. However, if I am not supposed to use it like that, how do I tell Enzyme how it should allocate shadow memory for the dynamically allocated shared memory?

code:

using CUDA, Enzyme, Test

function mul_kernel(A, shared)
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
        A[i] = shared[i]
    end
    return nothing
end

function grad_mul_kernel(A, dA, sh, d_sh)
    Enzyme.autodiff_deferred(mul_kernel, Const, Duplicated(A, dA), Duplicated(sh, d_sh))
    return nothing
end

A = CUDA.ones(64,)
@cuda threads=length(A) shmem=64*4 mul_kernel(A, CuDynamicSharedArray(Float32, 64))
A = CUDA.ones(64,)
dA = similar(A)
dA .= 1
@cuda threads=length(A) shmem=64*4 grad_mul_kernel(A, dA, CuDynamicSharedArray(Float32, 64), CuDynamicSharedArray(Float32, 64))
@test all(dA .== 2)

error:

error: <inline asm>:1:16: invalid register name
        mov.u32 %edx, %dynamic_smem_size;

Why not create the dynamic shared memory for the derivative on the host as well then pass them both into grad kernel


Thanks for the response! However, even simple code like the snippet below, which does not yet compute a derivative, does not work - so instantiating the shared memory outside the kernel definition generally seems to be problematic. Hence I do not completely follow your suggestion; could you elaborate on what you would modify in the example above?

shmem_a = CuDynamicSharedArray(Float32, 64)
@cuda threads=length(A) shmem=64*4 mul_kernel(A, shmem_a)

Not quite, I’m thinking something like (on mobile, untested):

@cuda threads=length(A) shmem=64*4 gradkernel(A, dA, CuDynamicSharedArray(Float32, 64), CuDynamicSharedArray(Float32, 64))

I modified the code in the first post at the top to include the suggestion; the same problem still exists even at the level of mul_kernel - a plain kernel launch, before Enzyme is involved at all.

Have you tried creating the shared memory inside the kernel as it is done here?

Thanks @maxfreu, I tried:

using CUDA, Enzyme, Test

function mul_kernel(A)
    shared=CuDynamicSharedArray(Float32, length(A))
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
        A[i] = shared[i]
    end
    return nothing
end

function grad_mul_kernel(A, dA)
    Enzyme.autodiff_deferred(mul_kernel, Const, Duplicated(A, dA))
    return nothing
end

A = CUDA.ones(64,)
@cuda threads=length(A) shmem=64*4 mul_kernel(A)
A = CUDA.ones(64,)
dA = similar(A)
dA .= 1
@cuda threads=length(A) shmem=64*4 grad_mul_kernel(A, dA)
@test all(dA .== 2)

and got

ERROR: GPUCompiler.InvalidIRError(GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(MethodInstance for grad_mul_kernel(::CuDeviceVector{Float32, 1}, ::CuDeviceVector{Float32, 1}), GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(GPUCompiler.PTXCompilerTarget(v"8.6.0", v"7.5.0", true, nothing, nothing, nothing, nothing, false, nothing, nothing), CUDA.CUDACompilerParams(v"8.6.0", v"8.2.0"), true, nothing, :specfunc, false, 2), 0x0000000000007b37), Tuple{String, Vector{Base.StackTraces.StackFrame}, Any}[("dynamic function invocation", [grad_mul_kernel at get_lin_synth_dat.jl:143], EnzymeCore.autodiff_deferred)])
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/validation.jl:147
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:460 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/Lw5SP/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:459 [inlined]
  [5] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/utils.jl:103
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/utils.jl:97 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:136
  [8] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:111
  [9] compile
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:103 [inlined]
 [10] #1145
    @ ~/.julia/packages/CUDA/75aiI/src/compiler/compilation.jl:254 [inlined]
 [11] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{…}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:52
 [12] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:42
 [13] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/75aiI/src/compiler/compilation.jl:253
 [14] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:128
 [15] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:103
 [16] macro expansion
    @ ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:369 [inlined]
 [17] macro expansion
    @ ./lock.jl:267 [inlined]
 [18] cufunction(f::typeof(grad_mul_kernel), tt::Type{Tuple{CuDeviceVector{…}, CuDeviceVector{…}}}; kwargs::@Kwargs{})
    @ CUDA ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:364
 [19] cufunction(f::typeof(grad_mul_kernel), tt::Type{Tuple{CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}}})
    @ CUDA ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:361
 [20] top-level scope
    @ ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:112
Some type information was truncated. Use `show(err)` to see complete types.

So now it works with mul_kernel, but not for the gradient, right? Unfortunately, I don’t know any more than this. Looks like a bug in Enzyme to me :-/


Not sure why CUDA.jl doesn’t give nicer error messages here, but the Enzyme API needs the mode (e.g. Reverse) as the first argument.
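
For the in-kernel shared memory version above, that would look roughly like this (untested sketch; the exact autodiff_deferred signature may vary between Enzyme.jl versions):

function grad_mul_kernel(A, dA)
    # the mode (Reverse) goes first, before the function being differentiated
    Enzyme.autodiff_deferred(Enzyme.Reverse, mul_kernel, Const, Duplicated(A, dA))
    return nothing
end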


Yes, that was an issue. However, when I added Enzyme.Reverse, compilation now hangs for a couple of hours. Still, the issue is low priority for me at the moment, and I do not want to take you away from other tasks to look into this.

Enzyme.autodiff_deferred(Enzyme.Reverse, mul_kernel, Const, Duplicated(A, dA))

@vchuravy could this be related to the deadlock issue you were looking into earlier for Michel?


No, that only occurred with execution on the CPU, and this seems to be GPU-only?

Yes, GPU only.