Enzyme CUDA dynamic memory

Hello, I am trying to use Enzyme with dynamic shared memory, but I get an error that is probably related to the fact that I am instantiating the dynamic shared memory outside of device code. However, if I am not supposed to use it like that, how do I tell Enzyme how it should allocate shadow memory for the dynamically allocated shared memory?

code:

using CUDA, Enzyme, Test

function mul_kernel(A, shared)
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
        A[i] = shared[i]
    end
    return nothing
end

function grad_mul_kernel(A, dA, sh, d_sh)
    Enzyme.autodiff_deferred(mul_kernel, Const, Duplicated(A, dA), Duplicated(sh, d_sh))
    return nothing
end

A = CUDA.ones(64,)
@cuda threads=length(A) shmem=64*4 mul_kernel(A, CuDynamicSharedArray(Float32, 64))
A = CUDA.ones(64,)
dA = similar(A)
dA .= 1
@cuda threads=length(A) shmem=64*4 grad_mul_kernel(A, dA, CuDynamicSharedArray(Float32, 64), CuDynamicSharedArray(Float32, 64))
@test all(dA .== 2)

error:

error: <inline asm>:1:16: invalid register name
        mov.u32 %edx, %dynamic_smem_size;

Why not create the dynamic shared memory for the derivative on the host as well then pass them both into grad kernel


Thanks for the response! However, even simple code like the snippet below, which does not yet compute a derivative, does not work - so instantiating the shared memory outside the kernel definition generally seems to be problematic. Hence I do not completely follow your suggestion; could you elaborate on what you would modify in the example above?

shmem_a = CuDynamicSharedArray(Float32, 64)
@cuda threads=length(A) shmem=64*4 mul_kernel(A, shmem_a)

Not quite, I’m thinking something like (on mobile, untested):

@cuda threads=length(A) shmem=64*4 gradkernel(A, dA, CuDynamicSharedArray(Float32, 64), CuDynamicSharedArray(Float32, 64))

I modified the code in the first post at the top to include the suggestion; the same problem still exists even at the level of mul_kernel - a plain kernel launch, before Enzyme is involved at all.

Have you tried creating the shared memory inside the kernel as it is done here?

Thanks @maxfreu, I tried:

using CUDA, Enzyme, Test

function mul_kernel(A)
    shared=CuDynamicSharedArray(Float32, length(A))
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
        A[i] = shared[i]
    end
    return nothing
end

function grad_mul_kernel(A, dA)
    Enzyme.autodiff_deferred(mul_kernel, Const, Duplicated(A, dA))
    return nothing
end

A = CUDA.ones(64,)
@cuda threads=length(A) shmem=64*4 mul_kernel(A)
A = CUDA.ones(64,)
dA = similar(A)
dA .= 1
@cuda threads=length(A) shmem=64*4 grad_mul_kernel(A, dA)
@test all(dA .== 2)

and got

ERROR: GPUCompiler.InvalidIRError(GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(MethodInstance for grad_mul_kernel(::CuDeviceVector{Float32, 1}, ::CuDeviceVector{Float32, 1}), GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}(GPUCompiler.PTXCompilerTarget(v"8.6.0", v"7.5.0", true, nothing, nothing, nothing, nothing, false, nothing, nothing), CUDA.CUDACompilerParams(v"8.6.0", v"8.2.0"), true, nothing, :specfunc, false, 2), 0x0000000000007b37), Tuple{String, Vector{Base.StackTraces.StackFrame}, Any}[("dynamic function invocation", [grad_mul_kernel at get_lin_synth_dat.jl:143], EnzymeCore.autodiff_deferred)])
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/validation.jl:147
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:460 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/Lw5SP/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:459 [inlined]
  [5] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/utils.jl:103
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/utils.jl:97 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:136
  [8] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:111
  [9] compile
    @ ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:103 [inlined]
 [10] #1145
    @ ~/.julia/packages/CUDA/75aiI/src/compiler/compilation.jl:254 [inlined]
 [11] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{…}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:52
 [12] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/driver.jl:42
 [13] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/75aiI/src/compiler/compilation.jl:253
 [14] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:128
 [15] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:103
 [16] macro expansion
    @ ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:369 [inlined]
 [17] macro expansion
    @ ./lock.jl:267 [inlined]
 [18] cufunction(f::typeof(grad_mul_kernel), tt::Type{Tuple{CuDeviceVector{…}, CuDeviceVector{…}}}; kwargs::@Kwargs{})
    @ CUDA ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:364
 [19] cufunction(f::typeof(grad_mul_kernel), tt::Type{Tuple{CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}}})
    @ CUDA ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:361
 [20] top-level scope
    @ ~/.julia/packages/CUDA/75aiI/src/compiler/execution.jl:112
Some type information was truncated. Use `show(err)` to see complete types.

So now it works with mul_kernel, but not for the gradient, right? Unfortunately, I don’t know any more than this. Looks like a bug in Enzyme to me :-/


Not sure why CUDA.jl doesn’t give nicer error messages here, but the Enzyme API needs the mode (e.g. Reverse) as the first argument.
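
For the in-kernel shared memory version above, that would look roughly like this (untested sketch; the exact autodiff_deferred signature may vary between Enzyme.jl versions):

function grad_mul_kernel(A, dA)
    # the mode (Reverse) goes first, before the function being differentiated
    Enzyme.autodiff_deferred(Enzyme.Reverse, mul_kernel, Const, Duplicated(A, dA))
    return nothing
end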


Yes, that was an issue. However, when I added Enzyme.Reverse, compilation now hangs for a couple of hours. Still, the issue is low priority for me at the moment, and I do not want to take you away from other tasks to look into this.

Enzyme.autodiff_deferred(Enzyme.Reverse, mul_kernel, Const, Duplicated(A, dA))

@vchuravy could this be related to the deadlock issue you were looking into earlier for Michel?


No, that only occurred with execution on the CPU, and this seems to be GPU-only?

Yes, GPU only.