CUDA.jl write to global memory in PTX

I am developing a transpiler for a uni project that takes an Expr and converts it into PTX code, which can then be called to evaluate that expression (in the context of symbolic regression, meaning many different values for the variables in the expression).
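For example, the kind of input I mean is something like this (a made-up expression, not my actual code):

expr = :(x1 + 2.0f0 * x2) # a single Expr, evaluated for many different values of x1 and x2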

Now my problem is that I want to store the result of the evaluation in a CuArray passed in as a parameter. This, however, results in an error (shown further down). Note that I am executing this code in a unit test.

This is the most minimalistic version where this error occurs:

ptx = "
	.version 7.1
	.target sm_61
	.address_size 32

	.visible .entry ExpressionProcessing(
	.param .u32 param_1)
	{
		.reg .u32   %parameter<1>;
		.reg .u32   %r<1>;

		ld.param.u32   %r0, [param_1];
	cvta.to.global.u32   %parameter0, %r0;
		st.global.f32  [%parameter0], 10.0;
		ret;
	}"

linker = CuLink()
add_data!(linker, "ExpressionProcessing", ptx)

image = complete(linker)

mod = CuModule(image)
func = CuFunction(mod, "ExpressionProcessing")

cudaResults = CuArray{Float32}(undef, 1)
cudacall(func, Tuple{CuPtr{Float32}}, cudaResults; threads=1, blocks=1)

As soon as I execute this, I get the following error:

error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc))
throw_api_error at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/libcuda.jl:30
check at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/libcuda.jl:37 [inlined]
cuStreamDestroy_v2 at /home/daniel/.julia/packages/GPUToolbox/cZlg7/src/ccalls.jl:33 [inlined]
#946 at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/stream.jl:89
#context!#1014 at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/state.jl:168
context! at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/state.jl:163
unsafe_destroy! at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/stream.jl:88
jfptr_unsafe_destroyNOT._13449 at /home/daniel/.julia/compiled/v1.11/CUDA/oWw5k_J6l2h.so (unknown line)
run_finalizer at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gc.c:299
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gc.c:389
run_finalizers at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gc.c:435
ijl_atexit_hook at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/init.c:299
jl_repl_entrypoint at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jlapi.c:1060
main at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
unknown function (ip: 0x7f145c25d249)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)

As far as I know, this indicates that I am accessing memory I am not allowed to access (0x2bc is error 700, i.e. CUDA_ERROR_ILLEGAL_ADDRESS), which I think should not be the case?

If I try to read the result like so after the cudacall call:

println(Array(cudaResults))

The error message explodes to this:

WARNING: Error while freeing DeviceMemory(4 bytes at 0x0000000b02000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc))

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/RQqFT/lib/cudadrv/libcuda.jl:30
  [2] check
    @ ~/.julia/packages/CUDA/RQqFT/lib/cudadrv/libcuda.jl:37 [inlined]
  [3] cuMemFreeAsync
    @ ~/.julia/packages/GPUToolbox/cZlg7/src/ccalls.jl:33 [inlined]
  [4] free(mem::CUDA.DeviceMemory; stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/RQqFT/lib/cudadrv/memory.jl:87
  [5] free
    @ ~/.julia/packages/CUDA/RQqFT/lib/cudadrv/memory.jl:82 [inlined]
  [6] #1124
    @ ~/.julia/packages/CUDA/RQqFT/src/memory.jl:710 [inlined]
  [7] #context!#1014
    @ ~/.julia/packages/CUDA/RQqFT/lib/cudadrv/state.jl:168 [inlined]
  [8] context!
    @ ~/.julia/packages/CUDA/RQqFT/lib/cudadrv/state.jl:163 [inlined]
  [9] _pool_free
    @ ~/.julia/packages/CUDA/RQqFT/src/memory.jl:709 [inlined]
 [10] macro expansion
    @ ./timing.jl:421 [inlined]
 [11] pool_free(managed::CUDA.Managed{CUDA.DeviceMemory})
    @ CUDA ~/.julia/packages/CUDA/RQqFT/src/memory.jl:691
 [12] release(::GPUArrays.RefCounted{CUDA.Managed{CUDA.DeviceMemory}})
    @ GPUArrays ~/.julia/packages/GPUArrays/uiVyU/src/host/abstractarray.jl:42
 [13] unsafe_free!
    @ ~/.julia/packages/GPUArrays/uiVyU/src/host/abstractarray.jl:100 [inlined]
 [14] unsafe_free!(x::CuArray{Float32, 1, CUDA.DeviceMemory})
    @ GPUArrays ~/.julia/packages/GPUArrays/uiVyU/src/host/abstractarray.jl:115
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc))
throw_api_error at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/libcuda.jl:30
check at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/libcuda.jl:37 [inlined]
cuModuleUnload at /home/daniel/.julia/packages/GPUToolbox/cZlg7/src/ccalls.jl:33 [inlined]
#978 at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/module.jl:92 [inlined]
#context!#1014 at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/state.jl:168 [inlined]
context! at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/state.jl:163 [inlined]
unsafe_unload! at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/module.jl:91
jfptr_unsafe_unloadNOT._13775 at /home/daniel/.julia/compiled/v1.11/CUDA/oWw5k_J6l2h.so (unknown line)
run_finalizer at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gc.c:299
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gc.c:389
run_finalizers at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gc.c:435
ijl_atexit_hook at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/init.c:299
jl_repl_entrypoint at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jlapi.c:1060
main at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
unknown function (ip: 0x7ff463b01249)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc))
throw_api_error at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/libcuda.jl:30
check at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/libcuda.jl:37 [inlined]
cuStreamDestroy_v2 at /home/daniel/.julia/packages/GPUToolbox/cZlg7/src/ccalls.jl:33 [inlined]
#946 at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/stream.jl:89
#context!#1014 at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/state.jl:168
context! at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/state.jl:163
unsafe_destroy! at /home/daniel/.julia/packages/CUDA/RQqFT/lib/cudadrv/stream.jl:88
jfptr_unsafe_destroyNOT._13449 at /home/daniel/.julia/compiled/v1.11/CUDA/oWw5k_J6l2h.so (unknown line)
run_finalizer at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gc.c:299
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gc.c:389
run_finalizers at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gc.c:435
ijl_atexit_hook at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/init.c:299
jl_repl_entrypoint at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jlapi.c:1060
main at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
unknown function (ip: 0x7ff463b01249)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)

And this would indicate that CUDA.jl / Julia wants to free memory it can't, and I don't know why.

I am pretty new to Julia and to CUDA as well, so maybe I am missing something obvious. I have been trying to fix this error for a couple of hours now, and I think I have gone blind to the solution, so I have come to this place.

I have updated Julia and CUDA.jl, and I have tried many different ways of allocating the memory on the GPU, like these:

cudaResults = CUDA.alloc(CUDA.DeviceMemory, 1 * sizeof(Float32))
cudaResults = CUDA.fill(0f0, 1)
cudaResults = cu(zeros(Float32, 1))

I have also tried launching the kernel with

launch(func, cudaResults; threads=threads, blocks=blocks)

And I have even asked an AI, but that didn't help either.

I think I just can't see the problem, so I really hope some of you can help me find it or point me to resources. If there is any information missing, let me know and I will provide it.

CUDA errors are sticky, so once a kernel invalidly touches memory, subsequent API calls (like the free here) will error similarly.
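If you want to see that error at its source, synchronizing right after the launch should surface the illegal access as a CuError at the call site instead of in some later finalizer (untested sketch):

cudacall(func, Tuple{CuPtr{Float32}}, cudaResults; threads=1, blocks=1)
CUDA.synchronize() # the sticky kernel error gets reported at the next API call, i.e. here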

I can’t look into this right now, but you could try comparing the PTX code with the equivalent CUDA.jl kernel (be sure to use @inbounds to cut down on the generated code).

Hi, thanks for the fast answer!

Could you guide me on how to inspect the generated PTX code for an equivalent kernel?

I tried the following:

function test_kernel(results)
	@inbounds results[1] = 10f0

	return nothing
end

results = CuArray{Float32}(undef, 1)
@device_code_ptx test_kernel(results)

But I just get this error:

Scalar indexing is disallowed.
  Invocation of setindex! resulted in scalar indexing of a GPU array.
  This is typically caused by calling an iterating implementation of a method.
  Such implementations *do not* execute on the GPU, but very slowly on the CPU,
  and therefore should be avoided.

The CUDA.jl documentation said that for more information I should look at the GPUCompiler.jl documentation, but it seems there is none, so I am a bit lost. Maybe I am already too tired; either way, I would appreciate help in getting to see the generated PTX code.

You just need to add @cuda:

julia> @device_code_ptx @cuda test_kernel(results)
// PTX CompilerJob of MethodInstance for test_kernel(::CuDeviceVector{Float32, 1}) for sm_86

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_86
.address_size 64

        // .globl       _Z11test_kernel13CuDeviceArrayI7Float32Lx1ELx1EE // -- Begin function _Z11test_kernel13CuDeviceArrayI7Float32Lx1ELx1EE
                                        // @_Z11test_kernel13CuDeviceArrayI7Float32Lx1ELx1EE
.visible .entry _Z11test_kernel13CuDeviceArrayI7Float32Lx1ELx1EE(
        .param .align 8 .b8 _Z11test_kernel13CuDeviceArrayI7Float32Lx1ELx1EE_param_0[16],
        .param .align 8 .b8 _Z11test_kernel13CuDeviceArrayI7Float32Lx1ELx1EE_param_1[32]
)
{
        .reg .b32       %r<2>;
        .reg .b64       %rd<2>;

// %bb.0:                               // %conversion
        ld.param.u64    %rd1, [_Z11test_kernel13CuDeviceArrayI7Float32Lx1ELx1EE_param_1];
        mov.u32         %r1, 1092616192;
        st.global.u32   [%rd1], %r1;
        ret;
                                        // -- End function
}

Great, thanks, now it works!

Also, other great news: with this I was able to fix my problem. The culprit was the 32-bit addressing: the device pointer that CUDA.jl passes in is 64 bits wide, so my .address_size 32 kernel with its .u32 parameter truncated the pointer and wrote to an illegal address.

Here is the updated PTX code:

.version 8.5 // changed from 7.1 to 8.5
.target sm_61
.address_size 64 // changed from 32 to 64

.visible .entry ExpressionProcessing(
.param .u64 param_1) // changed from .u32 to .u64
{
	.reg .b64   %parameter<1>; // changed from .u32 to .b64
	.reg .b64   %i<1>; // changed from .u32 to .b64

	ld.param.u64   %i0, [param_1]; // changed from .u32 to .u64
	cvta.to.global.u64   %parameter0, %i0; // changed from .u32 to .u64

	st.global.f32  [%parameter0], 10.0;
	ret;
}

With these changes I was also able to get my more complicated PTX code running again!
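For completeness, reading the result back now also works; with the same host code as above, this should print Float32[10.0]:

cudaResults = CuArray{Float32}(undef, 1)
cudacall(func, Tuple{CuPtr{Float32}}, cudaResults; threads=1, blocks=1)
println(Array(cudaResults)) # should print Float32[10.0]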

Thank you guys for helping me out!
