How to run ptx code on CUDA from julia?

Marcell_Havlik · January 26, 2024, 1:31pm

Hi

I am trying to create a NN library that directly output a very optimized version of ptx code, also would be open to output even lower level code if possible. But of course I see I have to go step by step as it is getting exponentially more and more complex to go lower and have to understand times more details.

Can anyone help me on how to run, ptx or lower level of code or even binary on the CUDA gpu?

I saw CUDA.jl has CuDeviceFunction and cufunction that could be used, but couldn’t understand how could I use this to run low level codes.

Thank you for any help!

Pangoraw · January 26, 2024, 1:52pm

You can use the lower-level CuModule to load a cubin binary and CuFunction(mod name) to create a cufunction. ptxas can be used to get cubin from the ptx.


using CUDA

N, D = 1, 1

a = cu(randn(Float32,N,D));
b = cu(randn(Float32,N,D));
out = cu(zeros(Float32,N,N));

ptx_path = "output.ptx"
cubin_path = "output.cubin"

run(`$(CUDA.ptxas()) --gpu-name sm_75 $ptx_path --output-file $cubin_path --verbose`)

mod = CuModule(read(cubin_path))
func = CuFunction(mod, "pairwise_l2_kernel_0d1d2d3c4c5c")

CUDA.@sync CUDA.cudacall(
    func, (CuPtr{Float32}, CuPtr{Float32}, CuPtr{Float32}),
    a, b, out;
    blocks=1, threads=1, shmem=512,
)

Marcell_Havlik · January 26, 2024, 2:14pm

Wow! This is crazy if I can run .cubin files basically!

Marcell_Havlik · January 26, 2024, 2:51pm

Any idea on why I get error at this line:

mod = CuModule(read(cubin_path))

ERROR: CUDA error: device kernel image is invalid (code 300, ERROR_INVALID_SOURCE)

For you this code works perfectly do I assume it correctly? I tried multiple different working ptx files but always the same message, even tried to change the sm_75 to sm_52.

The code module.jl where the error comes. So I guess res is something “INVALID_SOURCE”

maleadt · January 26, 2024, 3:02pm

It’s not strictly required to compile PTX to a CUBIN; you can invoke CuModule with PTX code too, and have the driver JIT-compile it to native code.

maleadt · January 26, 2024, 3:04pm

Apart from that though, cuModuleLoadDataEx should support both CUBIN and PTX input. Maybe verify that the cubin is valid (e.g. pass it to cuobjdump or nvdisasm or so) and check that the buffer is NULL terminated?

Marcell_Havlik · January 26, 2024, 3:07pm

With that way, it worked. No idea what could be the problem with my explicit build. I will look into your ptx build process too.

Also this way I stuck at the next line:

func = CuFunction(mod, "pairwise_l2_kernel_0d1d2d3c4c5c")

Where is this functionname comes from? “pairwise_l2_kernel_0d1d2d3c4c5c”

maleadt · January 26, 2024, 3:08pm

That’s the kernel you define in your PTX input.

Marcell_Havlik · January 26, 2024, 3:11pm

cuobjdump return with nothing.
nvdisasm prints out the asm.
I guess this will be good then.

Marcell_Havlik · January 26, 2024, 3:14pm

I could use the “test_sum” in this case.

Now I crashed my gpu by running the code. So there must be some mistake with the ptrs.
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)

I think I made it based on your recommendations, just this crash has to be solved! Will be back if everything works perfectly.

Marcell_Havlik · January 26, 2024, 3:27pm

I modified this .cu file and regenerated the .ptx file to be sure that it doesn’t access the edge of the arrays just to make sure. No success.

Tried everything I could and now I made it work with running with block=1, threads=1:

CUDA.@sync CUDA.cudacall(
    func, 
    (CuPtr{Float32}, CuPtr{Float32}, CuPtr{Float32}),
    a, b, out;
    blocks=1, threads=1, shmem=0,
)

After this I can change threads to anything.
Really strange behaviour.

And now I tried it multiple times already this way. It always succeeds. What can this problem be.

Marcell_Havlik · January 26, 2024, 8:42pm

I know it is weekend. But want to share the problem was with my .cu kernel. Really. In the most basic kernel.

The problem was that in the cu kernel I used this:
int I = ((blockIdx.x - 1) * blockDim.x) + threadIdx.x+1;
instead of:
int i = blockIdx.x*blockDim.x + threadIdx.x;

Thank you for the help for everyone, this is already extraordinary!