How to run PTX code on a CUDA GPU from Julia?

Hi

I am trying to create a NN library that directly outputs a highly optimized version of PTX code; I would even be open to emitting lower-level code if possible. But of course I see I have to go step by step, as things get exponentially more complex the lower I go, and I have to understand many more details.

Can anyone help me with how to run PTX (or lower-level, even binary) code on a CUDA GPU?

I saw that CUDA.jl has CuDeviceFunction and cufunction that could be used, but I couldn’t understand how to use them to run low-level code.

Thank you for any help!

You can use the lower-level CuModule to load a CUBIN binary and CuFunction(mod, name) to get a function handle. ptxas can be used to compile the PTX into a CUBIN.


using CUDA

N, D = 1, 1

a = cu(randn(Float32, N, D));
b = cu(randn(Float32, N, D));
out = cu(zeros(Float32, N, N));

ptx_path = "output.ptx"
cubin_path = "output.cubin"

# Compile the PTX to a CUBIN for the target architecture
run(`$(CUDA.ptxas()) --gpu-name sm_75 $ptx_path --output-file $cubin_path --verbose`)

# Load the CUBIN and look up the kernel by its (mangled) name
mod = CuModule(read(cubin_path))
func = CuFunction(mod, "pairwise_l2_kernel_0d1d2d3c4c5c")

# Launch: argument types first, then the arguments, then the launch configuration
CUDA.@sync CUDA.cudacall(
    func, (CuPtr{Float32}, CuPtr{Float32}, CuPtr{Float32}),
    a, b, out;
    blocks=1, threads=1, shmem=512,
)
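For reference, the PTX input this flow expects looks roughly like the following hand-written sketch (untested and purely illustrative; the kernel name `add_first` and its three-pointer signature are made up for this example, not the pairwise_l2 kernel above):

```
.version 7.0
.target sm_75
.address_size 64

// Hypothetical kernel: out[0] = a[0] + b[0]
.visible .entry add_first(
    .param .u64 a_ptr,
    .param .u64 b_ptr,
    .param .u64 out_ptr
)
{
    .reg .b64 %rd<7>;
    .reg .f32 %f<4>;

    ld.param.u64 %rd1, [a_ptr];
    ld.param.u64 %rd2, [b_ptr];
    ld.param.u64 %rd3, [out_ptr];
    cvta.to.global.u64 %rd4, %rd1;
    cvta.to.global.u64 %rd5, %rd2;
    cvta.to.global.u64 %rd6, %rd3;

    ld.global.f32 %f1, [%rd4];
    ld.global.f32 %f2, [%rd5];
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%rd6], %f3;
    ret;
}
```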
2 Likes

Wow! It’s crazy that I can basically run .cubin files!

Any idea on why I get error at this line:

mod = CuModule(read(cubin_path))

ERROR: CUDA error: device kernel image is invalid (code 300, ERROR_INVALID_SOURCE)

Am I correct to assume this code works perfectly for you? I tried multiple different working PTX files but always got the same message, and I even tried changing sm_75 to sm_52.

The error comes from module.jl, so I guess res is something like “INVALID_SOURCE”.

1 Like

It’s not strictly required to compile PTX to a CUBIN; you can invoke CuModule with PTX code too, and have the driver JIT-compile it to native code.
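As a sketch of that route (assuming, per the above, that CuModule accepts PTX source directly; passing it as a String rather than raw bytes is my assumption about how CUDA.jl distinguishes PTX text from a CUBIN image):

```julia
using CUDA

# Read the PTX as text and let the driver JIT-compile it for
# whatever GPU is present, instead of running ptxas yourself.
ptx = read("output.ptx", String)
mod = CuModule(ptx)
func = CuFunction(mod, "pairwise_l2_kernel_0d1d2d3c4c5c")
```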

2 Likes

Apart from that though, cuModuleLoadDataEx should support both CUBIN and PTX input. Maybe verify that the cubin is valid (e.g. pass it to cuobjdump or nvdisasm or so) and check that the buffer is NULL terminated?

1 Like

That way, it worked. I have no idea what the problem with my explicit build could be; I will look into your PTX build process too.

But now I’m stuck at the next line:

func = CuFunction(mod, "pairwise_l2_kernel_0d1d2d3c4c5c")

Where does this function name come from? “pairwise_l2_kernel_0d1d2d3c4c5c”

That’s the kernel you define in your PTX input.
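If you are unsure which names are available, the launchable kernels are the `.entry` symbols declared in the PTX source, so a plain grep finds them. A small self-contained demo (the one-line PTX fragment and the `test_sum` name here are stand-ins for your real file):

```shell
# Write a one-line PTX fragment to grep against (stand-in for your real file)
cat > demo.ptx <<'EOF'
.visible .entry test_sum(
EOF

# List kernel entry points; on your real file: grep '\.visible \.entry' output.ptx
grep '\.visible \.entry' demo.ptx
```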

1 Like

cuobjdump returns nothing, while nvdisasm prints out the asm.
I guess the CUBIN is fine then.

I could use “test_sum” as the function name in this case.

Now I crashed my GPU by running the code, so there must be some mistake with the pointers. :smiley:
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)

I think I set it up based on your recommendations; just this crash has to be solved! I’ll be back once everything works perfectly.

I modified the .cu file and regenerated the .ptx to make sure it doesn’t access past the edges of the arrays. No success.

I tried everything I could, and now I’ve made it work by first running with blocks=1, threads=1:

CUDA.@sync CUDA.cudacall(
    func, 
    (CuPtr{Float32}, CuPtr{Float32}, CuPtr{Float32}),
    a, b, out;
    blocks=1, threads=1, shmem=0,
)

After this first launch I can change threads to anything.
Really strange behaviour.

I’ve now tried it this way multiple times, and it always succeeds. What could this problem be?

I know it’s the weekend, but I want to share that the problem was with my .cu kernel. Really. :smiley: In the most basic kernel.

The problem was that in the .cu kernel I used this:
int i = ((blockIdx.x - 1) * blockDim.x) + threadIdx.x + 1;
instead of:
int i = blockIdx.x * blockDim.x + threadIdx.x;
(I had carried Julia’s 1-based indexing over, but blockIdx and threadIdx are already 0-based in CUDA C.)
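A quick plain-Julia check (no GPU needed; the block size of 256 is just an example) shows why the first formula crashes: in block 0 the 1-based-style arithmetic produces negative offsets, which is exactly an illegal memory access.

```julia
blockDim_x = 256
bx = 0  # first block is 0 in CUDA C, not 1

# 1-based-style formula: for threads 0:255 this spans -255:0
wrong = [((bx - 1) * blockDim_x) + tx + 1 for tx in 0:255]

# correct 0-based formula: spans 0:255
right = [bx * blockDim_x + tx for tx in 0:255]

extrema(wrong)  # (-255, 0): negative offsets => illegal memory access
extrema(right)  # (0, 255)
```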

Thank you to everyone for the help, this is already extraordinary! :smiling_face_with_three_hearts:

2 Likes