CUDA.jl - A Clear Example of Dynamic Parallelism

Hello all,

I am not very experienced with CUDA.jl, and I cannot find examples of some GPU programming concepts. Dynamic parallelism is a wonderful feature, but I could not find an example to learn from and apply to my program.

Let’s say we have a kernel where every thread works independently. However, each thread contains a for loop that could itself be parallelized. As a user, I want to parallelize my parent kernel as well as the inner for loops. This is quite a generic situation. How do I write a program that does this?

  1. How do I define a kernel that will be launched from inside another kernel?
  2. How do I launch a kernel from inside a kernel?
  3. How do I define the work that the child threads do?
  4. How do I index the child threads?
  5. In general, how can I make use of dynamic parallelism in CUDA.jl?

Can someone provide an example? I think it would also serve as a reference for future users. Thank you.


Have a look at the test suite: https://github.com/JuliaGPU/CUDA.jl/blob/46084844e30f58141c6fa60512810ab5c8a412e3/test/execution.jl#L824-L840

Thank you for the examples; I understand how they work now. I’ll leave an example here so other people can take a look. Please correct me if something is wrong.

# Import related libraries
using CUDA

# Example parent kernel: the first five threads each launch a child grid
function example_parent()
    tidx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if tidx <= 5
        @cuda threads=(32, 1, 1) dynamic=true example_child(tidx)
    end
    return nothing
end

# Example child kernel: the first three threads each print a message
function example_child(tidx)
    tidxx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if tidxx <= 3
        @cuprintln("the message from: parent thread $tidx and child thread $tidxx")
    end
    return nothing
end

@cuda threads=(32, 1, 1) example_parent()
synchronize()  # wait for all kernels to finish so the device output is flushed
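Launched like this, five parent threads each spawn a child grid of 32 threads, of which three print, so you should see 15 lines of output in no particular order.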


Hi there!

This entry has been extremely helpful to me, so thank you first for that. I have managed to do a couple of simple tests with dynamic parallelism and I am amazed by the results. However, now I wonder if there is a way of calculating the optimal launch configuration for the child kernels. Generally, I launch GPU kernels with this function, which relies on CUDA.launch_configuration to retrieve the optimal number of threads and blocks in order to maximize GPU occupancy.

function runkernel_optimal!(kernel_fn::Function, B::Int, args...)
    kernel = @cuda launch=false kernel_fn(args...)  # compile without launching
    config = CUDA.launch_configuration(kernel.fun)  # occupancy-based suggestion
    threads = min(B, config.threads)
    blocks = cld(B, threads)
    kernel(args...; threads, blocks)
end
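For illustration, a call could look like this (scale! is just a made-up example kernel, not something from my actual code):

function scale!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] *= 2f0
    end
    return nothing
end

x = CUDA.rand(Float32, 10_000)
runkernel_optimal!(scale!, length(x), x)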

Is there some way I could also optimize the occupancy of child kernels? I’m especially worried about launching more threads than are available at the moment.

Thanks in advance. Regards.
David

cudadevrt (the device-side counterpart of the CUDA runtime library) does seem to provide cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags, so I guess it’s possible to use the occupancy API on-device. We haven’t wrapped those functions yet, though; see libcudadevrt.jl in CUDA.jl for what’s possible.
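As an untested sketch, a hand-rolled wrapper could follow the ccall pattern CUDA.jl uses for its other cudadevrt intrinsics. The name max_active_blocks is made up here, and I’m assuming the device-side signature mirrors the host occupancy API:

# hypothetical wrapper; assumes the cudadevrt signature matches the host API:
# cudaError_t cudaOccupancyMaxActiveBlocksPerMultiprocessor(
#     int* numBlocks, const void* func, int blockSize, size_t dynamicSMemSize)
function max_active_blocks(numBlocks, func, blockSize, dynamicSMemSize)
    ccall("extern cudaOccupancyMaxActiveBlocksPerMultiprocessor", llvmcall,
          CUDA.cudaError_t,
          (Ptr{Cint}, Ptr{Cvoid}, Cint, Csize_t),
          numBlocks, func, blockSize, dynamicSMemSize)
end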


So I have taken a look at libcudadevrt.jl and also at the CUDA toolkit documentation, which led me to write this code:

function cudaOccupancyMaxPotentialBlockSize(minGridSize, blockSize, func, dynamicSMemSize, blockSizeLimit)
	ccall("extern cudaOccupancyMaxPotentialBlockSize", llvmcall, CUDA.cudaError_t,
	(Ptr{Cint}, Ptr{Cint}, Ptr{Cvoid}, Csize_t, Cint),
	minGridSize, blockSize, func, dynamicSMemSize, blockSizeLimit)
	return
end

function child_kernel(a::CuDeviceVector{Int32})
	a[1] = Int32(0)
	return
end

function parent_kernel(a::CuDeviceVector{Int32})
	kernel = @cuda launch=false dynamic=true child_kernel(a)
	br = Ref{Cint}()
	tr = Ref{Cint}()
	#CUDA.cudaGetDeviceCount(br)
	cudaOccupancyMaxPotentialBlockSize(br, tr, kernel.fun, Int32(0), Int32(0))
	a[1] = br[]
	a[2] = tr[]
	return
end

a = CUDA.zeros(Int32, 2)
@cuda parent_kernel(a)
println(a)

The simpler call to cudaGetDeviceCount does work, but the call to cudaOccupancyMaxPotentialBlockSize ends up in a LoadError: Failed to link PTX code (nvlink exited with code 255): nvlink error: Undefined reference to 'cudaOccupancyMaxPotentialBlockSize' in '/tmp/jl_eyJo5Z.cubin'. Does this mean that the functionality is not available in the library, or am I just missing something else?

That function isn’t listed in https://github.com/JuliaGPU/CUDA.jl/blob/master/src/device/intrinsics/libcudadevrt.jl; the available ones are cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags.
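An untested sketch of querying occupancy through one of the available functions, reusing the hypothetical max_active_blocks wrapper from above (the block size of 256 is an arbitrary candidate value, and whether this links and runs on-device is unverified):

function parent_kernel(a::CuDeviceVector{Int32})
    kernel = @cuda launch=false dynamic=true child_kernel(a)
    nb = Ref{Cint}()
    # how many blocks of 256 threads could be resident per SM for this kernel?
    max_active_blocks(nb, kernel.fun, 256, 0)
    a[1] = nb[]
    return
end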
