CUDA.jl - A Clear Example of Dynamic Parallelism

Hello all,

I am not very experienced with CUDA.jl, and I cannot find examples of some GPU programming concepts. Dynamic parallelism is a wonderful feature, but I could not find an example to learn from and apply to my program.

Let’s say we have a kernel where every thread works independently. However, each thread contains a for loop that could itself be parallelized. As a user, I want to parallelize my parent kernel as well as the inner for loops. This is quite a generic situation. How do I write a program that does this?

  1. How do I define a kernel that will be launched from inside another kernel?
  2. How do I launch a kernel from inside a kernel?
  3. How do I define the work that the child threads do?
  4. How do I index the child threads?
  5. In general, how can I make use of dynamic parallelism in CUDA.jl?

Can someone provide an example? I think it would also serve as a reference for future users. Thank you.


Have a look at the test suite: https://github.com/JuliaGPU/CUDA.jl/blob/46084844e30f58141c6fa60512810ab5c8a412e3/test/execution.jl#L824-L840

Thank you for the examples; I understand how they work now. I’ll leave an example here so other people can take a look. Please correct me if something is wrong.

# Import related libraries
using CUDA

# Example parent kernel: the first five threads each launch a child grid
function example_parent()
    tidx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if tidx <= 5
        @cuda threads=(32, 1, 1) dynamic=true example_child(tidx)
    end
    return nothing
end

# Example child kernel: the first three threads each print a message
function example_child(tidx)
    tidxx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if tidxx <= 3
        @cuprintln("the message from: parent thread $tidx and child thread $tidxx")
    end
    return nothing
end

@cuda threads=(32, 1, 1) example_parent()
synchronize()  # wait for all kernels to finish so the device output is flushed
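Launched like this, five parent threads each spawn a child grid of 32 threads, of which three print, so you should see 15 lines of output in no particular order.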


Hi there!

This entry has been extremely helpful to me, so thank you first for that. I have managed to do a couple of simple tests with dynamic parallelism and I am amazed by the results. However, now I wonder if there is a way of calculating the optimal launch configuration for the child kernels. Generally, I launch GPU kernels with this function, which relies on CUDA.launch_configuration to retrieve the optimal number of threads and blocks in order to maximize GPU occupancy.

function runkernel_optimal!(kernel_fn::Function, B::Int, args...)
    kernel = @cuda launch=false kernel_fn(args...)  # compile without launching
    config = CUDA.launch_configuration(kernel.fun)  # occupancy-based suggestion
    threads = min(B, config.threads)
    blocks = cld(B, threads)
    kernel(args...; threads, blocks)
end
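For illustration, a call could look like this (scale! is just a made-up example kernel, not something from my actual code):

function scale!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] *= 2f0
    end
    return nothing
end

x = CUDA.rand(Float32, 10_000)
runkernel_optimal!(scale!, length(x), x)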

Is there some way I could also optimize the occupancy of child kernels? I’m especially worried about launching more threads than are available at the moment.

Thanks in advance. Regards.
David

cudadevrt (the device-side counterpart of the CUDA runtime library) does seem to provide cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags, so I guess it’s possible to use the occupancy API on-device. We haven’t wrapped those functions yet, though; see libcudadevrt.jl in CUDA.jl for what’s possible.
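As an untested sketch, a hand-rolled wrapper could follow the ccall pattern CUDA.jl uses for its other cudadevrt intrinsics. The name max_active_blocks is made up here, and I’m assuming the device-side signature mirrors the host occupancy API:

# hypothetical wrapper; assumes the cudadevrt signature matches the host API:
# cudaError_t cudaOccupancyMaxActiveBlocksPerMultiprocessor(
#     int* numBlocks, const void* func, int blockSize, size_t dynamicSMemSize)
function max_active_blocks(numBlocks, func, blockSize, dynamicSMemSize)
    ccall("extern cudaOccupancyMaxActiveBlocksPerMultiprocessor", llvmcall,
          CUDA.cudaError_t,
          (Ptr{Cint}, Ptr{Cvoid}, Cint, Csize_t),
          numBlocks, func, blockSize, dynamicSMemSize)
end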


So I have taken a look at libcudadevrt.jl and also at the CUDA toolkit documentation, which led me to write this code:

function cudaOccupancyMaxPotentialBlockSize(minGridSize, blockSize, func, dynamicSMemSize, blockSizeLimit)
	ccall("extern cudaOccupancyMaxPotentialBlockSize", llvmcall, CUDA.cudaError_t,
	(Ptr{Cint}, Ptr{Cint}, Ptr{Cvoid}, Csize_t, Cint),
	minGridSize, blockSize, func, dynamicSMemSize, blockSizeLimit)
	return
end

function child_kernel(a::CuDeviceVector{Int32})
	a[1] = Int32(0)
	return
end

function parent_kernel(a::CuDeviceVector{Int32})
	kernel = @cuda launch=false dynamic=true child_kernel(a)
	br = Ref{Cint}()
	tr = Ref{Cint}()
	#CUDA.cudaGetDeviceCount(br)
	cudaOccupancyMaxPotentialBlockSize(br, tr, kernel.fun, Int32(0), Int32(0))
	a[1] = br[]
	a[2] = tr[]
	return
end

a = CUDA.zeros(Int32, 2)
@cuda parent_kernel(a)
println(a)

The simpler call to cudaGetDeviceCount does work, but the call to cudaOccupancyMaxPotentialBlockSize ends up in a LoadError: Failed to link PTX code (nvlink exited with code 255): nvlink error: Undefined reference to 'cudaOccupancyMaxPotentialBlockSize' in '/tmp/jl_eyJo5Z.cubin'. Does this mean that the functionality is not available in the library, or am I just missing something else?

That function isn’t listed in https://github.com/JuliaGPU/CUDA.jl/blob/master/src/device/intrinsics/libcudadevrt.jl; the available ones are cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags.
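An untested sketch of querying occupancy through one of the available functions, reusing the hypothetical max_active_blocks wrapper from above (the block size of 256 is an arbitrary candidate value, and whether this links and runs on-device is unverified):

function parent_kernel(a::CuDeviceVector{Int32})
    kernel = @cuda launch=false dynamic=true child_kernel(a)
    nb = Ref{Cint}()
    # how many blocks of 256 threads could be resident per SM for this kernel?
    max_active_blocks(nb, kernel.fun, 256, 0)
    a[1] = nb[]
    return
end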
