CUDA.jl - A Clear Example of Dynamic Parallelism

Hello all,

I am not very experienced with CUDA.jl, and I cannot find many examples covering GPU programming concepts. Dynamic parallelism is a wonderful feature, but I could not find an example to learn from and apply to my program.

Let’s say we have a kernel in which every thread works independently. However, each thread contains for loops that could also be parallelized. As a user, I want to parallelize my parent kernel as well as the internal for loops. This is a fairly generic situation. How would one write a program to do this?

  1. How do I define a kernel that will be launched from inside another kernel?
  2. How do I launch a kernel from inside a kernel?
  3. How do I define the work the child threads should do?
  4. How do I index the child threads?
  5. In general, how can I use dynamic parallelism in CUDA.jl?

Can someone provide an example? I think it would also serve as a reference for future users. Thank you.

Have a look at the test suite: CUDA.jl/execution.jl at 46084844e30f58141c6fa60512810ab5c8a412e3 · JuliaGPU/CUDA.jl · GitHub

Thank you for the examples. I now understand how they work. I'll leave an example here so other people can take a look. Please correct me if something is wrong.

# Import related libraries
using CUDA

# This is an example child kernel: every thread (up to 3) prints a message
function example_child(tidx)
    tidxx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if tidxx <= 3
        @cuprintln("the message from: parent thread $tidx and child thread $tidxx")
    end
    return nothing
end

# This is an example parent kernel: every thread (up to 5) launches the child kernel
function example_parent()
    tidx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if tidx <= 5
        @cuda threads = (32, 1, 1) dynamic = true example_child(tidx)
    end
    return nothing
end

@cuda threads = (32, 1, 1) example_parent()
CUDA.synchronize()  # wait on the host for the parent (and its nested child) kernels to finish
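
One addition, in case it helps others comparing approaches: when the inner loop has a fixed, known trip count, the same nested parallelism can often be expressed without dynamic parallelism at all, by covering both loop levels with a single 2D launch. Here is a sketch of that alternative (untested on my end; the kernel name example_flat and the launch dimensions are my own choices, not from the thread above):

```julia
using CUDA

# Alternative without dynamic parallelism: one 2D grid covers both loop levels.
# Each (x, y) thread pair plays the role of one (parent, child) pair above.
function example_flat()
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # "parent" index
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y  # "child" index
    if i <= 5 && j <= 3
        @cuprintln("the message from: parent thread $i and child thread $j")
    end
    return nothing
end

@cuda threads = (32, 4, 1) example_flat()
CUDA.synchronize()  # wait for the kernel to finish before reading its output
```

This avoids the overhead of device-side launches, at the cost of flexibility: dynamic parallelism is the better fit when the amount of child work is only known inside the parent kernel.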
