CUDA | nested loops kernel

Hello,

As part of learning to use CUDA in Julia, I wrote the following MWE
to compare threaded CPU, CUDA array programming, and CUDA kernel programming performance for a simple element-by-element array multiplication.

As expected, the CUDA array programming code is significantly faster than the threaded CPU code with three nested loops.

However, the CUDA kernel with three nested loops is, well, significantly slower than the CPU version.

I did not find any examples of nested loops in CUDA kernels while searching the CUDA.jl documentation or Julia Discourse, so I would like to ask what I'm doing wrong and what the recommended method is.

julia> include("test_CUDA_v1.jl")
run_CPU!
  26.030 ms (62 allocations: 7.84 KiB)

z[1:5, 1, 1] = ComplexF64[-1.0 + 3.0im, -1.0 + 3.0im, -1.0 + 3.0im, -1.0 + 3.0im, -1.0 + 3.0im]

run_GPU!
  1.366 ms (119 allocations: 3.72 KiB)

z[1:5, 1, 1] = ComplexF64[-1.0 + 3.0im, -1.0 + 3.0im, -1.0 + 3.0im, -1.0 + 3.0im, -1.0 + 3.0im]

Run GPU kernel: test_nested_loops_GPU!
  2.198 s (33 allocations: 592 bytes)

z[1:5, 1, 1] = ComplexF64[-1.0 + 3.0im, -1.0 + 3.0im, -1.0 + 3.0im, -1.0 + 3.0im, -1.0 + 3.0im]
# test_CUDA_v1.jl

using BenchmarkTools
using CUDA

function run_CPU!(
            z::Array{Complex{Float64}, 3},
            y::Array{Complex{Float64}, 3},
            x::Array{Complex{Float64}, 3},
            n_rows::Int,
            n_cols::Int,
            n_slcs::Int
        )

    @inbounds Threads.@threads for m in 1:n_slcs 
        for l in 1:n_cols
            for k in 1:n_rows
                z[k, l, m] = y[k, l, m] * x[k, l, m]
            end
        end
    end

    return nothing
end

function run_GPU!(            
            z_cu::CuArray{Complex{Float32}, 3},
            y_cu::CuArray{Complex{Float32}, 3},
            x_cu::CuArray{Complex{Float32}, 3}
        )

    CUDA.@sync @inbounds z_cu .= y_cu .* x_cu

    return nothing
end

# GPU kernel
function nested_loops_GPU!(
            z_cu,
            y_cu,
            x_cu
        )
    # X axis
    index_x = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    stride_x = gridDim().x * blockDim().x

    # Y axis
    index_y = (blockIdx().y - Int32(1)) * blockDim().y + threadIdx().y
    stride_y = gridDim().y * blockDim().y

    # Z axis
    index_z = (blockIdx().z - Int32(1)) * blockDim().z + threadIdx().z
    stride_z = gridDim().z * blockDim().z

    k = index_z
    while k <= size(z_cu, 3)

        j = index_y
        while j <= size(z_cu, 2)

            i = index_x
            while i <= size(z_cu, 1)
                
                @inbounds z_cu[i, j, k] = y_cu[i, j, k] * x_cu[i, j, k]
                i += stride_x

            end
            j += stride_y

        end
        k += stride_z

    end

    return nothing
end

function test_nested_loops_GPU!(
            z_cu::CuArray{Complex{Float32}, 3},
            y_cu::CuArray{Complex{Float32}, 3},
            x_cu::CuArray{Complex{Float32}, 3}
        )

    kernel =
        @cuda launch=false nested_loops_GPU!(
                                z_cu,
                                y_cu,
                                x_cu
                            )
    config = launch_configuration(kernel.fun)
    threads = min(length(z_cu), config.threads)
    blocks = cld(length(z_cu), threads)

    # println("threads: ", threads)
    # println("blocks: ", blocks)

    CUDA.@sync begin
        kernel(
            z_cu,
            y_cu,
            x_cu;
            threads,
            blocks
        )
    end
end

function main()
    n_rows = 256
    n_cols = 256
    n_slcs = 192

    x = fill(1.0 + 1.0im, n_rows, n_cols, n_slcs)
    y = fill(1.0 + 2.0im, n_rows, n_cols, n_slcs)
    z = zeros(Complex{Float64}, n_rows, n_cols, n_slcs)

    println("run_CPU!")
    @btime run_CPU!(
        $z,
        $y,
        $x,
        $n_rows,
        $n_cols,
        $n_slcs
    )
    println("")

    @show z[1:5, 1, 1]
    println("")

    fill!(z, zero(eltype(z)))

    # CuArrays
    x_cu = CuArray{Complex{Float32}, 3}(x)
    y_cu = CuArray{Complex{Float32}, 3}(y)
    z_cu = CuArray{Complex{Float32}, 3}(z)

    println("run_GPU!")
    @btime run_GPU!(
        $z_cu,
        $y_cu,
        $x_cu
    )
    println("")

    z = Array{Complex{Float64}, 3}(z_cu)

    @show z[1:5, 1, 1]
    println("")

    fill!(z, zero(eltype(z)))

    println("Run GPU kernel: test_nested_loops_GPU!")
    @btime test_nested_loops_GPU!(        
        $z_cu,
        $y_cu,
        $x_cu
    )
    println("")

    z = Array{Complex{Float64}, 3}(z_cu)

    @show z[1:5, 1, 1]
    println("")

end

begin
    main()
end

You’re launching a 1D grid (scalar threads and blocks), so the y and z indices and dimensions will always be 1. Doesn’t that imply that each thread will fully iterate the entire range of k and j values, massively overlapping the work?

I’d recommend using linear indices (both for the launch and the index/stride calculation in the kernel). If you really want a ND index, you can convert the linear index to a Cartesian one within the kernel (using CartesianIndices(ndarray)[linear_index]).
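
In case it helps, here is an untested sketch of what I mean, reusing the names from your MWE: a single grid-stride loop over the linear index, with an optional conversion to a CartesianIndex inside the kernel.

# Sketch (untested): linear grid-stride loop; the CartesianIndices lookup is
# only needed if the kernel body wants the (i, j, k) structure.
function linear_index_GPU!(z_cu, y_cu, x_cu)
    start  = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    cart = CartesianIndices(z_cu)
    i = start
    @inbounds while i <= length(z_cu)
        I = cart[i]                     # CartesianIndex(i, j, k)
        z_cu[I] = y_cu[I] * x_cu[I]
        i += stride
    end

    return nothing
end

# Launched with a 1D configuration, as in your test_nested_loops_GPU!:
# kernel  = @cuda launch=false linear_index_GPU!(z_cu, y_cu, x_cu)
# config  = launch_configuration(kernel.fun)
# threads = min(length(z_cu), config.threads)
# blocks  = cld(length(z_cu), threads)
# CUDA.@sync kernel(z_cu, y_cu, x_cu; threads, blocks)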


I see.

Doesn’t that imply that each thread will fully iterate the entire range of k and j values, massively overlapping the work?

After thinking about it, yes.

I’d recommend using linear indices (both for the launch and the index/stride calculation in the kernel). If you really want a ND index, you can convert the linear index to a Cartesian one within the kernel (using CartesianIndices(ndarray)[linear_index]).

I think I've implemented linear indices in test_reshaped_1D_GPU! and Cartesian indices in test_CartesianIndices_3D_GPU! in the MWE code below, and both execute in times comparable to the array programming version.

julia> include("test_CUDA_v1.jl")
run_CPU!
  25.762 ms (62 allocations: 7.47 KiB)

run_array_GPU!
  1.352 ms (119 allocations: 3.72 KiB)

Run GPU kernel: test_reshaped_1D_GPU!
  1.198 ms (33 allocations: 592 bytes)

Run GPU kernel: test_CartesianIndices_3D_GPU!
  1.210 ms (33 allocations: 592 bytes)

I’ll go with the Cartesian indices version as I think that it will be more useful for more complex kernels.

I'd suggest that the documentation include some examples of GPU kernels for multi-dimensional arrays; they would be helpful to absolute CUDA beginners such as myself.

Thanks for your reply and comments.

MWE:

# test_CUDA_v1.jl

using BenchmarkTools
using CUDA

function run_CPU!(
            z::Array{Complex{Float64}, 3},
            y::Array{Complex{Float64}, 3},
            x::Array{Complex{Float64}, 3}
        )

    @inbounds Threads.@threads for ktr in eachindex(z)
        z[ktr] = y[ktr] * x[ktr]
    end

    return nothing
end

function run_array_GPU!(            
            z_cu::CuArray{Complex{Float32}, 3},
            y_cu::CuArray{Complex{Float32}, 3},
            x_cu::CuArray{Complex{Float32}, 3}
        )

    CUDA.@sync @inbounds z_cu .= y_cu .* x_cu

    return nothing
end

# GPU kernel: reshaped 1D arrays
function reshaped_1D_GPU!(
            z_1D_cu,
            y_1D_cu,
            x_1D_cu
        )
    # 1D axis
    index_x = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    stride_x = gridDim().x * blockDim().x

    i = index_x
    @inbounds while i <= length(z_1D_cu)
        z_1D_cu[i] = y_1D_cu[i] * x_1D_cu[i]

        i += stride_x
    end

    return nothing
end

function test_reshaped_1D_GPU!(
            z_1D_cu::CuArray{Complex{Float32}, 1},
            y_1D_cu::CuArray{Complex{Float32}, 1},
            x_1D_cu::CuArray{Complex{Float32}, 1}
        )

    kernel =
        @cuda launch=false reshaped_1D_GPU!(
                                z_1D_cu,
                                y_1D_cu,
                                x_1D_cu
                            )
    config = launch_configuration(kernel.fun)
    threads = min(length(z_1D_cu), config.threads)
    blocks = cld(length(z_1D_cu), threads)

    # println("threads: ", threads)
    # println("blocks: ", blocks)

    CUDA.@sync begin
        kernel(
            z_1D_cu,
            y_1D_cu,
            x_1D_cu;
            threads,
            blocks
        )
    end
end

# GPU kernel: Cartesian indices 3D
function CartesianIndices_3D_GPU!(
                z_cu,
                y_cu,
                x_cu,
                j_Crtsn
            )
    # X axis
    index_x = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    stride_x = gridDim().x * blockDim().x

    i = index_x
    @inbounds while i <= length(j_Crtsn)
        j = j_Crtsn[i]
        z_cu[j] = y_cu[j] * x_cu[j]

        i += stride_x
    end

    return nothing
end

function test_CartesianIndices_3D_GPU!(
                z_cu::CuArray{Complex{Float32}, 3},
                y_cu::CuArray{Complex{Float32}, 3},
                x_cu::CuArray{Complex{Float32}, 3},
                j_Crtsn::CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}
            )

    kernel =
        @cuda launch=false CartesianIndices_3D_GPU!(
                                z_cu,
                                y_cu,
                                x_cu,
                                j_Crtsn
                            )
    config = launch_configuration(kernel.fun)
    threads = min(length(z_cu), config.threads)
    blocks = cld(length(z_cu), threads)

    # println("threads: ", threads)
    # println("blocks: ", blocks)

    CUDA.@sync begin
        kernel(
            z_cu,
            y_cu,
            x_cu,
            j_Crtsn;
            threads,
            blocks
        )
    end
end

function main()
    n_rows = 256
    n_cols = 256
    n_slcs = 192

    x = fill(1.0 + 1.0im, n_rows, n_cols, n_slcs)
    y = fill(1.0 + 2.0im, n_rows, n_cols, n_slcs)
    z = zeros(Complex{Float64}, n_rows, n_cols, n_slcs)

    #
    # CPU: threaded Cartesian indices
    #

    println("run_CPU!")
    @btime run_CPU!(
        $z,
        $y,
        $x
    )
    println("")

    @show z[1:5, 1, 1]
    println("")

    #
    # GPU: array programming
    #

    fill!(z, zero(eltype(z)))

    # CuArrays
    x_cu = CuArray{Complex{Float32}, 3}(x)
    y_cu = CuArray{Complex{Float32}, 3}(y)
    z_cu = CuArray{Complex{Float32}, 3}(z)

    println("run_array_GPU!")
    @btime run_array_GPU!(
        $z_cu,
        $y_cu,
        $x_cu
    )
    println("")

    z = Array{Complex{Float64}, 3}(z_cu)

    @show z[1:5, 1, 1]
    println("")

    #
    # GPU: reshaped 1D arrays kernel
    #

    # Zero the z array
    fill!(z, zero(eltype(z)))

    # Reshaped 1D arrays
    x_1D = reshape(x, (n_rows * n_cols * n_slcs, ))
    y_1D = reshape(y, (n_rows * n_cols * n_slcs, ))
    z_1D = reshape(z, (n_rows * n_cols * n_slcs, ))

    # CuArrays: 1D
    x_1D_cu = CuArray{Complex{Float32}, 1}(x_1D)
    y_1D_cu = CuArray{Complex{Float32}, 1}(y_1D)
    z_1D_cu = CuArray{Complex{Float32}, 1}(z_1D)

    println("Run GPU kernel: test_reshaped_1D_GPU!")
    @btime test_reshaped_1D_GPU!(        
        $z_1D_cu,
        $y_1D_cu,
        $x_1D_cu
    )
    println("")

    z_1D = Array{Complex{Float64}, 1}(z_1D_cu)

    z = reshape(z_1D, (n_rows, n_cols, n_slcs))

    @show z[1:5, 1, 1]
    @show z[n_rows - 5:end, 1, 1]
    @show z[1, n_cols - 5:end, 1]
    @show z[1, 1, n_slcs - 5:end]
    println("")

    #
    # GPU: Cartesian indices 3D arrays kernel
    #

    # Zero the z array
    fill!(z, zero(eltype(z)))

    j_Crtsn = CartesianIndices(z)

    # CuArrays
    x_cu = CuArray{Complex{Float32}, 3}(x)
    y_cu = CuArray{Complex{Float32}, 3}(y)
    z_cu = CuArray{Complex{Float32}, 3}(z)

    println("Run GPU kernel: test_CartesianIndices_3D_GPU!")
    @btime test_CartesianIndices_3D_GPU!(        
        $z_cu,
        $y_cu,
        $x_cu,
        $j_Crtsn
    )
    println("")

    z = Array{Complex{Float64}, 3}(z_cu)

    @show z[1:5, 1, 1]
    @show z[n_rows - 5:end, 1, 1]
    @show z[1, n_cols - 5:end, 1]
    @show z[1, 1, n_slcs - 5:end]
    println("")
end

begin
    main()
end

Just to clarify: you could launch a 3D grid, simply by replacing

threads = min(length(z_cu), config.threads)
blocks = cld(length(z_cu), threads)

with something like

threads = (8, 8, 8)  # probably not optimal
blocks = cld.(size(z_cu), threads)

which yields:
run_CPU!
  25.851 ms (42 allocations: 5.00 KiB)

run_array_GPU!
  787.000 μs (119 allocations: 3.72 KiB)

Run GPU kernel: test_reshaped_1D_GPU!
  776.200 μs (32 allocations: 576 bytes)

Run GPU kernel: test_CartesianIndices_3D_GPU!
  777.400 μs (33 allocations: 592 bytes)

Run GPU kernel: test_nested_loops_GPU!
  774.100 μs (32 allocations: 576 bytes)

There’s just not really any benefit to it over linear indexing, since there’s no 3D structure to exploit. If you would implement, say, a 2D image convolution, then using a 2D grid would actually make sense (though a 1D grid in combination with CartesianIndices would also work of course).
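
In case it's useful, a full 3D launch of the nested_loops_GPU! kernel from your first post would then look something like this (untested sketch):

kernel = @cuda launch=false nested_loops_GPU!(z_cu, y_cu, x_cu)

threads = (8, 8, 8)                  # probably not optimal
blocks  = cld.(size(z_cu), threads)  # one block per 8x8x8 tile of the array

CUDA.@sync kernel(z_cu, y_cu, x_cu; threads, blocks)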

Thanks for your explanation. You anticipated my next question.

The configuration

threads = (32, 8, 4)
blocks = cld.(size(z_cu), threads)

yields

Run GPU kernel: test_CartesianIndices_3D_GPU!
  1.196 ms (33 allocations: 592 bytes)


Run GPU kernel: test_nested_loops_3D_GPU!
  1.202 ms (33 allocations: 592 bytes)

One additional question: what is the best method to incorporate the 3D nested loop version into the occupancy API?

I experimented with

config = launch_configuration(kernel.fun)
# 1024 threads per block available
# threads = (32, 8, 4)

threads =
    (
        config.threads ÷ 32,
        config.threads ÷ 128,
        config.threads ÷ 256
    )

blocks = cld.(size(z_cu), threads)

which yields the same execution time:

Run GPU kernel: test_nested_loops_3D_GPU!
  1.201 ms (33 allocations: 592 bytes)

You probably want a cube root in there, so that prod(threads) == config.threads always holds. I.e. if config.threads == 512 instead of 1024, you would currently use only 128 threads. Some max(1, config.threads ÷ ...) might also be prudent. Maybe you also want to explicitly force threads[1] to be a multiple of 32.
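
For instance, a rough, untested heuristic along those lines (a warp-sized x dimension, with the rest of the budget split between y and z) could look like:

# Sketch: pick a 3D thread shape from the occupancy API's thread budget.
# For the usual power-of-two budgets, the product of the result equals max_threads.
function pick_threads_3d(max_threads::Integer)
    tx   = min(32, max_threads)          # keep threads[1] a multiple of the warp size
    rest = max(1, max_threads ÷ tx)
    ty   = prevpow(2, max(1, isqrt(rest)))
    tz   = max(1, rest ÷ ty)
    return (tx, ty, tz)
end

# e.g. pick_threads_3d(1024) == (32, 4, 8); pick_threads_3d(512) == (32, 4, 4)
# threads = pick_threads_3d(config.threads)
# blocks  = cld.(size(z_cu), threads)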

The optimal grid size will presumably also depend on the specifics of the kernel and the size of the inputs. So basically, I don't think there's a general best method 🙂