# Converting an image rotation demo from JuliaCon 2021 (AMD -> NVIDIA)

I’ve been trying to convert the `rotate_kernel` function from the 2021 JuliaCon GPU Workshop to work with CUDA, but I’m having trouble.

I’m new to building kernels on the GPU, but my current implementations of these functions are as follows:

```julia
function rotate_kernel(out, inp, angle)
    x_idx = (blockDim().x * (blockIdx().x - 1)) + threadIdx().x
    y_idx = (blockDim().y * (blockIdx().y - 1)) + threadIdx().y
    x_centidx = x_idx - (size(inp, 1) ÷ 2)
    y_centidx = y_idx - (size(inp, 2) ÷ 2)
    x_outidx = round(Int, (x_centidx * cos(angle)) + (y_centidx * -sin(angle)))
    y_outidx = round(Int, (x_centidx * sin(angle)) + (y_centidx * cos(angle)))
    x_outidx += size(inp, 1) ÷ 2
    y_outidx += size(inp, 2) ÷ 2
    if (1 <= x_outidx <= size(out, 1)) && (1 <= y_outidx <= size(out, 2))
        out[x_outidx, y_outidx] = inp[x_idx, y_idx]
    end
    return
end
```

and

```julia
function exec_gpu(f, sz, args...)
    @cuda f(args...)
end
```

where my “lilly” array is a 250×250 array of `Float64` values:

```julia
lilly = rand(250, 250)
lilly_gpu = CuArray(lilly)
lilly_rotated = similar(lilly_gpu)
lilly_rotated .= 0
```

and my functions are being called as follows:

```julia
exec_gpu(rotate_kernel, size(lilly_gpu), lilly_rotated, lilly_gpu, deg2rad(37))
```

This seems to run fine on my machine, giving me an output of:

```
CUDA.HostKernel{typeof(rotate_kernel), Tuple{CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}, Float64}}(rotate_kernel, CuFunction(Ptr{Nothing} @0x00000000c2fbef10, CuModule(Ptr{Nothing} @0x00000000c29103b0, CuContext(0x000000008baf59a0, instance 83eb46d269112d80))), CUDA.KernelState(Ptr{Nothing} @0x0000000604000000))
```

But when I display `Array(lilly_rotated)` as an image, I only see the zeros array, not the rotated array. What am I doing wrong? Any help is appreciated!

You’re only launching a single thread, so it’s expected that your output will be zero. If you check the `@roc` invocation from the workshop, it specifies `groupsize=(32,32) gridsize=sz`. You will similarly need to launch multiple CUDA threads and blocks.
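For example, something along these lines should work (an untested sketch, using the array names from your post):

```julia
# Sketch: a 2D launch covering the whole 250×250 array, mirroring the
# workshop's groupsize=(32,32)/gridsize launch. Assumes `lilly_rotated`,
# `lilly_gpu`, and `rotate_kernel` as defined in the post above.
using CUDA

threads = (32, 32)
blocks  = cld.(size(lilly_gpu), threads)   # enough blocks to cover every pixel
@cuda threads=threads blocks=blocks rotate_kernel(lilly_rotated, lilly_gpu, deg2rad(37))
```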

@maleadt thank you for your suggestion! Based on other samples in the CUDA.jl GitHub repository, I was able to construct the following script:

```julia
using CUDA
using PyPlot
using CUDA: i32

function rotate_kernel(out, inp, angle)
    x_idx = (blockDim().x * (blockIdx().x - 1)) + threadIdx().x
    y_idx = (blockDim().y * (blockIdx().y - 1)) + threadIdx().y
    x_centidx = x_idx - (size(inp, 1) ÷ 2)
    y_centidx = y_idx - (size(inp, 2) ÷ 2)
    x_outidx = round(Int, (x_centidx * cos(angle)) + (y_centidx * -sin(angle)))
    y_outidx = round(Int, (x_centidx * sin(angle)) + (y_centidx * cos(angle)))
    x_outidx += size(inp, 1) ÷ 2
    y_outidx += size(inp, 2) ÷ 2
    if (1 <= x_outidx <= size(out, 1)) && (1 <= y_outidx <= size(out, 2))
        out[x_outidx, y_outidx] = inp[x_idx, y_idx]
    end
    return
end

n = 300
dev = CuDevice(0)
angle = deg2rad(37)

dims = (n, n)
a = round.(rand(Float32, dims) * 100)
out = similar(a)
out .= 0

d_a = CuArray(a)
d_out = CuArray(out)
len = prod(dims)

kernel = @cuda launch=false rotate_kernel(d_out, d_a, angle)
config = launch_configuration(kernel.fun)
threads = min(len, config.threads)
blocks = cld(len, threads)

# Run rotation kernel
kernel(d_out, d_a, angle; threads=threads, blocks=blocks)

# Image original/rotated arrays
figure()
subplot(1, 2, 1)
imshow(Array(d_a)); axis("off")
subplot(1, 2, 2)
imshow(Array(d_out)); axis("off")
tight_layout()
show(); gcf()
```

Per your suggestion, the threads and blocks are determined from the size of my input array. The input and output arrays are shown below. I suspect the single-lined output is a result of me asking the CPU to show the output array before the GPU has completed its computations.

Is this accurate? Currently, I’m working on a way to make sure the GPU has completed its processing before the output array is plotted by the CPU. According to the CUDA.jl documentation, `CUDA.synchronize` seems to fit the bill, but I’ve yet to find a working solution.

I would really appreciate any hints or suggestions you may have, and thank you so much for all your help and insight!

The `Array(d_out)` copy back to the host is implicitly synchronizing (as long as you execute it on the same task, which you seem to be doing here), so you don’t need an explicit call to `synchronize()`. But it doesn’t hurt, of course.
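Concretely, both of these are fine (a sketch, assuming the variables from your script):

```julia
# Assumes `d_out` and the kernel launch from the script above.
using CUDA

# The device-to-host copy synchronizes implicitly on the current task:
out_host = Array(d_out)

# Equivalent, with an explicit barrier first (not required, but harmless):
CUDA.synchronize()
out_host = Array(d_out)
```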

Also try using an image from the TestImages package or so, to check the results. The CUDA.jl 1.1 release post on the JuliaGPU blog shows how such an image can be used with CUDA.jl, albeit in the context of textures (which may be relevant for you as well, since they make it easier and faster to interpolate between points).
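To give a rough idea of what that looks like, here is a heavily hedged sketch of texture-based rotation with interpolation. The API names (`CuTextureArray`, `CuTexture`, `CUDA.LinearInterpolation`) and the exact coordinate conventions have evolved across CUDA.jl versions, so check the current docs before relying on this:

```julia
# Hedged sketch: rotate by inverse-mapping each output pixel into the input
# and letting the texture hardware interpolate between neighbouring points.
# API names and coordinate conventions should be verified against the
# CUDA.jl documentation for your version.
using CUDA

function rotate_tex_kernel(out, tex, angle)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(out, 1) && j <= size(out, 2)
        ci = i - size(out, 1) ÷ 2
        cj = j - size(out, 2) ÷ 2
        # inverse rotation: [cos sin; -sin cos]
        u = ci * cos(angle) + cj * sin(angle) + size(out, 1) ÷ 2
        v = -ci * sin(angle) + cj * cos(angle) + size(out, 2) ÷ 2
        out[i, j] = tex[u, v]   # hardware-interpolated fetch
    end
    return
end

inp = CuArray(rand(Float32, 250, 250))
tex = CuTexture(CuTextureArray(inp); interpolation = CUDA.LinearInterpolation())
out = similar(inp)
@cuda threads=(32, 32) blocks=cld.(size(out), (32, 32)) rotate_tex_kernel(out, tex, Float32(deg2rad(37)))
```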