I am trying to write a simple CUDA kernel that swaps pairs of columns in a square matrix and puts a sign on one of them:

using CUDA, BenchmarkTools
function swap_and_sign!(A::Matrix)
n = size(A, 1) ÷ 2
for i in 1:2n
for j in 2:2:2n
A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
end
end
nothing
end
function swap_and_sign!(A::CuArray)
n = size(A, 1) ÷ 2
threads = n
blocks = 2n
@inline function kernel(data)
i, j = blockIdx().x, 2 * threadIdx().x
data[i,j], data[i,j-1] = data[i,j-1], -data[i,j]
nothing
end
@cuda threads=threads blocks=blocks kernel(A)
nothing
end
arr = rand(1000, 1000)
cuarr = arr |> cu
@btime CUDA.@sync swap_and_sign!(cuarr)
# 2.469 ms (19 allocations: 576 bytes)
@btime swap_and_sign!(arr)
# 1.563 ms (0 allocations: 0 bytes)

As you see, even for a 1000×1000 matrix this is slower than running the same on the CPU. Is there something more clever I could do with indexing and number of threads and blocks to speed things up, or is this performance expected?

For reference, the GPU is a GeForce MX250 (so not the beefiest) and the CPUa Intel i7-10850H @ 2.7GHz

If I remember correctly there might be special things to do to properly benchmark GPU cude, besides just CUDA.@sync. But a quick fix you could try is switching your array to Float32, that’s usually the type for which GPUs are optimized.

There’s many optimizations you could apply, from making sure there’s no exceptions, optimizing the iteration order to maximize coalescing, optimizing the launch configuration, etc. See Performance Tips · CUDA.jl for a couple of Julia-specific hints, but ultimately this is a CUDA optimization problem.

That said, your GPU just seems extraordinarily underpowered here. On my workstation with high-end CPU and GPU hardware, I get 500us for the CPU version, and 36us for the GPU version.

What specifically are you thinking about? Just doing CUDA.@sync should be sufficient. There’s also the integrated profiler, of course (CUDA.@profile and CUDA.@bprofile) which show a lot of additional information that could help optimize this.

function swap_and_sign_transpose!(A::Matrix)
n = size(A, 1) ÷ 2
@inbounds for j in 2:2:2n
for i in 1:2n
A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
end
end
nothing
end

if you want to improve the CPU’s performance. Replace @inbounds with Polyester.@batch if you want multithreading.

Regarding making the type Float32: Sending the array to the GPU via arr |> cu seems to do that automatically. cuarr shows as CuArray{Float32, 2, CUDA.DeviceMemory}.

It turned out it was indeed mostly the GPU being enormously underpowered. Running the same code on a proper GPU gave me similar numbers to yours and further optimising as shown below made it even faster. But since the main culprit was the weak GPU and not actually something enourmously stupid in the code I am gonna accept that answer.

function swap_and_sign!(A::CuArray)
n = size(A, 1) ÷ 2
@inline function kernel_fun!(data)
i = (blockIdx().x-1) * blockDim().x + threadIdx().x
i_stride = gridDim().x * blockDim().x
# multiply with 2 here for correct indexing.
j = 2*((blockIdx().y-1) * blockDim().y + threadIdx().y)
j_stride = 2 * gridDim().y * blockDim().y
while i <= 2n
while j <= 2n
data[i,j], data[i,j-1] = data[i,j-1], -data[i,j]
j += j_stride
end
i += i_stride
end
nothing
end
kernel = @cuda launch=false kernel_fun!(A)
config = launch_configuration(kernel.fun)
# this seemed to give the faster iteration order
threads_x = min(2n, config.threads)
threads_y = min(fld(config.threads, threads_x), n) # put all remaining threads in here
blocks_y = cld(n, threads_y)
blocks_x = cld(2n, threads_x)
threads = (threads_x, threads_y)
blocks = (blocks_x, blocks_y)
CUDA.@sync kernel(A; threads, blocks)
nothing
end