# Slow speed-up in simple GPU kernel

I am writing my first GPU kernel, which essentially boils down to the simplified functions below. I barely get a 1.7× speed-up over the multi-threaded version, so I'm wondering whether/how the performance of the GPU version can be improved.

```julia
using CUDA
using BenchmarkTools

N = 2^20
F = fill(false, N)
Q = fill(true, N)
d = rand(N)
Fd, Qd = CuArray(F), CuArray(Q)
dd = CuArray(d)
Δ = rand()

function update!(F, Q, dist, Δ)
    @inbounds for i in eachindex(F)
        F[i] = false
        if Q[i] && (dist[i] ≤ Δ)
            F[i] = true
        end
    end
end

function update_thread!(F, Q, dist, Δ)
    Threads.@threads for i in eachindex(F)
        @inbounds begin
            F[i] = false
            if Q[i] && (dist[i] ≤ Δ)
                F[i] = true
            end
        end
    end
end

function _gpu_update!(F, Q, dist, Δ)
    index = threadIdx().x
    stride = blockDim().x
    @inbounds for i = index:stride:length(F)
        F[i] = false
        if Q[i] && (dist[i] ≤ Δ)
            F[i] = true
        end
    end
    return nothing
end

function gpu_update!(F::CuArray{Bool}, Q::CuArray{Bool}, dist::CuArray{T}, Δ) where T
    CUDA.@sync begin
        @cuda threads = 1024 _gpu_update!(F, Q, dist, Δ)
    end
end

@btime update!($F, $Q, $d, $Δ);        # 4.062 ms (0 allocations: 0 bytes)
@btime update_thread!($F, $Q, $d, $Δ); # 1.005 ms (20 allocations: 1.94 KiB) on 4 threads
@btime gpu_update!($Fd, $Qd, $dd, $Δ); # 579.145 μs (179 allocations: 5.77 KiB)
```

You could try

```julia
F[i] = Q[i] & (dist[i] ≤ Δ)
```

It is free of branches; I'm not sure whether the compiler can eliminate the branches in your version.
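Along the same lines, the whole loop can be collapsed into a single fused broadcast, which works on plain `Array`s and, via CUDA.jl, compiles to one GPU kernel for `CuArray`s. A minimal sketch (the function name `branchless_update!` is just an illustrative choice, not from the thread):

```julia
# Fused broadcast: one pass over the arrays, no explicit branches.
# Works on CPU Arrays and on CuArrays alike.
function branchless_update!(F, Q, dist, Δ)
    F .= Q .& (dist .≤ Δ)
    return nothing
end
```

Because the broadcast fuses, no temporary arrays are allocated for the intermediate `Q .& …` result.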

The problem is that you’re only launching a single block. You need to launch multiple blocks, otherwise you’re essentially only using a single SM of your GPU.

You are right; now it works properly, with a ~66× speed-up over the single-threaded function. Thanks!

```julia
function _gpu_update!(F, Q, dist, Δ)
    # Global (1-based) thread index and grid-stride
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    @inbounds for i = index:stride:length(F)
        F[i] = false
        if Q[i] && (dist[i] ≤ Δ)
            F[i] = true
        end
    end
    return
end

function gpu_update!(F::CuArray{Bool}, Q::CuArray{Bool}, dist::CuArray{T}, Δ) where T
    nt = 256
    numblocks = cld(length(F), nt)  # ceiling division: enough blocks to cover F
    CUDA.@sync begin
        @cuda threads = nt blocks = numblocks _gpu_update!(F, Q, dist, Δ)
    end
end
```