# How to improve the calculation speed of this kernel on GPU?

Hello everyone!

I have worked quite a bit on setting up a robust foundation for structuring my code and computation loop, with quite a few good pointers from you guys on the forum, so I hope to be able to learn more from you I have a self-contained MWE with automatic benchmarking produced here:

Code Block (MWE)
``````# Packages
using CUDA
using Parameters
using StaticArrays
using LinearAlgebra
using BenchmarkTools

# Constants
const T     = Float32
const ST    = 3
const H     = T(0.5)
const αD    = T((7/(4*π*H^2)))
const ϵ     = T(1e-6)
const NP    = 10^4
const NL    = 10^7

# Main Struct for Gradient Values
@with_kw struct ∇ᵢWᵢⱼStruct{T,ST}
NL::Base.RefValue{Int}
MaxValidIndex::Base.RefValue{Int} = deepcopy(NL)

# Input for calculation
xᵢⱼ     ::AbstractVector{SVector{ST,T}} = zeros(SVector{ST,T},NL[])
xᵢⱼ²    ::AbstractVector{T}             = similar(xᵢⱼ,T,NL[])
dᵢⱼ     ::AbstractVector{T}             = similar(xᵢⱼ,T,NL[])
qᵢⱼ     ::AbstractVector{T}             = similar(xᵢⱼ,T,NL[])
∇ᵢWᵢⱼ   ::AbstractVector{SVector{ST,T}} = similar(xᵢⱼ,NL[])
end

# Helper struct to convert onto GPU
struct ∇ᵢWᵢⱼDeviceStruct{ST,T}
xᵢⱼ     ::CuDeviceVector{SVector{ST,T}, 1}
xᵢⱼ²    ::CuDeviceVector{T, 1}
dᵢⱼ     ::CuDeviceVector{T, 1}
qᵢⱼ     ::CuDeviceVector{T, 1}
∇ᵢWᵢⱼ   ::CuDeviceVector{SVector{ST,T}, 1}
end

# Helper function to resize the specific structs and
# its arrays as how I want it to happen
function Base.resize!(object::∇ᵢWᵢⱼStruct, N::Integer)
# If the current N is larger than the allocated size
# then reallocate a new array with size of N
if N > object.NL[]
object.NL[] = N
for P in propertynames(object)
arr = getfield(object, P)
if isa(arr,AbstractVector)
resize!(arr, N)
fill!(arr,zero(eltype(arr)))
end
end
else
# If the current N is <= to the allocated size
# keep the current array, update MaxValidIndex
# and reset values up to that index instead
object.MaxValidIndex[] = N
for P in propertynames(object)
arr = getfield(object, P)
if P !== :xᵢⱼ
if isa(arr,AbstractVector)
fill!(@view(arr[1:N]),zero(eltype(arr)))
end
end
end

end

return nothing
end

# Adapt Structure for Conversion to GPU
)

NV              = V.MaxValidIndex[]
V.xᵢⱼ²[1:NV]   .= dot.(@view(V.xᵢⱼ[1:NV]),@view(V.xᵢⱼ[1:NV]))
V.dᵢⱼ[1:NV]    .= sqrt.(@view(V.xᵢⱼ²[1:NV]))
V.qᵢⱼ[1:NV]    .= @view(V.dᵢⱼ[1:NV]) ./ H

@. V.∇ᵢWᵢⱼ[1:NV] = αD *5 * (@view(V.qᵢⱼ[1:NV])- 2)^3 * @view(V.qᵢⱼ[1:NV]) / (8H * (@view(V.qᵢⱼ[1:NV]) * H + ϵ)) * @view(V.xᵢⱼ[1:NV])

return nothing
end

### --------------------------- Writing Our Own Custom Kernel  ---------------------------

# Actual Calculation of Gradient in Kernel
# Per default
# CUDA.registers(kernel)
# 9
# We wish to lower as specified by https://github.com/JuliaComputing/Training/blob/master/AdvancedGPU/2-2-kernel_analysis_optimization.ipynb
index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
for i = index:stride:Nmax
xdot                  = dot(V.xᵢⱼ[i],V.xᵢⱼ[i])
@inbounds V.xᵢⱼ²[i]  += xdot
dist                  = sqrt(xdot)
@inbounds V.dᵢⱼ[i]   += dist
q                     = dist/H
@inbounds V.qᵢⱼ[i]   += q
wg                    = αD * 5 * (q - 2)^3 * q / (8H * (q * H + ϵ)) * V.xᵢⱼ[i]
@inbounds V.∇ᵢWᵢⱼ[i] += wg
end
return nothing
end

# Actual Function to Call for Kernel Gradient Calc
Nmax    = V.MaxValidIndex[]
config  = launch_configuration(kernel.fun)

CUDA.@sync begin
end
end

### --------------------------- Benchmarking  ---------------------------
IniX         = rand(SVector{ST,T},NL)
V_CPU        = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=IniX)
V_GPU_Simple = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=CuArray(IniX))
V_GPU_Kernel = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=CuArray(IniX))

if isapprox(V_CPU.∇ᵢWᵢⱼ,Array(V_GPU_Simple.∇ᵢWᵢⱼ), rtol=1e-6) & isapprox(V_CPU.∇ᵢWᵢⱼ,Array(V_GPU_Kernel.∇ᵢWᵢⱼ), rtol=1e-6)
println("All three approaches produce the same result with rtol=1e-6.")
end

println("V_CPU_Bench")
display(V_CPU_Bench)
println("V_GPU_Simple_Bench")
display(V_GPU_Simple_Bench)
println("V_GPU_Kernel_Bench")
display(V_GPU_Kernel_Bench)

``````

It runs through the same calculation on; CPU, GPU using high-level Julia abstractions and a third version using a GPU kernel. The timings of each are shown below:

All three approaches produce the same result with rtol=1e-6.
V_CPU_Bench

``````BenchmarkTools.Trial: 60 samples with 1 evaluation.
Range (min … max):  68.682 ms … 122.078 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     78.529 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   83.543 ms ±  13.425 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

▂ ▂▅▂▅█ ▂      ▅
█▅▅███████▅██▁▅▅███▅▅█▁▁▅▁█▁▁▁▁▅▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▅▁▅▅▁▁▁▁▁▁▅▁▅ ▁
68.7 ms         Histogram: frequency by time          122 ms <

Memory estimate: 3.30 KiB, allocs estimate: 50.
``````

V_GPU_Simple_Bench

``````BenchmarkTools.Trial: 2082 samples with 1 evaluation.
Range (min … max):  1.931 ms … 32.706 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     2.143 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   2.393 ms ±  1.542 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

▆█▆▄▄▂▁
████████▇▇▅▆▄▆▄▄▁▆▄▄▄▁▄▃▄▅▃▃▅▃▁▃▃▃▁▁▅▁▁▃▁▁▁▁▃▁▃▃▁▅▁▁▁▁▁▁▁▃ █
1.93 ms      Histogram: log(frequency) by time     8.67 ms <

Memory estimate: 18.20 KiB, allocs estimate: 246.
``````

V_GPU_Kernel_Bench

``````BenchmarkTools.Trial: 2461 samples with 1 evaluation.
Range (min … max):  1.757 ms …  13.909 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.950 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   2.022 ms ± 483.841 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

▁▄▇▆█▂
▃▃▄▆▇██████▆▄▃▃▃▃▃▄▄▃▄▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▂▂▁▁▁▁▂▁▂ ▃
1.76 ms         Histogram: frequency by time        3.06 ms <

Memory estimate: 5.98 KiB, allocs estimate: 93.
``````

It is clear to me that the custom kernel function;

``````function GPU_GRAD(V,Nmax)
index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
for i = index:stride:Nmax
xdot                  = dot(V.xᵢⱼ[i],V.xᵢⱼ[i])
@inbounds V.xᵢⱼ²[i]  += xdot
dist                  = sqrt(xdot)
@inbounds V.dᵢⱼ[i]   += dist
q                     = dist/H
@inbounds V.qᵢⱼ[i]   += q
wg                    = αD * 5 * (q - 2)^3 * q / (8H * (q * H + ϵ)) * V.xᵢⱼ[i]
@inbounds V.∇ᵢWᵢⱼ[i] += wg
end
return nothing
end
``````

My question now is how, if any way, can I speed up this calculation?

To be a bit frank, I think the code “looks simple” and I’ve seen a lot more complex codes with something about shared memory and other tricks etc. If I wish to save the intermediate results as I do in the code currently, would there be any way to speed this up?

Kind regards

As you’re doing a pure element-wise transformation, shared memory and other tricks are not relevant here. See Training/2-2-kernel_analysis_optimization.ipynb at master · JuliaComputing/Training · GitHub for other optimization advice; you’ll have to use the profiler tools to squeeze the last out of this kernel. You may want to mark your entire `for` look `@inbounds`, and avoid unneeded 64-bit registers to reduce register pressure. Also try playing with different launch configurations, what’s returned by the occupancy API isn’t always optimal for your specific hardware.

2 Likes

Thank you for the hints!

I tested @inbounds in front of the loop and unfortunately no effect (or perhaps nice that the compiler did it it self!)

I tested launch configs and found that I could not find any beating my current launch config

I did not look into profiling on GPU, but all of that is also a bit out of my league now I think. For now I will accept this performance so I can develop the rest of what I want and hopefully gain more skill meanwhile

EDIT: I could not get registers below 29 for GPU_GRAD kernel, only with “maxregs” option set to 8, which turned registers to 16 - but worse performance by 4 times

Kind regards