Thank God, I was finally able to do it exactly how I wanted! And this answers my 5th question:
> 5. Is the way shown here the intended and “state of the art” approach for using structs in GPU programming - if not, what is? I looked at the interpolate example in the docs, and I struggle with grasping it.
Short answer: no. The code below is, in my opinion, much better:
```julia
# Packages
using CUDA
using Parameters
using StaticArrays
using Adapt
using Setfield
using LinearAlgebra
using BenchmarkTools
# Constants
const T = Float32
const ST = 3
const H = T(0.5)
const αD = T((7/(4*π*H^2)))
const ϵ = T(1e-6)
const NP = 10^4
const NL = 10^7
# Main Struct for Gradient Values
@with_kw struct ∇ᵢWᵢⱼStruct{T,ST}
    NL::Base.RefValue{Int}
    # Input for calculation
    xᵢⱼ   ::AbstractVector{SVector{ST,T}} = zeros(SVector{ST,T}, NL[])
    xᵢⱼ²  ::AbstractVector{T}             = similar(xᵢⱼ, T, NL[])
    dᵢⱼ   ::AbstractVector{T}             = similar(xᵢⱼ, T, NL[])
    qᵢⱼ   ::AbstractVector{T}             = similar(xᵢⱼ, T, NL[])
    ∇ᵢWᵢⱼ ::AbstractVector{SVector{ST,T}} = similar(xᵢⱼ, NL[])
end
# Helper struct to convert onto GPU
struct ∇ᵢWᵢⱼDeviceStruct{ST,T}
    xᵢⱼ   ::CuDeviceVector{SVector{ST,T}, 1}
    xᵢⱼ²  ::CuDeviceVector{T, 1}
    dᵢⱼ   ::CuDeviceVector{T, 1}
    qᵢⱼ   ::CuDeviceVector{T, 1}
    ∇ᵢWᵢⱼ ::CuDeviceVector{SVector{ST,T}, 1}
end
# Adapt Structure for Conversion to GPU
Adapt.adapt_structure(to, s::∇ᵢWᵢⱼStruct) = ∇ᵢWᵢⱼDeviceStruct(
    adapt(to, s.xᵢⱼ  ),
    adapt(to, s.xᵢⱼ² ),
    adapt(to, s.dᵢⱼ  ),
    adapt(to, s.qᵢⱼ  ),
    adapt(to, s.∇ᵢWᵢⱼ),
)
# Simple Grad Version
function GradVersion0(V::∇ᵢWᵢⱼStruct)
    V.xᵢⱼ² .= dot.(V.xᵢⱼ, V.xᵢⱼ)
    V.dᵢⱼ  .= sqrt.(V.xᵢⱼ²)
    V.qᵢⱼ  .= V.dᵢⱼ ./ H

    @. V.∇ᵢWᵢⱼ = αD * 5 * (V.qᵢⱼ - 2)^3 * V.qᵢⱼ / (8H * (V.qᵢⱼ * H + ϵ)) * V.xᵢⱼ

    return nothing
end
### --------------------------- Writing Our Own Custom Kernel ---------------------------
# Actual Calculation of Gradient in Kernel
function GPU_GRAD(V)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(V.xᵢⱼ)
        xdot = dot(V.xᵢⱼ[i], V.xᵢⱼ[i])
        @inbounds V.xᵢⱼ²[i] = xdot
        dist = sqrt(xdot)
        @inbounds V.dᵢⱼ[i] = dist
        q = dist / H
        @inbounds V.qᵢⱼ[i] = q
        wg = αD * 5 * (q - 2)^3 * q / (8H * (q * H + ϵ)) * V.xᵢⱼ[i]
        @inbounds V.∇ᵢWᵢⱼ[i] = wg
    end
    return nothing
end
# Actual Function to Call for Kernel Gradient Calc
function GradVersion0_Kernel(V)
    kernel  = @cuda launch=false GPU_GRAD(V)
    config  = launch_configuration(kernel.fun)
    threads = min(length(V.xᵢⱼ), config.threads)
    blocks  = cld(length(V.xᵢⱼ), threads)

    CUDA.@sync begin
        kernel(V; threads, blocks)
    end
end
### --------------------------- Benchmarking ---------------------------
IniX = rand(SVector{ST,T},NL)
V_CPU = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=IniX)
V_GPU_Simple = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=CuArray(IniX))
V_GPU_Kernel = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=CuArray(IniX))
@CUDA.time GradVersion0(V_CPU)
@CUDA.time GradVersion0(V_GPU_Simple)
@CUDA.time GradVersion0_Kernel(V_GPU_Kernel)
if isapprox(V_CPU.∇ᵢWᵢⱼ,Array(V_GPU_Simple.∇ᵢWᵢⱼ), rtol=1e-6) & isapprox(V_CPU.∇ᵢWᵢⱼ,Array(V_GPU_Kernel.∇ᵢWᵢⱼ), rtol=1e-6)
println("All three approaches produce the same result with rtol=1e-6.")
end
V_CPU_Bench = @benchmark GradVersion0($V_CPU)
# I could not get sensible timings without adding @CUDA.sync here as well, not only inside GradVersion0_Kernel
V_GPU_Simple_Bench = @benchmark @CUDA.sync GradVersion0($V_GPU_Simple)
V_GPU_Kernel_Bench = @benchmark @CUDA.sync GradVersion0_Kernel($V_GPU_Kernel)
println("V_CPU_Bench")
display(V_CPU_Bench)
println("V_GPU_Simple_Bench")
display(V_GPU_Simple_Bench)
println("V_GPU_Kernel_Bench")
display(V_GPU_Kernel_Bench)
```
The code now consists of one immutable struct, ∇ᵢWᵢⱼStruct, and its helper struct for moving it onto the GPU using Adapt, ∇ᵢWᵢⱼDeviceStruct. This is done through the “adapt_structure” functionality highlighted in the link I referenced in my first post.
The really important part of ∇ᵢWᵢⱼStruct is that it is now based on AbstractVector instead of being tied to a specific CPU or GPU array type - this allows allocating the structure on both CPU and GPU and running the exact same code on it, which is really great! That is also exactly what the benchmark example does; for 10^7 elements we see:
```
0.104691 seconds (26 CPU allocations: 1.172 KiB)
0.015593 seconds (225 CPU allocations: 17.391 KiB)
0.013344 seconds (80 CPU allocations: 4.875 KiB)
All three approaches produce the same result with rtol=1e-6.
V_CPU_Bench
BenchmarkTools.Trial: 69 samples with 1 evaluation.
Range (min … max): 69.430 ms … 105.620 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 72.026 ms ┊ GC (median): 0.00%
Time (mean ± σ): 73.314 ms ± 4.907 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▂█▁ ▁ ▂
▅▁██████▁▆█▅█▃▃▃▅▅▆▁▃▁▅▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
69.4 ms Histogram: frequency by time 89.4 ms <
Memory estimate: 1.17 KiB, allocs estimate: 26.
V_GPU_Simple_Bench
BenchmarkTools.Trial: 1863 samples with 1 evaluation.
Range (min … max): 1.934 ms … 56.218 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.164 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.668 ms ± 2.434 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
██▅▄▄▄▂
█████████▇▆▅▅▅▆▄▅▅▆▅▅▄▄▄▄▄▅▅▁▁▅▄▄▄▄▄▅▁▁▁▄▄▄▄▄▁▁▁▁▄▄▁▁▁▁▁▁▄ █
1.93 ms Histogram: log(frequency) by time 12.5 ms <
Memory estimate: 17.27 KiB, allocs estimate: 222.
V_GPU_Kernel_Bench
BenchmarkTools.Trial: 2343 samples with 1 evaluation.
Range (min … max): 1.732 ms … 83.638 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.969 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.123 ms ± 1.767 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▅██▄
▂▃▄▄▆▇█████▇▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▃▃▃▃▃▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
1.73 ms Histogram: frequency by time 3.21 ms <
Memory estimate: 4.84 KiB, allocs estimate: 79.
```
So this is extremely cool - being able to run on both CPU and GPU with no hassle!
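To illustrate that point, here is a minimal sketch (the `run_on` helper is my own hypothetical name, not part of the benchmark above; it assumes the constants, struct and GradVersion0 from the listing have already been defined): the same call works whether the struct is backed by Array or CuArray, because the fields are only declared as AbstractVector.

```julia
# Minimal sketch - `run_on` is a hypothetical helper, assuming the definitions above are loaded.
function run_on(ArrayType, n)
    x = ArrayType(rand(SVector{ST,T}, n))          # positions on the chosen backend
    V = ∇ᵢWᵢⱼStruct{T,ST}(NL = Ref(n), xᵢⱼ = x)    # remaining fields are allocated via `similar(xᵢⱼ, ...)`
    GradVersion0(V)                                # identical call for CPU and GPU
    return V
end

V_cpu = run_on(Array,   10^3)   # plain CPU arrays
V_gpu = run_on(CuArray, 10^3)   # CUDA arrays, same code path
```

The only thing that changes between the two calls is the array constructor; everything downstream, including the broadcasted gradient computation, stays the same.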
On top of this, the GPU kernel now accepts the full struct with no problem, since `isbits(cudaconvert(V_GPU_Kernel))` returns `true`.
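A quick way to check this yourself (a small sketch, assuming the definitions and V_GPU_Kernel from above are in scope):

```julia
# `cudaconvert` applies the Adapt rule defined above, so the struct the kernel
# actually receives is the isbits device-side version.
V_dev = cudaconvert(V_GPU_Kernel)
typeof(V_dev)    # ∇ᵢWᵢⱼDeviceStruct{3, Float32}
isbits(V_dev)    # true, so it can be passed by value to the GPU kernel
```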
So that is really awesome, so happy.
Thank you to @mcabbott for helping me understand how to do this with an immutable struct.
Thank you to @lodygaw for the initial question and solution for how to do this, in a way I could understand.
So now I get great performance and code that is easy to structure and organise - Julia is so cool right now!