How to properly pass structs into GPU? (MWE included)

Thank God, I was finally able to do it exactly how I wanted! And this answers my 5th question:

5. Is the way shown here the intended and “state of the art” approach for using structs in GPU programming - if not, what is? I looked at the interpolate example in the docs, and I struggle with grasping it.

Short answer: no. The code below is, in my opinion, much better:

# Packages
using CUDA
using Parameters
using StaticArrays
using Adapt
using Setfield
using LinearAlgebra
using BenchmarkTools

# Constants
const T     = Float32
const ST    = 3
const H     = T(0.5)
const αD    = T((7/(4*π*H^2)))
const ϵ     = T(1e-6)
const NP    = 10^4
const NL    = 10^7

# Main Struct for Gradient Values
@with_kw struct ∇ᵢWᵢⱼStruct{T,ST}
    NL::Base.RefValue{Int}

    # Input for calculation
    xᵢⱼ     ::AbstractVector{SVector{ST,T}} = zeros(SVector{ST,T},NL[])
    xᵢⱼ²    ::AbstractVector{T}             = similar(xᵢⱼ,T,NL[])
    dᵢⱼ     ::AbstractVector{T}             = similar(xᵢⱼ,T,NL[])
    qᵢⱼ     ::AbstractVector{T}             = similar(xᵢⱼ,T,NL[])
    ∇ᵢWᵢⱼ   ::AbstractVector{SVector{ST,T}} = similar(xᵢⱼ,NL[])
end

# Helper struct to convert onto GPU
struct ∇ᵢWᵢⱼDeviceStruct{ST,T}
    xᵢⱼ     ::CuDeviceVector{SVector{ST,T}, 1}
    xᵢⱼ²    ::CuDeviceVector{T, 1}             
    dᵢⱼ     ::CuDeviceVector{T, 1}             
    qᵢⱼ     ::CuDeviceVector{T, 1}             
    ∇ᵢWᵢⱼ   ::CuDeviceVector{SVector{ST,T}, 1} 
end

# Adapt Structure for Conversion to GPU
Adapt.adapt_structure(to, s::∇ᵢWᵢⱼStruct) = ∇ᵢWᵢⱼDeviceStruct(
                                                        adapt(to, s.xᵢⱼ  ),
                                                        adapt(to, s.xᵢⱼ² ),
                                                        adapt(to, s.dᵢⱼ  ),
                                                        adapt(to, s.qᵢⱼ  ),
                                                        adapt(to, s.∇ᵢWᵢⱼ),
                                                        )


# Simple Grad Version
function GradVersion0(V::∇ᵢWᵢⱼStruct)
    V.xᵢⱼ²    .= dot.(V.xᵢⱼ,V.xᵢⱼ)
    V.dᵢⱼ     .= sqrt.(V.xᵢⱼ²)
    V.qᵢⱼ     .= V.dᵢⱼ ./ H

    @. V.∇ᵢWᵢⱼ = αD * 5 * (V.qᵢⱼ - 2)^3 * V.qᵢⱼ / (8H * (V.qᵢⱼ * H + ϵ)) * V.xᵢⱼ

    return nothing
end

### --------------------------- Writing Our Own Custom Kernel  ---------------------------

# Actual Calculation of Gradient in Kernel
function GPU_GRAD(V)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(V.xᵢⱼ)
        xdot                  = dot(V.xᵢⱼ[i], V.xᵢⱼ[i])
        @inbounds V.xᵢⱼ²[i]   = xdot
        dist                  = sqrt(xdot)
        @inbounds V.dᵢⱼ[i]    = dist
        q                     = dist / H
        @inbounds V.qᵢⱼ[i]    = q
        wg                    = αD * 5 * (q - 2)^3 * q / (8H * (q * H + ϵ)) * V.xᵢⱼ[i]
        @inbounds V.∇ᵢWᵢⱼ[i]  = wg
    end
    return nothing
end

# Actual Function to Call for Kernel Gradient Calc
function GradVersion0_Kernel(V)
    kernel = @cuda launch=false GPU_GRAD(V)
    config = launch_configuration(kernel.fun)
    threads = min(length(V.xᵢⱼ), config.threads)
    blocks = cld(length(V.xᵢⱼ), threads)

    CUDA.@sync begin
        kernel(V; threads, blocks)
    end
end

### --------------------------- Benchmarking  ---------------------------
IniX         = rand(SVector{ST,T},NL)
V_CPU        = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=IniX)
V_GPU_Simple = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=CuArray(IniX))
V_GPU_Kernel = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=CuArray(IniX))

@CUDA.time GradVersion0(V_CPU)
@CUDA.time GradVersion0(V_GPU_Simple)
@CUDA.time GradVersion0_Kernel(V_GPU_Kernel)

if isapprox(V_CPU.∇ᵢWᵢⱼ, Array(V_GPU_Simple.∇ᵢWᵢⱼ), rtol=1e-6) && isapprox(V_CPU.∇ᵢWᵢⱼ, Array(V_GPU_Kernel.∇ᵢWᵢⱼ), rtol=1e-6)
    println("All three approaches produce the same result with rtol=1e-6.")
end

V_CPU_Bench          = @benchmark GradVersion0($V_CPU)
# I could not get sensible timings without adding @CUDA.sync here as well, not only inside GradVersion0_Kernel
V_GPU_Simple_Bench   = @benchmark @CUDA.sync GradVersion0($V_GPU_Simple)
V_GPU_Kernel_Bench   = @benchmark @CUDA.sync GradVersion0_Kernel($V_GPU_Kernel)

println("V_CPU_Bench")
display(V_CPU_Bench)
println("V_GPU_Simple_Bench")
display(V_GPU_Simple_Bench)
println("V_GPU_Kernel_Bench")
display(V_GPU_Kernel_Bench)

The code now consists of one immutable struct, ∇ᵢWᵢⱼStruct, and a helper struct, ∇ᵢWᵢⱼDeviceStruct, used to move it onto the GPU with Adapt. This is done through the adapt_structure functionality highlighted in the link I referenced in my first post.
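To make the mechanism concrete, here is a minimal sketch reusing the definitions above (V_tmp and V_dev are just throwaway names for illustration): cudaconvert, which CUDA.jl applies to every argument of an @cuda call, invokes our adapt_structure rule, so the kernel never sees ∇ᵢWᵢⱼStruct, only ∇ᵢWᵢⱼDeviceStruct.

# Sketch: cudaconvert triggers the adapt_structure rule defined above
V_tmp = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(10), xᵢⱼ=CuArray(rand(SVector{ST,T}, 10)))
V_dev = cudaconvert(V_tmp)
typeof(V_dev) <: ∇ᵢWᵢⱼDeviceStruct   # true: the device-side helper struct
isbits(V_dev)                        # true: valid as a kernel argument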

The really important part of ∇ᵢWᵢⱼStruct is that it is now based on AbstractArray fields instead of being tied specifically to CPU or GPU arrays. This allows the structure to be allocated on either CPU or GPU and the exact same code to be run on it, which is really great! That is also what the benchmark example does (a stripped-down sketch of this idea follows the benchmark output below); for 10^7 elements we see:

  0.104691 seconds (26 CPU allocations: 1.172 KiB)
  0.015593 seconds (225 CPU allocations: 17.391 KiB)
  0.013344 seconds (80 CPU allocations: 4.875 KiB)
All three approaches produce the same result with rtol=1e-6.
V_CPU_Bench
BenchmarkTools.Trial: 69 samples with 1 evaluation.
 Range (min … max):  69.430 ms … 105.620 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     72.026 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   73.314 ms ±   4.907 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂█▁ ▁  ▂  
  ▅▁██████▁▆█▅█▃▃▃▅▅▆▁▃▁▅▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  69.4 ms         Histogram: frequency by time         89.4 ms <

 Memory estimate: 1.17 KiB, allocs estimate: 26.
V_GPU_Simple_Bench
BenchmarkTools.Trial: 1863 samples with 1 evaluation.
 Range (min … max):  1.934 ms … 56.218 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.164 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.668 ms ±  2.434 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▅▄▄▄▂
  █████████▇▆▅▅▅▆▄▅▅▆▅▅▄▄▄▄▄▅▅▁▁▅▄▄▄▄▄▅▁▁▁▄▄▄▄▄▁▁▁▁▄▄▁▁▁▁▁▁▄ █
  1.93 ms      Histogram: log(frequency) by time     12.5 ms <

 Memory estimate: 17.27 KiB, allocs estimate: 222.
V_GPU_Kernel_Bench
BenchmarkTools.Trial: 2343 samples with 1 evaluation.
 Range (min … max):  1.732 ms … 83.638 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.969 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.123 ms ±  1.767 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▅██▄     
  ▂▃▄▄▆▇█████▇▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▃▃▃▃▃▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  1.73 ms        Histogram: frequency by time        3.21 ms <

 Memory estimate: 4.84 KiB, allocs estimate: 79.
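
As promised above, here is a stripped-down sketch of that idea, with hypothetical names MiniStruct and scale! (not part of the MWE): because the field is only constrained to be an AbstractVector, the same broadcast code runs on whichever concrete array the instance happens to hold.

# Hypothetical mini example: one struct, one function, two backends
struct MiniStruct
    a::AbstractVector{Float32}
end

scale!(m::MiniStruct, c) = (m.a .*= c; nothing)

m_cpu = MiniStruct(rand(Float32, 10))        # backed by a Vector
m_gpu = MiniStruct(CUDA.rand(Float32, 10))   # backed by a CuArray
scale!(m_cpu, 2f0)                           # runs on the CPU
scale!(m_gpu, 2f0)                           # same code, runs on the GPU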

So this is extremely cool: being able to run on both CPU and GPU with no hassle!

On top of this, the GPU kernel now accepts the full struct with no problem, because:

isbits(cudaconvert(V_GPU_Kernel)) # true
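
For contrast, the host-side struct itself is not isbits, since its CuArray handles and the RefValue are mutable heap objects; that is exactly why the Adapt conversion above is needed before the struct can cross over to the device:

isbits(V_GPU_Kernel) # false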

So that is really awesome; I am so happy.

Thank you to @mcabbott for helping me understand how to do this with an immutable struct.

Thank you to @lodygaw for the initial question and solution for how to do this, in a way I could understand.

So now I get great performance and code that is easy to structure and organise - Julia is so cool right now :slight_smile:
