How to properly pass structs into GPU? (MWE included)

Hello!

Lately I’ve been asking a few questions about how to program on GPU etc. and I’ve gotten some answers (thank you very much!), but of course it is hard to help me if I only provide a few line of codes. So this morning I sat down trying to make a more challenging MWE performing a calculation of very high importance for me.

My aim is to understand how to properly use structs in GPU programming and using them to organize/store data. I found a post a while back here:

Where I understand one of the answers being that the struct has to be immutable. So I decided to test this out, using the full code here:

using CUDA
using Parameters
using StaticArrays
import Adapt
using Setfield
using LinearAlgebra
using BenchmarkTools

const T     = Float32
const ST    = 3
const H     = T(0.5)
const αD    = T((7/(4*π*H^2)))
const ϵ     = T(1e-6)
const NP    = 10^4
const NL    = 10^5 # By lowering NL MStruct wins out right now. By increasing they become equal

@with_kw struct ∇ᵢWᵢⱼStruct{T,ST}
    NL::Int

    # Input for calculation
    xᵢⱼ     ::CuArray{SVector{ST,T}} = zeros(SVector{ST,T},NL)
    xᵢⱼ²    ::CuArray{T}             = zeros(T,NL)
    dᵢⱼ     ::CuArray{T}             = similar(xᵢⱼ²)
    qᵢⱼ     ::CuArray{T}             = similar(xᵢⱼ²)
    ∇ᵢWᵢⱼ   ::CuArray{SVector{ST,T}} = similar(xᵢⱼ)
end

@with_kw mutable struct ∇ᵢWᵢⱼMutableStruct{T,ST}
    NL::Int

    # Input for calculation
    xᵢⱼ     ::CuArray{SVector{ST,T}} = zeros(SVector{ST,T},NL)
    xᵢⱼ²    ::CuArray{T}             = zeros(T,NL)
    dᵢⱼ     ::CuArray{T}             = similar(xᵢⱼ²)
    qᵢⱼ     ::CuArray{T}             = similar(xᵢⱼ²)
    ∇ᵢWᵢⱼ   ::CuArray{SVector{ST,T}} = similar(xᵢⱼ)
end

Adapt.@adapt_structure ∇ᵢWᵢⱼStruct{T,ST}
Adapt.@adapt_structure ∇ᵢWᵢⱼMutableStruct{T,ST}

function GradVersion1(V::∇ᵢWᵢⱼStruct)
    V = @set V.xᵢⱼ²    = dot.(V.xᵢⱼ,V.xᵢⱼ)
    V = @set V.dᵢⱼ     = sqrt.(V.xᵢⱼ²)
    V = @set V.qᵢⱼ     = V.dᵢⱼ ./ H

    V = @set V.∇ᵢWᵢⱼ   = @. αD *5 * (V.qᵢⱼ- 2)^3 * V.qᵢⱼ / (8H * (V.qᵢⱼ * H + ϵ)) * V.xᵢⱼ

    return V
end

function GradVersion2(V::∇ᵢWᵢⱼMutableStruct)
    V.xᵢⱼ²    .= dot.(V.xᵢⱼ,V.xᵢⱼ)
    V.dᵢⱼ     .= sqrt.(V.xᵢⱼ²)
    V.qᵢⱼ     .= V.dᵢⱼ ./ H

    @. V.∇ᵢWᵢⱼ = αD *5 * (V.qᵢⱼ- 2)^3 * V.qᵢⱼ / (8H * (V.qᵢⱼ * H + ϵ)) * V.xᵢⱼ

    return nothing
end

# Testing V1
V1 = ∇ᵢWᵢⱼStruct{T,ST}(NL=NL)
V1 = @set V1.xᵢⱼ = CuArray(rand(SVector{ST,T},NL))
V1 = GradVersion1(V1)

# Testing V2
V2 = ∇ᵢWᵢⱼMutableStruct{T,ST}(NL=NL)
V2.xᵢⱼ .= V1.xᵢⱼ
GradVersion2(V2)

# Verify the same calculation is being performed
if all(V1.∇ᵢWᵢⱼ.≈ V2.∇ᵢWᵢⱼ)
    println("The calculation using both approaches to struct, outputs the same result.")
end

# Now benchmark

# Check allocations
println("|---------------------- Allocations ---------------------------|")
println("Allocated by  Struct: ", @CUDA.allocated GradVersion1(V1))
println("Allocated by MStruct: ", @CUDA.allocated GradVersion2(V2))

println("|---------------------- CUDA Timings ---------------------------|")
print("@CUDA.time by  Struct:")
@CUDA.time GradVersion1(V1)
print("@CUDA.time by MStruct:")
@CUDA.time GradVersion2(V2)

println("|---------------------- BenchmarkTools ---------------------------|")
V1Bench = @benchmark @CUDA.sync V1 = GradVersion1($V1);
V2Bench = @benchmark @CUDA.sync GradVersion2($V2);
println("Benchmark by  Struct:")
display(V1Bench)
println("Benchmark by MStruct:")
display(V2Bench)

Which defines the same exact struct, just in a immutable and mutable fashion. The result of executing this script on my pc is:

The calculation using both approaches to struct, outputs the same result.
|---------------------- Allocations ---------------------------|
Allocated by  Struct: 2400000
Allocated by MStruct: 0
|---------------------- CUDA Timings ---------------------------|
@CUDA.time by  Struct:  0.001189 seconds (297 CPU allocations: 20.391 KiB) (4 GPU allocations: 2.289 MiB, 1.83% memmgmt time)
@CUDA.time by MStruct:  0.001433 seconds (223 CPU allocations: 17.281 KiB)
|---------------------- BenchmarkTools ---------------------------|       
Benchmark by  Struct:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   94.500 μs …  31.684 ms  ┊ GC (min … max): 0.00% … 38.27%
 Time  (median):     101.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   116.445 μs ± 434.508 μs  ┊ GC (mean ± σ):  2.01% ±  0.54%

  ▅█▇▆▆▆▄▄▃▂▁▁      ▂▁▁▁                                        ▂
  ██████████████▇█████████▇█▇▇▇▇▆▆▇▆▅▅▅▅▅▆▅▆▇▆▆▅▅▅▅▅▄▄▅▅▅▅▄▅▄▄▄ █
  94.5 μs       Histogram: log(frequency) by time        240 μs <

 Memory estimate: 16.94 KiB, allocs estimate: 245.
Benchmark by MStruct:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  71.900 μs …  10.271 ms  ┊ GC (min … max): 0.00% … 98.05%
 Time  (median):     80.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.684 μs ± 133.403 μs  ┊ GC (mean ± σ):  1.90% ±  1.38%

  ▆█▇▆▅▅▄▃▃▃▄▄▄▄▄▂▁▁   ▃▃▂▁▁▁    ▁▂▂▂▂▁▁▁                      ▂
  ████████████████████▇███████▇█▇█████████▇█▇▇▇▅▆▅▆▆▆▄▅▅▅▄▄▂▄▅ █
  71.9 μs       Histogram: log(frequency) by time       200 μs <

 Memory estimate: 13.83 KiB, allocs estimate: 171.

And my questions are;

  1. To me it seems using the mutable struct gives cleaner code and faster execution? Am I doing something blatantly wrong?

  2. I’ve have tried some simple kernel programming exercises but I have to admit I am not really good at it. Here I tried to use all fused operators etc. for the fastest performance at a high level. Based on your experience would a hand-written kernel be able to speed this up with a factor 2 at minimum?

  3. At NL = 10^6 and above, timings become almost identical between the two approaches. How come? The first approach using the immutable struct allocates, the mutable one does not, how can they be equally fast?

  4. If there is some obvious way to improve the speed of the function (for example maybe using “sqrt” is not the best option here), please let me know

  5. Is the way shown here the intended and “state of the art” approach for using structs in GPU programming - if not, what is? I looked at the interpolate example in the docs, and I struggle with grasping it.

  6. How come the CPU will allocate ~20 KIB? Is it transferring data already on the GPU? And can I reduce it or is it just something “visual”, and @CUDA.allocate is the truth?

If you cannot answer all of the questions, I highly appreciate your help with even one of them :slight_smile:

Kind regards, thank you for your time

Hello all!

I am happy to share that I found the answer to question number 2:

2. I’ve have tried some simple kernel programming exercises but I have to admit I am not really good at it. Here I tried to use all fused operators etc. for the fastest performance at a high level. Based on your experience would a hand-written kernel be able to speed this up with a factor 2 at minimum?

And the code can be added at the bottom of the previous code snippet to verify:

### --------------------------- Writing Our Own Custom Kernel  ---------------------------

# Actual Calculation
function GPU_GRAD(x,R,D,Q,WG)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    # I had to add in these intermediate variables, else I could
    # get a memory access error (I suppose!) in which it will 
    # lead to Inf in arrays
    for i = index:stride:length(R)
        xdot             = dot(x[i],x[i])
        @inbounds R[i]  += xdot
        dist             = sqrt(xdot)
        @inbounds D[i]  += dist
        q                = dist/H
        @inbounds Q[i]  += q
        wg               = αD * 5 * (q - 2)^3 * q / (8H * (q * H + ϵ)) * x[i]
        @inbounds WG[i] += wg
    end
    return nothing
end

# Helper Function
function GradVersion3(xᵢⱼ,R,D,Q,WG)
    kernel = @cuda launch=false GPU_GRAD(xᵢⱼ,R,D,Q,WG)
    config = launch_configuration(kernel.fun)
    threads = min(length(R), config.threads)
    blocks = cld(length(R), threads)

    CUDA.@sync begin
        kernel(xᵢⱼ,R, D, Q, WG; threads, blocks)
    end
end

# Construct Case
V3 = ∇ᵢWᵢⱼMutableStruct{T,ST}(NL=NL)

V3 = ∇ᵢWᵢⱼMutableStruct{T,ST}(NL=NL)
V3.xᵢⱼ .= CuArray(rand(SVector{ST,T},NL))
GradVersion2(V3)

V4 = ∇ᵢWᵢⱼMutableStruct{T,ST}(NL=NL)
V4.xᵢⱼ .= V3.xᵢⱼ

GradVersion3(V4.xᵢⱼ,V4.xᵢⱼ²,V4.dᵢⱼ,V4.qᵢⱼ,V4.∇ᵢWᵢⱼ)

# Verify the same calculation is being performed
if all(V3.∇ᵢWᵢⱼ.≈ V4.∇ᵢWᵢⱼ)
    println("The calculation with a mutable struct using custom kernel gives the same result")
end

V4Bench = @benchmark GradVersion3($V4.xᵢⱼ,$V4.xᵢⱼ²,$V4.dᵢⱼ,$V4.qᵢⱼ,$V4.∇ᵢWᵢⱼ)
display(V4Bench)

Which outputs:

 Memory estimate: 13.83 KiB, allocs estimate: 171.
The calculation with a mutable struct using custom kernel gives the same result
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  35.500 μs … 233.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.050 μs ±   9.294 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▂▇▇█▇▆▄▄▄▃▄▄▃▃▃▃▂▂▁▁▁ ▂▁▁▁▁                                 ▂
  ████████████████████████████▇██▇█▇▇▆▆▆▅▅▅▅▄▄▅▄▃▃▁▄▄▄▄▁▃▁▃▃▁▄ █
  35.5 μs       Histogram: log(frequency) by time      64.7 μs <

 Memory estimate: 1.50 KiB, allocs estimate: 26.

Which is twice as fast compared to the ~80us from GradVersion2, with fewer CPU memory allocs. And nothing allocated on GPU as expected:

0.001372 seconds (78 CPU allocations: 4.953 KiB)

The code template I used to write this was copy-paste from: Introduction · CUDA.jl (bench_gpu4! and related stuff)

Conclusion: It is best to write your own custom kernel in almost all cases, where you want to combine two operations at a minimum, such as dot and sqrt as far as I can see. This is to avoid the kernel compilation overhead (I think I’ve read some where?)

If anyone sees how to speed this up, feel free to let me know :slight_smile:

The mutability of the structs is irrelevant here. The faster path is mutating CuArrays within (and would work fine with an immutable struct); the slower path with @set is allocating new arrays.

You should probably care about how many GPU operations you perform, this code is 3:

which could be 1, like V.dᵢⱼ .= sqrt.(dot.(V.xᵢⱼ,V.xᵢⱼ)) ./ H. I think that’s why your kernel is faster. For operations taking 30μs, you may find that you’re just timing the number of operations.

Could you explain to me how one would do that? I am not aware of how to change the “CuArrays from within a struct”. I thought immutable meant I had to use a special package like Setfield to do it

EDIT: I think I got it! Basically I was trying to change an immutable struct, but instead I can change the mutable container it holds, in this case the array, using .=.

Link with more info: Julia: How much can we change the objects in immutable struct type? - Stack Overflow


Okay, that is a good take away! In my case I wish to save the intermediate results as well (since I will be using them in other places), but for cases where I don’t need to I will try merging everything into one operation.

Thank you for your answer

Exactly.

OK. You may also want to consider re-computing them later. If you still have its input around, running the same calculation again, fused into a later operation, may be essentially free.

Great!

I even found out that resize! works on the immutable structs mutable container, so I could lift “NL” out of the struct definition which is really neat for me.

Cool, will keep this in mind! I had noticed that if I did two same calculations one after another the second one would be free, but I dont know the term for this or how the compiler would know?

But thank you very much! I now have a version with a immutable struct, of isbits type through the procedure explained in the link I refered to in first post. I will make a short write up later, just so others can see/learn

Kind regards

Thank God, I was finally able to do it exactly how I wanted! And this answers my 5th question:

5. Is the way shown here the intended and “state of the art” approach for using structs in GPU programming - if not, what is? I looked at the interpolate example in the docs, and I struggle with grasping it.

Short answer, no. The code below is in my opinion much better:

# Packages
using CUDA
using Parameters
using StaticArrays
using Adapt
using Setfield
using LinearAlgebra
using BenchmarkTools

# Constants
const T     = Float32
const ST    = 3
const H     = T(0.5)
const αD    = T((7/(4*π*H^2)))
const ϵ     = T(1e-6)
const NP    = 10^4
const NL    = 10^7

# Main Struct for Gradient Values
@with_kw struct ∇ᵢWᵢⱼStruct{T,ST}
    NL::Base.RefValue{Int}

    # Input for calculation
    xᵢⱼ     ::AbstractVector{SVector{ST,T}} = zeros(SVector{ST,T},NL[])
    xᵢⱼ²    ::AbstractVector{T}             = similar(xᵢⱼ,T,NL[])
    dᵢⱼ     ::AbstractVector{T}             = similar(xᵢⱼ,T,NL[])
    qᵢⱼ     ::AbstractVector{T}             = similar(xᵢⱼ,T,NL[])
    ∇ᵢWᵢⱼ   ::AbstractVector{SVector{ST,T}} = similar(xᵢⱼ,NL[])
end

# Helper struct to convert onto GPU
struct ∇ᵢWᵢⱼDeviceStruct{ST,T}
    xᵢⱼ     ::CuDeviceVector{SVector{ST,T}, 1}
    xᵢⱼ²    ::CuDeviceVector{T, 1}             
    dᵢⱼ     ::CuDeviceVector{T, 1}             
    qᵢⱼ     ::CuDeviceVector{T, 1}             
    ∇ᵢWᵢⱼ   ::CuDeviceVector{SVector{ST,T}, 1} 
end

# Adapt Structure for Conversion to GPU
Adapt.adapt_structure(to, s::∇ᵢWᵢⱼStruct) = ∇ᵢWᵢⱼDeviceStruct(
                                                        adapt(to, s.xᵢⱼ  ),
                                                        adapt(to, s.xᵢⱼ² ),
                                                        adapt(to, s.dᵢⱼ  ),
                                                        adapt(to, s.qᵢⱼ  ),
                                                        adapt(to, s.∇ᵢWᵢⱼ),
                                                        )


# Simple Grad Version
function GradVersion0(V::∇ᵢWᵢⱼStruct)
    V.xᵢⱼ²    .= dot.(V.xᵢⱼ,V.xᵢⱼ)
    V.dᵢⱼ     .= sqrt.(V.xᵢⱼ²)
    V.qᵢⱼ     .= V.dᵢⱼ ./ H

    @. V.∇ᵢWᵢⱼ = αD *5 * (V.qᵢⱼ- 2)^3 * V.qᵢⱼ / (8H * (V.qᵢⱼ * H + ϵ)) * V.xᵢⱼ

    return nothing
end

### --------------------------- Writing Our Own Custom Kernel  ---------------------------

# Actual Calculation of Gradient in Kernel
function GPU_GRAD(V)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i = index:stride:length(V.xᵢⱼ)
        xdot                  = dot(V.xᵢⱼ[i],V.xᵢⱼ[i])
        @inbounds V.xᵢⱼ²[i]  += xdot
        dist                  = sqrt(xdot)
        @inbounds V.dᵢⱼ[i]   += dist
        q                     = dist/H
        @inbounds V.qᵢⱼ[i]  += q
        wg                    = αD * 5 * (q - 2)^3 * q / (8H * (q * H + ϵ)) * V.xᵢⱼ[i]
        @inbounds V.∇ᵢWᵢⱼ[i] += wg
    end
    return nothing
end

# Actual Function to Call for Kernel Gradient Calc
function GradVersion0_Kernel(V)
    kernel = @cuda launch=false GPU_GRAD(V)
    config = launch_configuration(kernel.fun)
    threads = min(length(V.xᵢⱼ), config.threads)
    blocks = cld(length(V.xᵢⱼ), threads)

    CUDA.@sync begin
        kernel(V; threads, blocks)
    end
end

### --------------------------- Benchmarking  ---------------------------
IniX         = rand(SVector{ST,T},NL)
V_CPU        = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=IniX)
V_GPU_Simple = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=CuArray(IniX))
V_GPU_Kernel = ∇ᵢWᵢⱼStruct{T,ST}(NL=Ref(NL),xᵢⱼ=CuArray(IniX))

@CUDA.time GradVersion0(V_CPU)
@CUDA.time GradVersion0(V_GPU_Simple)
@CUDA.time GradVersion0_Kernel(V_GPU_Kernel)

if isapprox(V_CPU.∇ᵢWᵢⱼ,Array(V_GPU_Simple.∇ᵢWᵢⱼ), rtol=1e-6) & isapprox(V_CPU.∇ᵢWᵢⱼ,Array(V_GPU_Kernel.∇ᵢWᵢⱼ), rtol=1e-6)
    println("All three approaches produce the same result with rtol=1e-6.")
end

V_CPU_Bench          = @benchmark GradVersion0($V_CPU)
# I could not get sensible timings other than including @CUDA.sync here too, and not only in GradVersion0_Kernel
V_GPU_Simple_Bench   = @benchmark @CUDA.sync GradVersion0($V_GPU_Simple)
V_GPU_Kernel_Bench   = @benchmark @CUDA.sync GradVersion0_Kernel($V_GPU_Kernel)

println("V_CPU_Bench")
display(V_CPU_Bench)
println("V_GPU_Simple_Bench")
display(V_GPU_Simple_Bench)
println("V_GPU_Kernel_Bench")
display(V_GPU_Kernel_Bench)

The code now consists of one immutable struct ∇ᵢWᵢⱼStruct and its helper struct to turn it to the GPU using Adapt, ∇ᵢWᵢⱼDeviceStruct. This is done through the “adapt_structure” functionality highlighted in the link I referenced in my first post.

The really important part of ∇ᵢWᵢⱼStruct is that it is now basedd on AbstractArray instead of being limited to CPU/GPU specifically - this allows for allocating the structure on both CPU and GPU and run the exact same code on it, so that is really great! - Also what the benchmark example does, for 10*10^6 elements we see:

  0.104691 seconds (26 CPU allocations: 1.172 KiB)
  0.015593 seconds (225 CPU allocations: 17.391 KiB)
  0.013344 seconds (80 CPU allocations: 4.875 KiB)
All three approaches produce the same result with rtol=1e-6.
V_CPU_Bench
BenchmarkTools.Trial: 69 samples with 1 evaluation.
 Range (min … max):  69.430 ms … 105.620 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     72.026 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   73.314 ms ±   4.907 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂█▁ ▁  ▂  
  ▅▁██████▁▆█▅█▃▃▃▅▅▆▁▃▁▅▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  69.4 ms         Histogram: frequency by time         89.4 ms <

 Memory estimate: 1.17 KiB, allocs estimate: 26.
V_GPU_Simple_Bench
BenchmarkTools.Trial: 1863 samples with 1 evaluation.
 Range (min … max):  1.934 ms … 56.218 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.164 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.668 ms ±  2.434 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▅▄▄▄▂
  █████████▇▆▅▅▅▆▄▅▅▆▅▅▄▄▄▄▄▅▅▁▁▅▄▄▄▄▄▅▁▁▁▄▄▄▄▄▁▁▁▁▄▄▁▁▁▁▁▁▄ █
  1.93 ms      Histogram: log(frequency) by time     12.5 ms <

 Memory estimate: 17.27 KiB, allocs estimate: 222.
V_GPU_Kernel_Bench
BenchmarkTools.Trial: 2343 samples with 1 evaluation.
 Range (min … max):  1.732 ms … 83.638 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.969 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.123 ms ±  1.767 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▅██▄     
  ▂▃▄▄▆▇█████▇▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▃▃▃▃▃▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  1.73 ms        Histogram: frequency by time        3.21 ms <

 Memory estimate: 4.84 KiB, allocs estimate: 79.

So this is extremely cool, being able to do both CPU/GPU with no hassle!

On top of this now the GPU functions accept the full struct with no problem due to:

isbits(cudaconvert(V_GPU_Kernel)) # true

So that is really awesome, so happy.

Thank you to @mcabbott for helping me understanding how to do this with an immutable struct.

Thank you to @lodygaw for the initial question and solution for how to do this, in a way I could understand.

So now I can get great performance, easy code to structure and organise - Julia is so cool right now :slight_smile:

1 Like