Hello!
Lately I’ve been asking a few questions about how to program on the GPU etc., and I’ve gotten some answers (thank you very much!), but of course it is hard to help me if I only provide a few lines of code. So this morning I sat down and tried to make a more challenging MWE, performing a calculation of very high importance to me.
My aim is to understand how to properly use structs in GPU programming and how to use them to organize/store data. I found a post here a while back:
where I understood one of the answers to be that the struct has to be immutable. So I decided to test this out, using the full code here:
using CUDA
using Parameters
using StaticArrays
import Adapt
using Setfield
using LinearAlgebra
using BenchmarkTools
const T = Float32
const ST = 3
const H = T(0.5)
const αD = T((7/(4*π*H^2)))
const ϵ = T(1e-6)
const NP = 10^4
const NL = 10^5 # Lowering NL makes MStruct win out right now; increasing it makes the two approaches equal
@with_kw struct ∇ᵢWᵢⱼStruct{T,ST}
    NL::Int
    # Input for calculation
    xᵢⱼ  ::CuArray{SVector{ST,T}} = zeros(SVector{ST,T},NL)
    xᵢⱼ² ::CuArray{T}             = zeros(T,NL)
    dᵢⱼ  ::CuArray{T}             = similar(xᵢⱼ²)
    qᵢⱼ  ::CuArray{T}             = similar(xᵢⱼ²)
    ∇ᵢWᵢⱼ::CuArray{SVector{ST,T}} = similar(xᵢⱼ)
end
@with_kw mutable struct ∇ᵢWᵢⱼMutableStruct{T,ST}
    NL::Int
    # Input for calculation
    xᵢⱼ  ::CuArray{SVector{ST,T}} = zeros(SVector{ST,T},NL)
    xᵢⱼ² ::CuArray{T}             = zeros(T,NL)
    dᵢⱼ  ::CuArray{T}             = similar(xᵢⱼ²)
    qᵢⱼ  ::CuArray{T}             = similar(xᵢⱼ²)
    ∇ᵢWᵢⱼ::CuArray{SVector{ST,T}} = similar(xᵢⱼ)
end
Adapt.@adapt_structure ∇ᵢWᵢⱼStruct
Adapt.@adapt_structure ∇ᵢWᵢⱼMutableStruct
function GradVersion1(V::∇ᵢWᵢⱼStruct)
    V = @set V.xᵢⱼ²  = dot.(V.xᵢⱼ, V.xᵢⱼ)
    V = @set V.dᵢⱼ   = sqrt.(V.xᵢⱼ²)
    V = @set V.qᵢⱼ   = V.dᵢⱼ ./ H
    V = @set V.∇ᵢWᵢⱼ = @. αD * 5 * (V.qᵢⱼ - 2)^3 * V.qᵢⱼ / (8H * (V.qᵢⱼ * H + ϵ)) * V.xᵢⱼ
    return V
end
function GradVersion2(V::∇ᵢWᵢⱼMutableStruct)
    V.xᵢⱼ² .= dot.(V.xᵢⱼ, V.xᵢⱼ)
    V.dᵢⱼ  .= sqrt.(V.xᵢⱼ²)
    V.qᵢⱼ  .= V.dᵢⱼ ./ H
    @. V.∇ᵢWᵢⱼ = αD * 5 * (V.qᵢⱼ - 2)^3 * V.qᵢⱼ / (8H * (V.qᵢⱼ * H + ϵ)) * V.xᵢⱼ
    return nothing
end
# Testing V1
V1 = ∇ᵢWᵢⱼStruct{T,ST}(NL=NL)
V1 = @set V1.xᵢⱼ = CuArray(rand(SVector{ST,T},NL))
V1 = GradVersion1(V1)
# Testing V2
V2 = ∇ᵢWᵢⱼMutableStruct{T,ST}(NL=NL)
V2.xᵢⱼ .= V1.xᵢⱼ
GradVersion2(V2)
# Verify the same calculation is being performed
if all(V1.∇ᵢWᵢⱼ .≈ V2.∇ᵢWᵢⱼ)
println("The calculation using both approaches to struct, outputs the same result.")
end
# Now benchmark
# Check allocations
println("|---------------------- Allocations ---------------------------|")
println("Allocated by Struct: ", @CUDA.allocated GradVersion1(V1))
println("Allocated by MStruct: ", @CUDA.allocated GradVersion2(V2))
println("|---------------------- CUDA Timings ---------------------------|")
print("@CUDA.time by Struct:")
@CUDA.time GradVersion1(V1)
print("@CUDA.time by MStruct:")
@CUDA.time GradVersion2(V2)
println("|---------------------- BenchmarkTools ---------------------------|")
V1Bench = @benchmark @CUDA.sync V1 = GradVersion1($V1);
V2Bench = @benchmark @CUDA.sync GradVersion2($V2);
println("Benchmark by Struct:")
display(V1Bench)
println("Benchmark by MStruct:")
display(V2Bench)
This defines the exact same struct, just in an immutable and a mutable fashion. The result of executing this script on my PC is:
The calculation using both approaches to struct, outputs the same result.
|---------------------- Allocations ---------------------------|
Allocated by Struct: 2400000
Allocated by MStruct: 0
|---------------------- CUDA Timings ---------------------------|
@CUDA.time by Struct: 0.001189 seconds (297 CPU allocations: 20.391 KiB) (4 GPU allocations: 2.289 MiB, 1.83% memmgmt time)
@CUDA.time by MStruct: 0.001433 seconds (223 CPU allocations: 17.281 KiB)
|---------------------- BenchmarkTools ---------------------------|
Benchmark by Struct:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 94.500 μs … 31.684 ms ┊ GC (min … max): 0.00% … 38.27%
Time (median): 101.600 μs ┊ GC (median): 0.00%
Time (mean ± σ): 116.445 μs ± 434.508 μs ┊ GC (mean ± σ): 2.01% ± 0.54%
▅█▇▆▆▆▄▄▃▂▁▁ ▂▁▁▁ ▂
██████████████▇█████████▇█▇▇▇▇▆▆▇▆▅▅▅▅▅▆▅▆▇▆▆▅▅▅▅▅▄▄▅▅▅▅▄▅▄▄▄ █
94.5 μs Histogram: log(frequency) by time 240 μs <
Memory estimate: 16.94 KiB, allocs estimate: 245.
Benchmark by MStruct:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 71.900 μs … 10.271 ms ┊ GC (min … max): 0.00% … 98.05%
Time (median): 80.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 94.684 μs ± 133.403 μs ┊ GC (mean ± σ): 1.90% ± 1.38%
▆█▇▆▅▅▄▃▃▃▄▄▄▄▄▂▁▁ ▃▃▂▁▁▁ ▁▂▂▂▂▁▁▁ ▂
████████████████████▇███████▇█▇█████████▇█▇▇▇▅▆▅▆▆▆▄▅▅▅▄▄▂▄▅ █
71.9 μs Histogram: log(frequency) by time 200 μs <
Memory estimate: 13.83 KiB, allocs estimate: 171.
And my questions are:
- To me it seems using the mutable struct gives cleaner code and faster execution? Am I doing something blatantly wrong?
- I have tried some simple kernel-programming exercises, but I have to admit I am not really good at it. Here I tried to use fused broadcast operations etc. for the fastest performance at a high level. Based on your experience, would a hand-written kernel be able to speed this up by at least a factor of 2?
- At NL = 10^6 and above, the timings become almost identical between the two approaches. How come? The first approach using the immutable struct allocates while the mutable one does not, so how can they be equally fast?
- If there is some obvious way to improve the speed of the function (for example, maybe `sqrt` is not the best option here), please let me know.
- Is the way shown here the intended, “state of the art” approach for using structs in GPU programming? If not, what is? I looked at the interpolate example in the docs, and I struggle with grasping it.
- How come the CPU allocates ~20 KiB? Is it transferring data that is already on the GPU? And can I reduce it, or is it just something “visual”, with `@CUDA.allocated` being the truth?
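For context on the kernel question: this is roughly the kind of hand-written kernel I had in mind, fusing all four broadcasts into one launch. It is an untested sketch; the kernel name and the launch configuration (threads per block) are my own guesses, and it assumes the same globals (H, αD, ϵ, NL) and struct fields as in the script above.

```julia
# Untested sketch: one fused kernel replacing the four broadcasts in GradVersion2.
function ∇ᵢWᵢⱼ_kernel!(∇ᵢWᵢⱼ, xᵢⱼ, H, αD, ϵ)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(xᵢⱼ)
        x = xᵢⱼ[i]
        d = sqrt(dot(x, x))   # dᵢⱼ, computed per element instead of in a temporary array
        q = d / H             # qᵢⱼ
        ∇ᵢWᵢⱼ[i] = αD * 5 * (q - 2)^3 * q / (8H * (q * H + ϵ)) * x
    end
    return nothing
end

# Launch (threads per block is a guess; would need tuning):
threads = 256
blocks  = cld(NL, threads)
@cuda threads=threads blocks=blocks ∇ᵢWᵢⱼ_kernel!(V2.∇ᵢWᵢⱼ, V2.xᵢⱼ, H, αD, ϵ)
```

My (possibly naive) thinking is that this avoids materializing the intermediate arrays dᵢⱼ and qᵢⱼ entirely, so I would love to know whether that is where the factor-2 could come from.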
If you cannot answer all of the questions, I highly appreciate your help with even one of them.
Kind regards, thank you for your time