I'm trying to port some consolidated CPU functions for particle tracking simulations (models of synchrotron accelerators) to the GPU using CUDA.jl. The goal is to parallelize the tracking routines.
I decided to start with the simplest function: a drift.
On the CPU, the drift function is defined as:
function drift(pos::Pos{T}, length::Float64) where T
    pnorm::T = 1 / (1 + pos.de)
    norml::T = pnorm * length
    pos.rx += norml * pos.px
    pos.ry += norml * pos.py
    pos.dl += 0.5 * norml * pnorm * (pos.px^2 + pos.py^2)
end
Here, the Pos object (a single electron state) is basically:
mutable struct Pos{T}
    rx::T
    px::T
    ry::T
    py::T
    de::T
    dl::T
end
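I also have a convenience constructor that fills all six coordinates with the same value; a minimal sketch of it (the real one may differ slightly) and of a single CPU drift call:

# Sketch of the convenience constructor used later as Pos(1e-6);
# it simply sets all six coordinates to the same value.
Pos(v::T) where {T} = Pos{T}(v, v, v, v, v, v)

# Example: one CPU drift of length 1.0, applied in place.
p = Pos(1e-6)
drift(p, 1.0)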
My first GPU implementation of the drift function is:
function drift(pos, length)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(pos)[2]
        @inbounds pnorm = 1 / (1 + pos[5, i])
        norml = pnorm * length
        @inbounds pos[1, i] += norml * pos[2, i]
        @inbounds pos[3, i] += norml * pos[4, i]
        @inbounds pos[6, i] += 0.5 * norml * pnorm * (pos[2, i]*pos[2, i] + pos[4, i]*pos[4, i])
    end
    return
end
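On its own, this kernel can be launched for a single drift pass like this (a minimal sketch with one block of 256 threads and a hard-coded drift length of 1.0):

using CUDA

# Sketch: one launch of the single-drift kernel on a (6, N) CuArray.
particles = CUDA.fill(1e-6, (6, 1))
CUDA.@sync @cuda threads=256 blocks=1 drift(particles, 1.0)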
The pos argument is now a CuArray (seen inside the kernel as a CuDeviceArray) with dimensions (6, number_of_particles).
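The row order follows the Pos fields, which is what the indices in the kernel refer to:

# Row layout of the (6, N) particle array: each column is one particle.
# row 1 = rx, row 2 = px, row 3 = ry, row 4 = py, row 5 = de, row 6 = dl
particles = CUDA.fill(1e-6, (6, 4))   # e.g. 4 particles, all coordinates 1e-6
size(particles)                       # (6, 4)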
Then I wrote the following function (multiple_drifts) for tracking the particles along a line of multiple drifts:
function multiple_drifts(pos, lengths)
    for l in lengths
        drift(pos, l)
    end
    return
end
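On the CPU side, the equivalent of this loop is what line_pass does over the lattice; conceptually it is something like the following (a simplified drift-only sketch, not my actual line_pass code):

# Simplified sketch of the CPU path for a drift-only lattice; the real
# line_pass works on the Accelerator/Element structures instead.
function multiple_drifts_cpu!(p::Pos, lengths::Vector{Float64})
    for l in lengths
        drift(p, l)   # the CPU drift defined above, mutating p in place
    end
    return p
end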
To compare the CPU and GPU functions, I did:
dim::Int = 1                              # the number of electrons
gpulengths = CUDA.rand(Float64, 6000*20)  # the lengths of the drifts to pass
cpulengths = Array(gpulengths)            # the CPU copy of the lengths
nthreads = 256;
nblocks = cld(dim, nthreads);             # = Int(ceil(dim/nthreads))

# GPU drift pass
# creation of the particles (all rx, px, ry, py, de, dl = 1e-6) {Float64}
particles = CUDA.fill(1e-6, (6, dim));
# the computation
t_gpu = CUDA.@elapsed begin CUDA.@sync @cuda threads=nthreads blocks=nblocks multiple_drifts(particles, gpulengths) end;

# CPU drift pass
# creation of the particle (all rx, px, ry, py, de, dl = 1e-6) {Float64}
particle = Pos(1e-6);                     # using my constructor
# creation of the accelerator model
model = Accelerator()
# adding the lattice
model.lattice = [Elements.drift(length = l) for l in cpulengths]
# the computation
t_cpu = @elapsed begin p_final = line_pass(model, particle) end;
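To check that both paths end in the same state, I compare the GPU column with the CPU result (a sketch; it assumes line_pass returns a single Pos{Float64} here):

# Sketch of the correctness check between the two paths.
gpu_final = Array(particles)[:, 1]
cpu_final = [p_final.rx, p_final.px, p_final.ry, p_final.py, p_final.de, p_final.dl]
@assert gpu_final ≈ cpu_final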
The final state of the particles is indeed the same in both cases... but the GPU version is much slower (35 to 40 times slower). The benchmarks are:
# CPU benchmark
# code line: p = Pos(1e-6); @benchmark begin line_pass(model, $p, "end") end
BenchmarkTools.Trial: 1428 samples with 1 evaluation.
 Range (min … max):  2.686 ms … 21.966 ms  ┊ GC (min … max): 0.00% … 65.95%
 Time  (median):     3.275 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.487 ms ±  2.084 ms  ┊ GC (mean ± σ):  7.82% ± 10.94%

  2.69 ms        Histogram: frequency by time        17.9 ms <

 Memory estimate: 1.96 MiB, allocs estimate: 120009.
# GPU benchmark
# code line: particles = CUDA.fill(1e-6, (6, dim)); @benchmark CUDA.@sync @cuda threads=nthreads blocks=nblocks multiple_drifts($particles, $gpulengths)
BenchmarkTools.Trial: 41 samples with 1 evaluation.
 Range (min … max):  122.164 ms … 124.992 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     123.547 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   123.516 ms ± 818.585 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  122 ms         Histogram: frequency by time         125 ms <

 Memory estimate: 4.03 KiB, allocs estimate: 72.
I am a real newbie in CUDA.jl and GPU programming. I have looked into other things like shared memory, type stability, atomic operations, etc., but I couldn't solve my problem.
Any help or suggestion?
(If you want more details about the CPU implementation, like the line_pass function or the Accelerator and Element structures, please tell me! I'll be happy to clarify.)