CUDA.jl for particle tracking simulation

I'm trying to port some consolidated CPU functions for particle tracking simulations (models of synchrotron accelerators) to the GPU using CUDA.jl. The goal is to parallelize the tracking routines.

I tried to start with the simplest function: a drift.

On the CPU, the drift function is defined as:

function drift(pos::Pos{T}, length::Float64) where T
    pnorm::T = 1 / (1 + pos.de)
    norml::T = pnorm * length
    pos.rx += norml * pos.px
    pos.ry += norml * pos.py
    pos.dl += 0.5 * norml * pnorm * (pos.px^2 + pos.py^2)
end 

Where the object Pos (a single electron state) is basically:

mutable struct Pos{T}
    rx::T
    px::T
    ry::T
    py::T
    de::T
    dl::T
end 
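
The Pos(1e-6) constructor used in the CPU benchmark further down is not shown here; a minimal sketch of what it could look like (the actual constructor may differ) is:

# Hypothetical convenience constructor: initialize all six coordinates with the same value
Pos(x::T) where T = Pos{T}(x, x, x, x, x, x)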

My first GPU implementation of the drift function is:

function drift(pos, length)
    # global thread index: one thread per particle (one column of pos)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(pos, 2)
        @inbounds pnorm = 1 / (1 + pos[5, i])
        norml = pnorm * length
        @inbounds pos[1, i] += norml * pos[2, i]
        @inbounds pos[3, i] += norml * pos[4, i]
        @inbounds pos[6, i] += 0.5 * norml * pnorm * (pos[2, i]*pos[2, i] + pos[4, i]*pos[4, i])
    end
    return
end

Where the pos argument is now a CuArray (seen inside the kernel as a CuDeviceMatrix) with dimensions (6, number_of_particles).
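
For a single drift, this kernel can be launched on its own, e.g. (a minimal sketch, assuming particles is a (6, N) CuArray as created further down and 0.1 an arbitrary drift length):

N = size(particles, 2)
# one thread per particle, enough blocks to cover all N particles
CUDA.@sync @cuda threads=256 blocks=cld(N, 256) drift(particles, 0.1)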

Then, I wrote the following function (multiple_drifts) to track the particles along a line of multiple drifts.

function multiple_drifts(pos, lengths)
    for l in lengths
        drift(pos, l)
    end
    return
end

To compare the CPU and GPU functions, I did:

dim::Int = 1 # the number of electrons
gpulengths = CUDA.rand(Float64, 6000*20) # the lengths of the drifts to pass
cpulengths = Array(gpulengths) # the CPU array of the lengths

nthreads = 256;
nblocks = cld(dim, nthreads);

# GPU drift pass
# creation of the particle (all rx, px, ry, py, de, dl = 1e-6) {Float64}
particles = CUDA.fill(1e-6, (6, dim)); 
# the computation
t_gpu = CUDA.@elapsed begin CUDA.@sync @cuda threads=nthreads blocks=nblocks multiple_drifts(particles, gpulengths) end;

#CPU drift pass
# creation of the particle (all rx, px, ry, py, de, dl = 1e-6) {Float64}
particle = Pos(1e-6); # using my constructor
# creation of the accelerator model
model = Accelerator()
# adding the lattice
model.lattice = [Elements.drift(length = l) for l in cpulengths] 
# the computation
t_cpu = @elapsed begin p_final = line_pass(model, particle) end;

The final state of the particles agrees between the two implementations… but the GPU is much slower (35 to 40 times slower). The benchmarks are:

# CPU benchmark
# code line: particle = Pos(1e-6); @benchmark begin line_pass(model, $particle, "end") end
BenchmarkTools.Trial: 1428 samples with 1 evaluation.
 Range (min … max):  2.686 ms … 21.966 ms  β”Š GC (min … max): 0.00% … 65.95%
 Time  (median):     3.275 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.487 ms Β±  2.084 ms  β”Š GC (mean Β± Οƒ):  7.82% Β± 10.94%

  β–ˆ β–ˆ                                                         
  β–ˆβ–…β–ˆβ–„β–„β–ƒβ–‚β–‚β–‚β–‚β–β–β–‚β–β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–‚β–‚β–β–β–‚β–β–‚β–‚β–β–‚β–‚ β–‚
  2.69 ms        Histogram: frequency by time        17.9 ms <

 Memory estimate: 1.96 MiB, allocs estimate: 120009.

# GPU benchmark
# code line: particles = CUDA.fill(1e-6, (6, dim)); @benchmark CUDA.@sync @cuda threads=nthreads blocks=nblocks multiple_drifts($particles, $gpulengths)
BenchmarkTools.Trial: 41 samples with 1 evaluation.
 Range (min … max):  122.164 ms … 124.992 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     123.547 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   123.516 ms Β± 818.585 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ƒ    β–ƒ  β–ƒβ–ƒ β–ƒ                β–ƒ  β–ƒβ–ƒ             β–ˆ                
  β–ˆβ–‡β–β–β–β–ˆβ–β–β–ˆβ–ˆβ–β–ˆβ–β–β–β–‡β–‡β–β–β–β–‡β–‡β–β–‡β–‡β–β–‡β–β–ˆβ–‡β–β–ˆβ–ˆβ–‡β–β–‡β–β–β–‡β–β–β–β–‡β–β–β–‡β–ˆβ–‡β–‡β–β–‡β–‡β–‡β–‡β–β–β–β–‡β–β–β–‡ ▁
  122 ms           Histogram: frequency by time          125 ms <

 Memory estimate: 4.03 KiB, allocs estimate: 72.

I'm really a newbie in CUDA.jl and GPU programming. I looked into other things like shared memory, type stabilization, atomic operations, etc., but I couldn't solve my problem.

Any help or suggestions?

(If you want more details about the CPU implementation, such as the line_pass function or the Accelerator and Element structures, please tell me! I'll be happy to clarify.)

Hi @VitorSouzaLNLS and welcome to this forum! 🙂

This is somewhat expected. You're doing a serial computation (a drift over a number of steps) for just a single particle. That's not something a GPU is good at. However, if you benchmark a drift of N >> 1 independent particles, the comparison will look more favorable for the GPU, since it can do the computations for all of the particles (or rather many; whether it's really all depends on N) in parallel.

Hi @trahflow! Thanks for your reply!

After your reply, I ran a test with 256 * 100 particles (with the GPU function).

Here are the benchmarks, GPU vs. CPU:

>>> CPU (1 Thread)

BenchmarkTools.Trial: 2284 samples with 1 evaluation.
 Range (min … max):  3.222 ms … 30.346 ms  β”Š GC (min … max): 0.00% … 82.53%
 Time  (median):     4.050 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.361 ms Β±  2.246 ms  β”Š GC (mean Β± Οƒ):  6.67% Β± 10.65%

  β–‡β–‡β–†β–ˆβ–…β–„β–ƒβ–‚β–β– ▁                                               ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–„β–β–„β–„β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–…β–β–„β–β–β–β–β–β–„β–ˆ β–ˆ
  3.22 ms      Histogram: log(frequency) by time     19.3 ms <

 Memory estimate: 1.96 MiB, allocs estimate: 120009.

>>> GPU (256 Threads, 100 blocks)

BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  3.033 s …    4.308 s  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.243 s               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.290 s Β± 331.083 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

           β–ˆ                                                  
  β–†β–β–β–β–„β–β–β–β–β–ˆβ–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„ ▁
  3.03 s         Histogram: frequency by time         4.31 s <

 Memory estimate: 4.16 KiB, allocs estimate: 76.

>>> GPU (512 Threads, 50 blocks)

BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  3.250 s …  3.259 s  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.256 s             β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.256 s Β± 2.844 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ β–ˆ                      β–ˆ     β–ˆ β–ˆβ–ˆ  β–ˆ        β–ˆ   β–ˆ  β–ˆβ–ˆβ–ˆ  
  β–ˆβ–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–ˆβ–β–ˆβ–ˆβ–β–β–ˆβ–β–β–β–β–β–β–β–β–ˆβ–β–β–β–ˆβ–β–β–ˆβ–ˆβ–ˆ ▁
  3.25 s        Histogram: frequency by time        3.26 s <

 Memory estimate: 4.14 KiB, allocs estimate: 76.

>>> GPU (768 Threads, 33 blocks)

BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  2.475 s …   2.523 s  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.493 s              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.495 s Β± 18.593 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ƒ                                                       β–ˆ  
  β–ˆβ–‡β–‡β–β–β–β–β–β–β–β–β–β–β–β–β–‡β–β–β–β–‡β–β–β–β–‡β–‡β–‡β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  2.48 s         Histogram: frequency by time        2.52 s <

 Memory estimate: 4.11 KiB, allocs estimate: 75.

So, in fact, if we take the mean time of the GPU runs and simply divide by the number of particles, the per-particle time really does speed up:

(Per particle)

  • mean CPU = 4.361 ms
  • mean GPU (256 threads) = 0.128 ms
  • mean GPU (512 threads) = 0.127 ms
  • mean GPU (768 threads) = 0.097 ms

So, indeed, it's faster!

I still haven't implemented a parallel evaluation of the CPU functions (line_pass or drift). With that, the benchmark comparison would be fairer, right? (Instead of comparing a GPU evaluation with 256*100 particles against a CPU evaluation with 1 particle.)
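
For reference, a minimal sketch of what such a multithreaded CPU evaluation could look like (multiple_drifts_cpu! is a hypothetical helper that reuses the CPU drift function and Pos struct above; Julia must be started with multiple threads, e.g. --threads=8):

# Hypothetical multithreaded CPU version: each particle is tracked
# independently on its own thread
function multiple_drifts_cpu!(particles::Vector{Pos{Float64}}, lengths)
    Threads.@threads for i in eachindex(particles)
        for l in lengths
            drift(particles[i], l)
        end
    end
    return particles
end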

One thing I still don’t understand is that the benchmark with 512 threads didn’t show any gain compared to the one with 256, but the one with 768 did (as I expected).

The launch configuration (512 threads, 50 blocks) means that 50 blocks with 512 threads each are launched.
This does not necessarily mean, however, that only 512 threads run in parallel. AFAIK the block scheduler can decide to run multiple blocks at the same time if enough resources for a full block are free. The CUDA programming guide might have some more details.

To fine-tune the launch configuration you would have to do some profiling, as described here.
A good starting point, however, would be to use the occupancy API. The CUDA.jl docs mention this in the introduction, at the bottom of this section.
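
Roughly, that pattern looks like this (a minimal sketch adapted to the multiple_drifts kernel and arrays from the earlier posts):

# compile the kernel without launching it, then query a suggested launch configuration
kernel = @cuda launch=false multiple_drifts(particles, gpulengths)
config = launch_configuration(kernel.fun)
threads = min(size(particles, 2), config.threads)
blocks = cld(size(particles, 2), threads)
# launch with the derived configuration
CUDA.@sync kernel(particles, gpulengths; threads=threads, blocks=blocks)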

You could also try KernelAbstractions.jl, which allows you to reuse your code for both CPU and GPU.

See this example.
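
For instance, the multiple-drifts loop could be written with KernelAbstractions roughly like this (a minimal, untested sketch; drift_ka! is a hypothetical name, and the API shown is that of recent KernelAbstractions versions):

using KernelAbstractions
using CUDA  # only needed for the GPU backend

# backend-agnostic kernel: one work-item per particle (one column of pos)
@kernel function drift_ka!(pos, lengths)
    i = @index(Global)
    for l in lengths
        pnorm = 1 / (1 + pos[5, i])
        norml = pnorm * l
        pos[1, i] += norml * pos[2, i]
        pos[3, i] += norml * pos[4, i]
        pos[6, i] += 0.5 * norml * pnorm * (pos[2, i]^2 + pos[4, i]^2)
    end
end

# run on the GPU; switch to CPU() and plain Arrays to run the same kernel on the CPU
backend = CUDABackend()
kernel! = drift_ka!(backend)
kernel!(particles, gpulengths; ndrange = size(particles, 2))
KernelAbstractions.synchronize(backend)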
