# When to parallelise, when not to?

Hey all,
I’m stuck on understanding how to recognise when code will benefit from parallel processing and when it won’t, and when it’s worth moving to GPUs and when it isn’t.

I have code that, as far as I can tell (I could be wrong), already runs about as fast as it can serially, since it’s mostly in-place operations. I thought it could benefit from parallelization, but `Threads.@threads` doesn’t speed it up, and GPUs don’t seem to be a good option in this case either.

I attach the snippet below and would appreciate your suggestions; pointers on how to learn about this would also be very helpful.

```julia
using LinearAlgebra  # for mul!

function pseudo_G!(G_per_rec, rickshift, pa, ir)
    for ix in 1:pa.nx, iy in 1:pa.ny, iz in 1:pa.nz, iT in 1:pa.nT
        Gg = view(G_per_rec, :, 1, iz, iy, ix, iT)
        circshift!(Gg, pa.rick0, pa.Tshift[iT] + pa.shift[iz, iy, ix, ir])
    end
end

function data_per_rec!(dd, m, G_per_rec, rickshift, pa, ir)
    pseudo_G!(reshape(G_per_rec, (pa.nt, 1, pa.nz, pa.ny, pa.nx, pa.nT)), rickshift, pa, ir)
    mul!(dd, G_per_rec, m)
end

function get_data!(d, m, G_per_rec, rickshift, pa)
    for ir in 1:pa.nr
        data_per_rec!(view(d, :, ir:ir), m, G_per_rec, rickshift, pa, ir)
    end
end

dtr = zeros(nt, nr)
mtr = zeros(nz * ny * nx * nT, 1)

get_data!(dtr, mtr, G_per_rec, rickshift, pa)
```

Using `@time get_data!(...)`, I get
`48.761478 seconds (37.38 k allocations: 2.049 MiB, 0.03% compilation time)`. I can get rid of the memory allocations as well, which come from the `reshape(G_per_rec, ...)` call when passing it to `pseudo_G!` inside `data_per_rec!` (a sketch of hoisting it out of the loop is below, after the struct definition), but that doesn’t really speed things up.

Here `pa` is defined as follows (parameters are passed around in a struct to avoid the performance pitfalls of global variables):

```julia
mutable struct Params
    nt::Int64
    nr::Int64
    nx::Int64
    ny::Int64
    nz::Int64
    nT::Int64
    rick0::Vector{Float16}
    shift::Array{Int64,4}
    Tshift::Vector{Int64}
end

pa = Params(nt, nr, nx, ny, nz, nT, rick0, shift, Tshift)
```
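Concretely, hoisting the `reshape` out of the receiver loop removes those allocations, since each `reshape` call allocates a small wrapper even though it doesn’t copy data. A minimal sketch, using the same names as above:

```julia
# Sketch: reshape once, outside the loop, so no per-receiver allocation remains.
function get_data!(d, m, G_per_rec, rickshift, pa)
    G6 = reshape(G_per_rec, (pa.nt, 1, pa.nz, pa.ny, pa.nx, pa.nT))
    for ir in 1:pa.nr
        pseudo_G!(G6, rickshift, pa, ir)
        mul!(view(d, :, ir:ir), G_per_rec, m)
    end
end
```

As said, this removes the allocations but barely changes the runtime, so the time must be spent elsewhere.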

This is a really big topic, but in general it comes down to the specific task at hand. If you want to take a deeper dive into optimisation, there are some great resources out there.

A few quick tips to help:

• Always benchmark your changes: `@btime` from BenchmarkTools.jl is more reliable than `@time`. It may help to use a smaller input for these tests.
• See how your code scales with input size; this gives you a feel for the point at which the overhead of parallelising becomes worth it.
• Make sure you are running Julia with multiple threads (use `Threads.nthreads()` to check).
• Make sure there are no race conditions in whichever loop you wrap in `Threads.@threads`: does the loop have to run in order, and do iterations write to shared state? (See the sketch after this list.)
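On that last point: in your snippet, each receiver `ir` writes to its own column of `d`, but every iteration reuses the same `G_per_rec` scratch buffer, so threading the loop as-is would be a data race. A minimal sketch of one workaround (the name `get_data_threaded!` and the per-thread buffers are mine, not from your code):

```julia
# Check the thread count first; start Julia with e.g. `julia -t 4` if this prints 1.
Threads.nthreads()

# Hypothetical threaded variant of get_data! with one scratch buffer per thread.
# The :static schedule pins iterations to threads, so indexing by threadid() is safe.
function get_data_threaded!(d, m, G_per_rec, rickshift, pa)
    bufs = [copy(G_per_rec) for _ in 1:Threads.nthreads()]
    Threads.@threads :static for ir in 1:pa.nr
        data_per_rec!(view(d, :, ir:ir), m, bufs[Threads.threadid()], rickshift, pa, ir)
    end
end
```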

I have a (not so clean) repository with some examples of progressively parallelising code and measuring the performance, on a simple problem.

It contains a few figures showing the performance differences. You only really need to look at `main.jl`, which demonstrates threading and even GPU parallelisation. It’s not the best, but it has a few examples.

I see two ways: either you know theoretically that your algorithm is (embarrassingly?) parallel, or you analyse and experiment (which you seem to be doing).

You could check both. I usually profile the serial code first, to make sure there is no unintended bottleneck (w.r.t. allocations and dynamic dispatch).
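For example, a minimal sketch using the `Profile` standard library (ProfileView.jl’s `@profview` shows the same data as a flame graph):

```julia
using Profile

get_data!(dtr, mtr, G_per_rec, rickshift, pa)    # run once first to compile
Profile.clear()
@profile get_data!(dtr, mtr, G_per_rec, rickshift, pa)
Profile.print(format = :flat, sortedby = :count) # where does the time go?
```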

I can’t reproduce this, because the concrete parameter values are missing. In any case I’d recommend using `@btime` from BenchmarkTools.jl, because `@time` can include compilation time.
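Something along these lines, interpolating the arguments with `$` so that global-variable access doesn’t distort the measurement:

```julia
using BenchmarkTools

@btime get_data!($dtr, $mtr, $G_per_rec, $rickshift, $pa)
```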

From the looks of it I’d expect one or two hotspots, `pseudo_G!` and/or `mul!`. I’d suspect the former could benefit from LoopVectorization.jl and maybe parallelisation, while the latter should dispatch to a BLAS call (did you consider using MKL?), which BLAS parallelises itself.
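For the `mul!` side, a quick check along these lines (standard LinearAlgebra calls; MKL.jl swaps the BLAS backend when loaded):

```julia
using LinearAlgebra

BLAS.get_num_threads()     # how many threads BLAS uses for mul!
# BLAS.set_num_threads(8)  # adjust if it doesn't match your core count

# using MKL                # loading MKL.jl replaces OpenBLAS with Intel MKL
```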

If LoopVectorization.jl doesn’t help, Polyester.jl could still be a better fit for this kernel than `Threads.@threads`.
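A sketch of what that could look like (the name `pseudo_G_batched!` is mine; it threads `pseudo_G!` over the outermost index with Polyester’s `@batch`, which is safe here because each `ix` slice of `G_per_rec` is disjoint):

```julia
using Polyester

# Hypothetical @batch variant of pseudo_G!; @batch uses lightweight per-thread
# tasks with lower overhead than Threads.@threads.
function pseudo_G_batched!(G_per_rec, rickshift, pa, ir)
    @batch for ix in 1:pa.nx
        for iy in 1:pa.ny, iz in 1:pa.nz, iT in 1:pa.nT
            Gg = view(G_per_rec, :, 1, iz, iy, ix, iT)
            circshift!(Gg, pa.rick0, pa.Tshift[iT] + pa.shift[iz, iy, ix, ir])
        end
    end
end
```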

As for GPU optimisations, others will have to chime in.