When to parallelise, when not to?

Hey all,
I’m stuck on how to recognise when code will benefit from parallel processing and when it won’t, and likewise when it’s worth moving to GPUs and when it isn’t.

I have some code that I believe runs about as fast as it can serially (that’s what I think, I could be wrong), since it’s mostly in-place operations. I thought it could benefit from parallelisation, but using Threads.@threads doesn’t speed it up, and in fact GPUs also don’t seem to be a good option in this case.

I attach the snippet below and would appreciate your suggestions; a little guidance on how to learn about these things would also be very helpful.

function pseudo_G!(G_per_rec, rickshift, pa, ir)
    for ix in 1:pa.nx, iy in 1:pa.ny, iz in 1:pa.nz, iT in 1:pa.nT
        Gg = view(G_per_rec, :, 1, iz, iy, ix, iT)
        circshift!(Gg, pa.rick0, pa.Tshift[iT] + pa.shift[iz, iy, ix, ir])
    end
end

function data_per_rec!(dd, m, G_per_rec, rickshift, pa, ir)
    pseudo_G!(reshape(G_per_rec, (pa.nt,1,pa.nz, pa.ny, pa.nx, pa.nT)), rickshift, pa, ir)
    mul!(dd, G_per_rec, m);
end

function get_data!(d, m, G_per_rec, rickshift, pa)
    for ir in 1:pa.nr
        # dd= view(d, :, ir:ir)
        data_per_rec!(view(d, :, ir:ir), m, G_per_rec, rickshift, pa, ir)
    end
end

dtr= zeros(nt,nr)
mtr= zeros(nz*ny*nx*nT,1);

get_data!(dtr, mtr, G_per_rec, rickshift, pa);

Using @time get_data!(...), I get

48.761478 seconds (37.38 k allocations: 2.049 MiB, 0.03% compilation time)

I can get rid of the memory allocations, which come from the reshape(G_per_rec, ...) call when passing it to pseudo_G!(...) inside data_per_rec!, but removing them doesn’t really speed things up.

where pa is (to avoid bottlenecks because of global variables)

mutable struct Params
    nt::Int64
    nr::Int64
    nx::Int64
    ny::Int64
    nz::Int64 
    nT::Int64 
    rick0::Vector{Float16} 
    shift::Array{Int64,4}
    Tshift::Vector{Int64}
end

pa= Params(nt, nr, nx, ny, nz, nT, rick0, shift, Tshift);

Thanks in advance!

This is a really big topic, but in general it comes down to the specific task at hand. If you want to take a deeper dive into optimisation, there are some great resources out there.

A few quick tips to help:

  • Always benchmark your changes: @btime from BenchmarkTools.jl is more reliable than @time. It may help to use a smaller input for these tests.
  • See how your code scales with input size; this helps you get a feel for the point at which the overhead of parallelising the code becomes worth it.
  • Make sure you are running Julia with multiple threads (use Threads.nthreads() to check).
  • Make sure there are no race conditions in whichever for loop you use Threads.@threads on: do the iterations depend on each other, or have to run in order?
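A minimal sketch of the last two points, on a toy kernel (fill_columns! is a made-up example, not your code):

```julia
# Check the thread count before benchmarking any @threads code:
println("Julia is running with $(Threads.nthreads()) thread(s)")

# Race-free use of Threads.@threads: each iteration writes a disjoint
# column of A, and no iteration depends on the result of another,
# so the iterations may run in any order on any thread.
function fill_columns!(A)
    Threads.@threads for j in axes(A, 2)
        for i in axes(A, 1)
            A[i, j] = i + j
        end
    end
    return A
end

A = zeros(100, 100)
fill_columns!(A)
```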

I have a (not so clean) repository with some examples of progressively parallelising code and measuring the performance on a simple problem. It contains a few figures showing the performance differences. You only really need to look at “main.jl”; it demonstrates threading and even GPU parallelisation. It’s not the best, but it has a few examples.

I see two ways: either you know theoretically that your algorithm is (embarrassingly?) parallel, or you analyse and experiment (which you seem to be doing).

You could check the former. I usually profile the serial code first, to make sure there is no unintended bottleneck (w.r.t. allocations and dynamic dispatch).
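A minimal profiling sketch using the standard-library Profile module (work is a placeholder standing in for your get_data!):

```julia
using Profile

# Placeholder workload; substitute your own function here.
work(n) = sum(sqrt(i) for i in 1:n)

work(10)                      # warm up first, so compilation isn't profiled
Profile.clear()
@profile work(10^7)           # sample the call stack while the work runs
Profile.print(maxdepth = 10)  # inspect where the time is actually spent
```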

I can’t reproduce this, because the concrete parameter values are missing. In any case, I’d recommend using @btime, because @time can include compilation time.
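To illustrate the difference, assuming BenchmarkTools.jl is installed (f and v here are just a toy example):

```julia
using BenchmarkTools  # third-party; provides @btime

f(x) = sum(abs2, x)
v = rand(1_000)

@time f(v)    # first call: the timing includes JIT compilation
@time f(v)    # second call: closer to the true runtime
@btime f($v)  # many samples, reports the minimum; $ avoids global-variable overhead
```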

From the looks of it, I’d expect one or two hotspots: pseudo_G! and/or mul!. I suspect the former could benefit from LoopVectorization.jl and maybe parallelisation; the latter should dispatch to a BLAS call (did you consider using MKL.jl?), which BLAS parallelises by itself.
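To illustrate both points: BLAS has its own thread pool, separate from Julia’s, and the pseudo_G! loop looks safe to thread since every (iz, iy, ix, iT) combination writes a disjoint view. A hedged sketch (pseudo_G_threaded! and the NamedTuple pa are my stand-ins, not your actual code):

```julia
using LinearAlgebra

# BLAS threads are configured independently of Julia's threads:
println("BLAS threads: ", BLAS.get_num_threads())
# BLAS.set_num_threads(4)  # tune for your machine if mul! is the hotspot

# Hypothetical threaded variant of pseudo_G!; parallelising the outer loop is
# safe here because each iteration writes its own disjoint slice of G_per_rec.
function pseudo_G_threaded!(G_per_rec, pa, ir)
    Threads.@threads for ix in 1:pa.nx
        for iy in 1:pa.ny, iz in 1:pa.nz, iT in 1:pa.nT
            Gg = view(G_per_rec, :, 1, iz, iy, ix, iT)
            circshift!(Gg, pa.rick0, pa.Tshift[iT] + pa.shift[iz, iy, ix, ir])
        end
    end
end

# Tiny stand-in parameters (a NamedTuple instead of the original Params struct):
pa = (nx = 2, ny = 2, nz = 2, nT = 2,
      rick0 = collect(1.0:8.0),
      shift = zeros(Int, 2, 2, 2, 1),
      Tshift = [0, 1])
G = zeros(8, 1, 2, 2, 2, 2)
pseudo_G_threaded!(G, pa, 1)
```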

If LoopVectorization doesn’t help, Polyester.jl could still be more appropriate for this kernel than Threads.@threads.
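The swap is mechanical: replace Threads.@threads with Polyester’s @batch on the outer loop. A sketch on a toy kernel, assuming Polyester.jl is installed (axpy_batched! is an invented example):

```julia
using Polyester  # third-party; provides the lightweight @batch threading macro

function axpy_batched!(y, a, x)
    # @batch spins up cheaper tasks than Threads.@threads, which can win
    # for short-running loop bodies like this one.
    @batch for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end

y = ones(10_000)
x = fill(2.0, 10_000)
axpy_batched!(y, 3.0, x)  # each y[i] becomes 1 + 3 * 2 = 7
```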

Regarding GPU optimizations others have to chime in.