How should I approach GPU programming?

Hello!

This is the first time I have really dived into GPU programming. I have followed along with this tutorial, https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/, and have started to get a feel for some simple operations, and I already see promising results.

In the end I would like to make a “kernel” which calculates a function like this:

# requires `using Base.Threads` (for @threads) and `using LinearAlgebra` (for norm)
function PackStep(pg,pg_tmp,u,u_tmp,nCoords,nTot)
    @fastmath @inbounds @threads for i = 1:nCoords
        Wgx = 0.0;
        Wgz = 0.0;
        p_i   = pg[i];

        @inbounds @simd for j = 1:nTot
            if (i != j)
                p_j = pg[j];
                rij = p_i .- p_j;
                RIJ = norm(rij);
                RIJ1 = 1.0 / RIJ;
                q   = RIJ/H;
                if (q <= 2)
                    qq3 = q*(q-2)^3;
                    Wq  = AD * FAC * qq3;

                    x_ij = rij[1];
                    z_ij = rij[3];

                    Wgx += Wq * (x_ij * RIJ1) * H1;
                    Wgz += Wq * (z_ij * RIJ1) * H1;
                end
            end
        end

        # Define here since I use it a lot
        u_i = u[i];

        dux = (-BETA * Wgx * V - ZETA * u_i[1])*DT;
        duz = (-BETA * Wgz * V - ZETA * u_i[3])*DT;

        dx  = dux*DT;
        dz  = duz*DT;

        u_tmp[i]  =   u_i   .+ (dux, 0.0, duz)
        pg_tmp[i] =   pg[i] .+ (dx,  0.0, dz)
    end

    @inbounds @simd for i = 1:nCoords
        u[i]  = u_tmp[i];
        pg[i] = pg_tmp[i];
    end

end

To give some basic intuition about the function: it calculates how a particle, p_i, is going to move depending on Wgx and Wgz. So for each particle from 1 to nCoords one calculates how it should move depending on the other particles, which is done in the inner loop from j = 1:nTot.

The first four inputs are Array{Tuple{Float64,Float64,Float64},2} holding N particles with X, Y and Z coordinates, and the last two are just Int64. The reason for using tuples is that they have given very good performance in the CPU implementation, but since they do not allow in-place operations, they seem unfit for GPUs. Standard arrays seem to have bad performance as well, at least when I checked with the CPU implementation.

My question is quite open-ended since I am new to this, so here are a few questions:

Currently I have been testing with the CUDA options, but maybe something like ArrayFire is smarter?

What should I replace tuples with, or can I get them to work on the GPU?

If anyone has suggestions for resources to gradually look into regarding GPU programming in Julia, please do share.

Kind regards

This is slightly unrelated, but you don’t need this many @fastmath; just the one on the outer block will imply it for everything inside.

Haha thanks, fixing that asap.

Kind regards

Same thing for @inbounds, it applies to the whole block, including nested blocks. @simd won’t do anything here, I think, because of the conditionals. Can you re-write to avoid them?
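For example (a toy function, not your PackStep), one pair of macros on the outer loop already covers everything nested inside it:

function pairwise_sum(a::Vector{Float64})
    s = 0.0
    @fastmath @inbounds for i in eachindex(a)
        for j in eachindex(a)   # inherits both @fastmath and @inbounds
            s += a[i] * a[j]
        end
    end
    return s
end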

I don’t know much about GPU processing, but I suspect that there are some significant CPU optimization opportunities left on the table here.

I see, I will only write @inbounds on the two outer loops then. With regards to @simd I will try removing it and see what happens - but you are probably right that the conditionals have a huge effect.

Regarding the conditionals: the first conditional, “i != j”, is there because I want to avoid the contribution of particle i on itself. In theory I do not need the condition, but due to floating point math I experience that rij becomes NaN, which of course ruins the whole simulation. When I was vectorizing my code instead of using a for loop, I was able to avoid this without a condition, since I just set “q[i] = 0”, so no errors occurred.

The second conditional is there because I only want the nearest neighbours to contribute to a particle, since the math I am using only allows a specific set of neighbours to interact with a particle (illustrated in a figure here: https://www.researchgate.net/publication/324097218_A_Symmetric_Particle-Based_Simulation_Scheme_towards_Large_Scale_Diffuse_Fluids/figures?lo=1&utm_source=google&utm_medium=organic).

The way to avoid this conditional would be, for example, to use a neighbour-search algorithm, and I have already found a package with a “Rangesearch” which I will try to apply to my problem. So it is possible to avoid this conditional. One could argue that a neighbour search would be slower than brute force, but this is not the case, since I see a 10x performance boost if I limit the inner loop to 1:50 just as a test.

And actually, using a neighbour search might be the best improvement, since then in theory both conditionals can be removed. Hopefully I can make it work on tuples; otherwise I will have to figure something out.
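As a rough, untested sketch (using NearestNeighbors.jl just as an example here, not necessarily the package I found), the fixed-radius query I have in mind would look something like:

using NearestNeighbors

coords = rand(3, 3000)            # stand-in for the particle positions as a 3xN matrix
H      = 0.04                     # kernel radius
tree   = KDTree(coords)           # build once per step (or even less often)

i      = 1
neighs = inrange(tree, coords[:, i], 2H)   # indices of all particles within 2H of particle i
# note: neighs also contains i itself (distance 0), so the i != j / RIJ > 0 guard
# is still needed, or i has to be filtered out of the list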

I will improve the CPU code, but if anyone has knowledge or suggestions for a GPU approach, I would still like to hear them.

EDIT: With regards to @inbounds and @simd you were also right.

Kind regards

Hi

I’m also still ramping up on GPUs, but here are some observations based on my experience trying to optimise on CPUs.
Have you studied https://docs.julialang.org/en/v1/manual/performance-tips/ ?

  • H, H1, AD, FAC etc. seem to be globals?
  • In my experience the biggest performance killer is memory allocation. You seem to be doing a lot of allocations; granted, many are just aliases and others are scalar values, neither of which is probably a big issue, but some might be vectors or larger, I’m not sure. Try to allocate beforehand and modify in place if that is at all possible.
  • You mention the data type is Array{Tuple{Float64,Float64,Float64},2}, but what about just making it a 3D array, or even 2D, since you currently seem to only be using the array as a 1D object? (Is this maybe a Matlab legacy thing?) Or maybe an array of StaticArrays instead of the tuples? (A quick sketch of these layouts follows below this list.)
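A quick sketch of the layouts I mean (hypothetical sizes, just to illustrate):

using StaticArrays

pg_tuples  = Vector{Tuple{Float64,Float64,Float64}}(undef, 1000)   # current layout, effectively used as 1D
pg_matrix  = zeros(3, 1000)                                         # plain 3xN matrix, one column per particle
pg_svector = Vector{SVector{3,Float64}}(undef, 1000)                # array of StaticArrays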

I would guess these same things would be issues on the GPU, where you’re more memory constrained.

Also, this part can be improved (hard to tell the impact, it should be profiled of course): norm calculates a square root, which is unnecessary if you don’t end up going into the if branch.

So, you can rewrite it as

# somewhere at the beginning of the code
sqnorm(x) = sum(x.^2)
H2 = (2*H)^2   # q <= 2 is equivalent to squared distance <= (2H)^2
Hinv = H^-1

...
# in PackStep function 
rij = p_i .- p_j
RIJ = sqnorm(rij)
if RIJ <= H2
  RIJ = sqrt(RIJ)
  q = RIJ * Hinv
  RIJ1 = 1.0 / RIJ
....

Of course, a proper implementation of sqnorm should have @inbounds and @simd too. Maybe it should even be something like

function sqnorm(x, y)
  r = (x[1] - y[1])^2
  n = length(x)
  @inbounds @simd for i in 2:n
      r += (x[i] - y[i])^2
  end
  return r
end

and you can get rid of the intermediate rij calculation.
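For example (made-up points, just to show the call):

p_i  = (0.10, 0.00, 0.20)
p_j  = (0.12, 0.00, 0.21)
RIJ2 = sqnorm(p_i, p_j)   # squared distance computed directly, no intermediate rij tuple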

P.S.: Sorry, it doesn’t answer your question though.

Hello, thank you very much for your comments.

Yes, I have read the performance tips - if you spot a specific glaring thing in my code, please do let me know, I really want to write it as well as possible.

  1. Yes, H, H1, AD etc. are in fact globals, defined as:
#---------------------------- Some Constants ------------------------------------#
# Set some constants
const H = 0.04
const A0 = 50
const AD = 348.15;
const FAC = 5/8;
const BETA = 4;
const ZETA = 0.060006;
const V    = 0.0011109;
const DT   = 0.016665;
const H1   = 1/H;
  2. In the loop everything I create is a scalar or a tuple, so for example p_i is a tuple, while q is a scalar etc. I am not sure how to improve that part.

  3. The reason I am using tuples is that they are fixed in size and give me very good performance. The trouble though is that tuples do not allow in-place operations. StaticArrays would have the same problem, I assume, since they are static?

Kind regards

It had a very big impact, thank you so much! Almost 4x the performance. I did not know that the square root would have such a big impact, but it makes sense now that I think about it. The 4x impact was with your basic implementation. Testing your “proper” one is not possible in its current form, since I actually need rij here for x_ij and z_ij:

for j = 1:nTot
    if (i != j)
        p_j = pg[j];
        rij = p_i .- p_j;
        RIJ = sqnorm(rij);

        if (RIJ <= H2)
            RIJ = sqrt(RIJ)
            RIJ1= 1.0 / RIJ;
            q   = RIJ * H1
            qq3 = q*(q-2)^3;
            Wq  = AD * FAC * qq3;

            x_ij = rij[1];
            z_ij = rij[3];

            Wgx += Wq * (x_ij * RIJ1) * H1;
            Wgz += Wq * (z_ij * RIJ1) * H1;
        end
    end
end

But thanks for the 4x performance already.

Kind regards


Here’s an example on how to get rid of a conditional (I’m assuming q is always positive):

q = min(RIJ/H, 2) 
qq3 = q*(q-2)^3;
...

Then if q <= 2 is false, qq3 becomes zero, and the rest of the lines of code have no effect. The branch is gone. Of course, if q <= 2 is false most of the time, then this isn’t very smart, but it could be a step on the way to getting the code to vectorize.

The if i != j branch might also be possible to remove by similar means, making sure the code has no effect when i and j are the same.
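Putting both together, a fully branchless version of the inner-loop body could look something like this (an untested sketch, reusing your sqnorm and constants):

mask = Float64(i != j)            # 1.0 for distinct particles, 0.0 when i == j
RIJ2 = max(sqnorm(rij), 1e-300)   # keep it strictly positive so we never divide by zero
RIJ  = sqrt(RIJ2)
q    = min(RIJ * H1, 2.0)         # beyond q = 2 the kernel contributes nothing
Wq   = AD * FAC * q * (q - 2)^3   # exactly zero when q == 2
Wgx += mask * Wq * (rij[1] / RIJ) * H1
Wgz += mask * Wq * (rij[3] / RIJ) * H1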

Actually, it’s better to write

sqnorm(x) = sum(abs2, x)

because then you avoid creating a temporary array.

Edit: It seems like this is for a tuple, in which case it doesn’t make a difference, but for generic arrays sum(abs2, x) is preferable.


Hey again

Yes, q is always positive. I have used your sqnorm function, and it is true what you say: for a tuple it doesn’t make a difference.

I have tried to use your way of avoiding the q <= 2 condition, but it seems to slow down the code instead of speeding it up. I am also not able to make it give the correct result, i.e. particles behave very weirdly, presumably because I did not explain q too well. So here goes:


[figure: SPH kernel definition] src: Smoothed-particle hydrodynamics - Wikipedia

s is often 2, and one defines q = norm(r_ij) / h, so the condition for allowing a particle j to interact with particle i is that the normalized distance should be 2 or less. Using @Skoffer’s suggestion the condition instead becomes norm(r_ij)^2 <= 4*H^2 - I have confirmed this produces the same result.

But when I write what you suggest, for some reason it gives incorrect results. So to remove the i != j conditional I simply made a combined condition in the inner-most if statement.

For now the end result has become:

function PackStep(pg,pg_tmp,u,u_tmp,nCoords,nTot)
    @fastmath @inbounds @threads for i = 1:nCoords
        Wgx = 0.0;
        Wgz = 0.0;
        p_i   = pg[i];

        for j = 1:nTot
            p_j = pg[j];
            rij = p_i .- p_j;
            RIJ = sqnorm(rij);

            if (RIJ > 0 && RIJ <= H2)
                RIJ = sqrt(RIJ)
                RIJ1= 1.0 / RIJ;
                q   = RIJ*H1;
                qq3 = q*(q-2)^3;
                Wq  = AD * FAC * qq3;

                x_ij = rij[1];
                z_ij = rij[3];

                Wgx += Wq * (x_ij * RIJ1) * H1;
                Wgz += Wq * (z_ij * RIJ1) * H1;
            end
        end

        # Define here since I use it a lot
        u_i = u[i];

        dux = (-BETA * Wgx * V - ZETA * u_i[1])*DT;
        duz = (-BETA * Wgz * V - ZETA * u_i[3])*DT;

        dx  = dux*DT;
        dz  = duz*DT;

        u_tmp[i]  =   u_i   .+ (dux, 0.0, duz)
        pg_tmp[i] =   pg[i] .+ (dx,  0.0, dz)
    end

    @inbounds for i = 1:nCoords
        u[i]  = u_tmp[i];
        pg[i] = pg_tmp[i];
    end

end

And the combined inner-most if statement does not change performance, but it will make the check easier to remove completely once a neighbour search is included.

For now brute forcing is fine, since I only have 3000 particles, but someday I might have 100k, so it is nice to have it robust.

I know initially my question was about GPU programming, but I am surprised how much you could help me squeeze out of my poor cpu :wink:

The next step will be to figure out if I can replace tuples with something more efficient which also allows in-place operations.

Kind regards, thanks for the help all

One more thing which may be useful (though it won’t gain much, since it’s O(n) and your bottleneck is O(n^2)): you can remove

    @inbounds for i = 1:nCoords
        u[i]  = u_tmp[i];
        pg[i] = pg_tmp[i];
    end

and instead do binding outside of PackStep function, something like this:

for _ in 1:number_of_steps
  PackStep(pg, pg_tmp, u, u_tmp, nCoords, nTot)
  pg, pg_tmp = pg_tmp, pg
  u, u_tmp = u_tmp, u
end

In this case it will be rebinding, not copying, so you can squeeze out a few additional nanoseconds.

P.S.: for nearest neighbours, maybe this thread will be relevant: Cell list algorithm is slower than double for loop

Applying your suggestion ended up using more memory, so I have left it out for now.

And thank you very much for that link! Will have a look later, no need to code it if someone already has an implementation.

Kind regards

Hi

Sorry, it seems most of my suggestions were wrong. In the meantime I saw the other thread on this topic as well, and have been playing with the code from that side.

  • I now agree that the tuples approach seems very good. I believe an SVector from StaticArrays.jl is a tuple under the hood, potentially with just a bit more boilerplate so it can be used in vector contexts.
  • calling @time on these last versions of PackStep (based on the C++ looping concept), the allocations are 0, once the global on Wgx and Wgz is removed and they are initialized as 0.0 as in this version. It seems that with these scalars and tuples the compiler is able to handle all the temporary variables without allocating.
  • I guess there is a theoretically very small improvement available by defining a const BETA_V_DT and doing that multiplication once-off, similarly for ZETA_DT (see the small sketch after this list), but in the quick tests I did it wasn’t measurable.
  • I suspect for the GPU case, if tuples are not available you might again need to consider going back to a 2D array (3xN), but then temporary variables are more of a concern and you should try to reuse variables to prevent new allocations.
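The folding I mean would just be something like:

const BETA_V_DT = BETA * V * DT
const ZETA_DT   = ZETA * DT

# equivalent to dux = (-BETA * Wgx * V - ZETA * u_i[1])*DT, with the constant products folded
dux = -BETA_V_DT * Wgx - ZETA_DT * u_i[1]
duz = -BETA_V_DT * Wgz - ZETA_DT * u_i[3]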

Just be careful: if two particles somehow end up in the same position, they will not repulse each other, since RIJ is 0 - maybe more of an issue when you switch to Float32 on the GPU than when running on Float64. But the alternative generates NaN values, which also cause problems…

Here is a version that runs on 3xN arrays and gets almost the same performance as the tuple version (there might be a problem though, since I get some particles which escape the square)

function PackStep5(pg,u,nCoords)
    totParticles = size(pg,2)
    p_i = zeros(3)
    p_j = zeros(3)
    rij = zeros(3)
    for i = 1:nCoords
        Wgx = 0.0
        Wgz = 0.0

        p_i[1] = pg[1,i]
        p_i[3] = pg[3,i]

        for j = 1:totParticles
            if (i != j)
#                Should be able to use something like p_j[1:3] .= @view pg[1:3,j], but the view allocates and makes it very slow
                p_j[1] = pg[1,j]
                p_j[3] = pg[3,j]
                rij .= p_i .- p_j
                RIJ = sum(abs2,rij)
                if (0 < RIJ <= H2_4)
                    RIJ = sqrt(RIJ)
                    q   = RIJ/H
                    qq3 = q*(q-2)^3
                    Wq  = AD * FAC * qq3
                    
                    Wgx += Wq * (rij[1]/RIJ) * H1
                    Wgz += Wq * (rij[3]/RIJ) * H1
                end
            end
        end

        dux  = (-BETA * Wgx * V - ZETA * u[1,i])*DT
        duz  = (-BETA * Wgz * V - ZETA * u[3,i])*DT

        u[1,i]  += dux
        u[3,i]  += duz
        pg[1,i] += dux*DT
        pg[3,i] += duz*DT
    end
end

No problem! Thank you very much for taking the time to show me how you would do it with arrays - I was actually just sitting with it, frustrated that I kept getting 18 million allocations, but thankfully you showed me that I should preallocate p_i, p_j etc. The performance I get now, for ONE time step, is:

0.766872 seconds (18.33 M allocations: 977.844 MiB, 24.68% gc time)
0.158227 seconds (75.20 k allocations: 3.087 MiB)

The upper line is where I tested your @view comment - you are right, it is incredible how much difference one line makes! I guess that is to be expected though…? The bottom line is with your correct version just below it. If I run your code with 100 time steps as usual I get:

6.615716 seconds (75.50 k allocations: 3.173 MiB)

This shows that the allocations stay the same, which is quite nice. The nice thing about your approach is that once again I can do it in a way where I do not need a temporary array etc. (even though that would be mathematically more correct).

Unfortunately this code is about 15 times slower than the optimized tuple code… But with this I at least have a way to implement it on the GPU.

So once again thank you very much, your code snippet really helped me understand how to do it properly.

Kind regards

As a further test I tried to include these commands as before, but they break the simulation and increase the time drastically. This might be due to some property of arrays I am not aware of.

For me, tuples are more and more convincing to use, even if I have to use a tad more memory, as shown below:

0.658729 seconds (225.24 k allocations: 11.669 MiB)

Kind regards, and thanks all

Have you tried using a Vector of SVectors from the StaticArrays.jl package?


I have played around with it a tiny bit, but it seems to me that under the hood it is also “just” a tuple? Specifically, I am having trouble replacing elements in an SVector properly, and as far as I know in-place operations are not possible on them either…

Kind regards

SVectors are immutable. You can “set elements” using setindex, which creates a new SVector:

julia> using StaticArrays

julia> v = SA[1, 2, 3]
3-element SArray{Tuple{3},Int64,1,3} with indices SOneTo(3):
 1
 2
 3

julia> setindex(v, 10, 2)
3-element SArray{Tuple{3},Int64,1,3} with indices SOneTo(3):
  1
 10
  3

But as far as I could see from briefly looking at your algorithm, most of what you need are vector operations anyway?
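To make it concrete, here is a small sketch (toy values, not your real constants) of how the PackStep-style updates could look on a Vector of SVectors:

using StaticArrays

pg = [@SVector(rand(3)) for _ in 1:10]    # toy positions, one SVector per particle
u  = [@SVector(zeros(3)) for _ in 1:10]   # toy velocities
DT = 0.016665

i   = 1
rij = pg[i] - pg[2]                       # SVector arithmetic, no heap allocation
dux, duz = 0.1, -0.2                      # stand-ins for the kernel sums

u[i]  = u[i]  + SVector(dux, 0.0, duz)            # rebinding the element effectively updates it
pg[i] = pg[i] + SVector(dux*DT, 0.0, duz*DT)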
