How should I approach GPU programming?

Hello, thank you very much for your comments.

Yes I have read the performance tips - if you have a specific glaring thing in my code, please do let me know, I really want to write it as good as possible.

  1. Yes H1, AD etc. are in fact globals defined as:
#---------------------------- Some Constants ------------------------------------#
# Set some constants
const H = 0.04
const A0 = 50
const AD = 348.15;
const FAC = 5/8;
const BETA = 4;
const ZETA = 0.060006;
const V    = 0.0011109;
const DT   = 0.016665;
const H1   = 1/H;
  1. In the loop I am making everything is scalars or tuples, so for example p_i is a tuple, while q would be a scalar etc. I am not sure how to improve that part.

  2. The reason I am using tuples are that they are locked in size and give me very good performance. The trouble is though that tuple’s do not allow for in place operations. StaticArrays would have the same problem I assume, since they are static?

Kind regards

It had a very big impact, thank you so much! Almost 4x times the performance. I did not know that square root would have such big an impact, but it makes sense now when I think about it. The 4x impact was with your basic implementation. Testing your proper one is not possible in current form, since I actually need to the rij in here at x_ij and z_ij:

for j = 1:nTot
            if (i != j)
                p_j = pg[j];
                rij = p_i .- p_j;
                RIJ = sqnorm(rij);
                
                if (RIJ <= H2)
                    RIJ = sqrt(RIJ)
                    RIJ1= 1.0 / RIJ;
                    q   = RIJ * H1
                    qq3 = q*(q-2)^3;
                    Wq  = AD * FAC * qq3;

                    x_ij = rij[1];
                    z_ij = rij[3];

                    Wgx += Wq * (x_ij * RIJ1) * H1;
                    Wgz += Wq * (z_ij * RIJ1) * H1;
                end
            end
        end

But thanks for the 4x performance already.

Kind regards

1 Like

Here’s an example on how to get rid of a conditional (I’m assuming q is always positive):

q = min(RIJ/H, 2) 
qq3 = q*(q-2)^3;
...

Then if q <= 2 is false, qq3 becomes zero, and the rest of the lines of code have no effect. The branch is gone. Of course, if q < 2 is false most of the time, then this isn’t very smart, but it could be a step on the way to get the code to vectorize.

The if i != j branch might also be possible to remove by similar means, making sure the code has no effect when i and j are the same.

Actually, it’s better to write

sqnorm(x) = sum(abs2, x)

because then you avoid creating a temporary array.

Edit: It seems like this is for a tuple, in which case it doesn’t make a difference, but for a generic function sum(abs2, x) is preferable.

3 Likes

Hey again

Yes, q is always positive. I have used your function for sqnorm and it is true what you say, for tuple it doesn’t do a difference.

I have tried to use your way of avoiding the q <= 2, condition but it seems to slow down the code instead of speeding it up. I am also not able to make it give the correct result, i.e. particles are behaving very weirdly, presumeable because I did not explain q too well. So here goes:


src: https://en.wikipedia.org/wiki/Smoothed-particle_hydrodynamics

s is often 2, and one defines q = norm(r_ij) / h, so the condition for allowing a particle j to interact with particle i, is that the normalized distance should be 2 or less. Using @Skoffer 's suggestions the condition instead changes to that norm(r_ij)^2 <= 4*H^2 - I have confirmed this produces the same result.

But then when I write what you suggest, for some reason, it is giving incorrect results. So to remove the i != j conditional I simply made a double condition in the inner-most if statement.

For now the end result has become:

function PackStep(pg,pg_tmp,u,u_tmp,nCoords,nTot)
    @fastmath @inbounds @threads for i = 1:nCoords
        Wgx = 0.0;
        Wgz = 0.0;
        p_i   = pg[i];

        for j = 1:nTot
                p_j = pg[j];
                rij = p_i .- p_j;
                RIJ = sqnorm(rij);

                if (RIJ > 0 && RIJ <= H2)
                    RIJ = sqrt(RIJ)
                    RIJ1= 1.0 / RIJ;
                    q   = RIJ*H1;
                    qq3 = q*(q-2)^3;
                    Wq  = AD * FAC * qq3;

                    x_ij = rij[1];
                    z_ij = rij[3];

                    Wgx += Wq * (x_ij * RIJ1) * H1;
                    Wgz += Wq * (z_ij * RIJ1) * H1;
                end
        end

        # Define here since I use it a lot
        u_i = u[i];

        dux = (-BETA * Wgx * V - ZETA * u_i[1])*DT;
        duz = (-BETA * Wgz * V - ZETA * u_i[3])*DT;

        dx  = dux*DT;
        dz  = duz*DT;

        u_tmp[i]  =   u_i   .+ (dux, 0.0, duz)
        pg_tmp[i] =   pg[i] .+ (dx,  0.0, dz)
    end

    @inbounds for i = 1:nCoords
        u[i]  = u_tmp[i];
        pg[i] = pg_tmp[i];
    end

end

And the inner-most if statement does not change performance, but will make it easier to remove completely when a neighbour search is included.

For now brute forcing is fine, since I only have 3000 particles, but someday I might have 100k, so it is nice to have it robust.

I know initially my question was about GPU programming, but I am surprised how much you could help me squeeze out of my poor cpu :wink:

Next steps will be to figure out if I can replace tuple’s with something more efficient, which also allows for inplace operation.

Kind regards, thanks for the help all

One more thing, which may be useful (but it wouldn’t give much since it’s O(n) and your bottleneck is O(n^2)), you can remove

    @inbounds for i = 1:nCoords
        u[i]  = u_tmp[i];
        pg[i] = pg_tmp[i];
    end

and instead do binding outside of PackStep function, something like this:

for _ in 1:number_of_steps
  PackStep(pg, pg_tmp, u, u_tmp, nCoords, nTot)
  pg, pg_tmp = pg_tmp, pg
  u, u_tmp = u_tmp, u
end

In this case it will be binding, not copying, so you can squeeze few additional nanoseconds.

P.S.: for nearest neighbours may be this thread will be relevant Cell list algorithm is slower than double for loop

Doing your suggestion ended up taking more memory, so I have left it for now.

And thank you very much for that link! Will have a look later, no need to code it if someone already has an implementation.

Kind regards

Hi

Sorry, it seems most of my suggestions was wrong. In the meantime I saw the other thread on this topic as well. Been playing with the code from that side.

  • I now agree that the tuples approach seems very good. I believe StaticArrays.jl is a tuple under the hood, potentially with just a bit more boilerplate to use them in vector contexts.
  • calling @time on these last versions of the PackStep (based on the C++ looping concept) the allocations are 0, once the global on Wgx and Wgz is removed and they are initialized as 0.0 as in this version. It seems that with these scalars and tuples the compiler is able to handle all the temporary variables without doing allocations.
  • I guess there is a theoretical very small improvement available by defining a const BETA_V_DT and doing that multiplication once-off, similar for ZETA_DT, but on the quick tests I did it wasn’t measurable.
  • I suspect for the GPU case, if tuples are not available you might again need to consider going back to a 2D array (3xN), but then temporary variables are more of a concern and you should try to reuse variables to prevent new allocations.

Just be careful, if somehow 2 particles happen to get onto the same position, the will not repulse each other, since RIJ is 0, maybe more of an issue when you switch to Float32 on the GPU than running on Float64. But the alternative generates NaN values, which also cause problems…

Here is a version that runs on 3xN arrays and gets almost the same performance as the tuple version (there might be a problem though, since I get some particles which escape the square)

function PackStep5(pg,u,nCoords)
    totParticples = size(pg,2)
    p_i = zeros(3)
    p_j = zeros(3)
    rij = zeros(3)
    for i = 1:nCoords
        Wgx = 0.0
        Wgz = 0.0

        p_i[1] = pg[1,i]
        p_i[3] = pg[3,i]

        for j = 1:totParticples
            if (i != j)
#                Should be able to use something like p_j[1:3] .= @view pg[1:3,j], but the view allocates and makes it very slow
                p_j[1] = pg[1,j]
                p_j[3] = pg[3,j]
                rij .= p_i .- p_j
                RIJ = sum(abs2,rij)
                if (0 < RIJ <= H2_4)
                    RIJ = sqrt(RIJ)
                    q   = RIJ/H
                    qq3 = q*(q-2)^3
                    Wq  = AD * FAC * qq3
                    
                    Wgx += Wq * (rij[1]/RIJ) * H1
                    Wgz += Wq * (rij[3]/RIJ) * H1
                end
            end
        end

        dux  = (-BETA * Wgx * V - ZETA * u[1,i])*DT
        duz  = (-BETA * Wgz * V - ZETA * u[3,i])*DT

        u[1,i]  += dux
        u[3,i]  += duz
        pg[1,i] += dux*DT
        pg[3,i] += duz*DT
    end
end

No problem! Thank you very much for taking time to show me how you would do it with arrays - I was actually just sitting with it and so frustrated that I kept getting 18 million allocations, but thank God, you showed me that I should actually preallocate p_i, p_j etc. The performance I get is now, for ONE time step:

0.766872 seconds (18.33 M allocations: 977.844 MiB, 24.68% gc time)
0.158227 seconds (75.20 k allocations: 3.087 MiB)

Where the upper line is where I tested your @view comment, you are right, it is incredible how much it actually changes, just with one line! I guess that is supposed to happen though…? The bottom line is with your correct command, just below. If I run your code with 100 time steps as usual I get:

6.615716 seconds (75.50 k allocations: 3.173 MiB)

Which shows that the allocations stay the same which is quite nice. The nice thing about your approach is that once again I can do it in a way so that I do not need a temporary array etc. (even though that would be mathematically more correct).

Unfortunately this code is about 15 times slower than the optimized tuple code… But with this I atleast have a way to implement it using GPU.

So once again thank you very much, your code snippet really helped me understand how to do it properly.

Kind regards

As a further test I tried to include these commands as before, but they break down the simulation and increase the time drastically. This might be due to some property of arrays I am not aware off.

For me tuple’s are more and more convincing to use, even if I have to use a tad more memory as shown below:

0.658729 seconds (225.24 k allocations: 11.669 MiB)

Kind regards, and thanks all

Have you tried using a Vector of SVectors from the StaticArrays.jl package?

1 Like

I have played around with it a tiny bit, but it seems to me that under the hood it is also “just” a tuple? I am having trouble replacing elements in an SVector properly specifically, and afaik in-place operations are not possible on them either…

Kind regards

SVectors are immutable. You can “set elements” using setindex, which creates a new SVector:

julia> using StaticArrays

julia> v = SA[1, 2, 3]
setve3-element SArray{Tuple{3},Int64,1,3} with indices SOneTo(3):
 1
 2
 3

julia> setindex(v, 10, 2)
3-element SArray{Tuple{3},Int64,1,3} with indices SOneTo(3):
  1
 10
  3

But as far as I could see from briefly looking at your algorithm, most of what you need are vector operations anyway?

2 Likes

By the way, could you please post a complete MWE (minimal example) that we can just copy and paste?

2 Likes

From my understanding:

p_j .= pg[:,j]

first creates a new 3 element vector (with its malloc), which then in-place assigned gets to the existing p_j

p_j .= @view pg[:,j]

first creates a new view object (with its small malloc, containing size and ptr), which then in-place assigned gets to the existing p_j

For large matrixes, the view version is much more efficient, since it only contains the ptr and size members, compared to the non view version which copies all elements, unfortunately for this case the 3 elements and the view is both similar in size and because it is in the inner loop, it becomes a problem.

I was hoping with some trick like explicitly specifying the size

p_j[1:3] .= pg[1:3,j]

the compiler would be smart enough to realise it can unroll the loop and skip the interim allocation, similar to the tuple case. (Anyone know if there is some such trick?)
Thus I was forced to either manually write the assignment loop with a for (in which case I have sometimes struggled with the loop variable becoming a new assignment) or do it explicitly as I did, since you only worry about the first and last element.

The 75.5k allocations, still sounds like your test case had some assignment happening every time. Or was that during the file reading and not assignments in PackStep. When I tested, I only had the 3 assignments:

    p_i = zeros(3)
    p_j = zeros(3)
    rij = zeros(3)

happening in each PackStep call. You can even get rid of them, by passing them in, but I would only really consider that since it makes using functions clumsy if the temp variables are big or if the function execution time is less than a 10ms or so. I still wish there was a way to tell Julia this function is only allowed to use stack and not dynamic memory allocations.

In terms of a minimum example, the code in the other thread helps, but doesn’t work due to needing to load a file. I replaced that with:

    coords = randn(1000,2);
    clamp!(coords, -1.0, 1.0)

but it means that if call the multiple times, you cannot compare final outputs exactly.

I still don’t understand why the vector approach should be so much slower, especially if @inbounds is specified. I thought on my side, I got them to within 10% (but now that I think about it, the vector approach used the conditional sqrt, while the tuple used sqrt every call, so maybe the vector version is significantly slower). But it probably makes some sense, for the tuple or SVector, the size is fixed and can be handled explicitly, while to the vector, vector operations need to cater for variable length. But still the code is now mostly scalar operations, so I didn’t expect vectors to be so bad (15 times?).

@inbounds should not break the simulation, unless there is a big problem in the code.
@fastmath might give different results due to different rounding if sequence of calculations are swapped. If that is a problem then I would be careful of trying to go to a GPU where the calculations are done by a different engine and where ArrayFire or OpenCL abstraction layers inbetween might cause different sequence of calculations.
@threads should ideally also not break the simulation, however I can think due to race conditions it is more likely to case weirdness if the code isn’t structured carefully.

MINIMAL WORKING EXAMPLE:

EDIT: Hard to MWE in my case, but I would like help explicitly with optimizing “PackStep()”

On my Github repository I have put the files:

  1. “CircleUniformGrid.txt” - It holds the initial points for the circular distribution and then I draw the square profile programmatically around it using “drawSquare”. It is important for me to use this file, since this is what keeps my benchmarks consistent. The text file is just two coloumns with points hope it is not too much.
  2. “particle_SA.jl” - My implementation using StaticArrays
  3. “particle_tuple.jl” - My implementation using Tuple

Notet that since last time, a neighbour search algorithm has been included, which also speeds up performance. Here is the link:

Note; what I am timing on now is ONLY this loop:

@time for t = 1:100
        if mod(t,8) == 0
            balltree = BallTree(pg,reorder = false)
            idxs .= inrange(balltree, pg, R, false)
        end
        PackStep(pg,pg_tmp,u,u_tmp,nCoords,nTot,idxs)
    end

Since the way I am making StaticArrays initially still is a bit bad, see read2DParticleGroup. My current timings are:

StaticArrays: 0.147541 seconds (366.36 k allocations: 29.606 MiB)
tuple: 0.124091 seconds (327.98 k allocations: 30.546 MiB)

Where I have for both used the lowest number they would return in time.

@dpsanders

Thanks for showing me setindex functionality - yes, one can do it with vector operations, but I have chosen for loop approach since the way I am coding it I get much better performance using a for loop compared to vector notation - obviously I might do some bad stuff using dot notation, since I only know the “Matlab way” in this regards. The good news is I managed to get static arrays working, and they are only about 5-10 % slower than using tuple’s.

@lwabeke

Thanks for the explanation about views and your thought process behind it and some more information about the allocs. Yes at the time, when I used naive arrays in Julia I experienced that much worse performance, since for some reason I couldn’t multithread it - I agree it should have worked, maybe I did some mistake there.

It is not important that the result is 100% the same all the time, it should just not vary too much.

Thanks for your help.

Kind regards

In my experience I’ve seen huge number of allocations with @spawn and I suppose @threads macros. It looks like they generate lots of throwaway structures. Can you try and run this code without @threads? It’ll be slower of course, but show you how much you really allocate.

Okay, testing that now - again I am only timing the for loop which runs the main algorithm, in the function “RunAlgorithm()”.

For tuple:

0.077524 seconds (175.61 k allocations: 22.101 MiB)

For static arrays:

0.127115 seconds (321.97 k allocations: 27.006 MiB)

And I kept everything, I just removed @threads as you asked. For some reason tuple became even faster, I assume the number of particles for now are too small to benefit from multiple threads…

EDIT: But StaticArrays did not become faster, so maybe I have something quite wrong in my usage with StaticArrays.

Kind regards

1 Like

Another thing I have found is that most of the allocations happen, at the calculation of idxs:

@time for t = 1:100
        if mod(t,8) == 0
            pg_arr .= reshape(collect(Iterators.flatten(pg)),(3,nTot))
            balltree = BallTree(pg_arr,reorder = false)
            @time idxs .= inrange(balltree, pg_arr, R, false)
        end
        PackStep(pg,pg_tmp,u,u_tmp,nCoords,nTot,idxs)
    end

Which results in;

0.003922 seconds (14.98 k allocations: 1.751 MiB)
0.003879 seconds (14.90 k allocations: 1.729 MiB)
0.003807 seconds (14.81 k allocations: 1.706 MiB)
0.003827 seconds (14.71 k allocations: 1.679 MiB)
0.003668 seconds (14.62 k allocations: 1.655 MiB)
0.003727 seconds (14.55 k allocations: 1.636 MiB)
0.003757 seconds (14.50 k allocations: 1.624 MiB)
0.003770 seconds (14.48 k allocations: 1.616 MiB)
0.003686 seconds (14.47 k allocations: 1.609 MiB)
0.003686 seconds (14.44 k allocations: 1.601 MiB)
0.003835 seconds (14.39 k allocations: 1.588 MiB)
0.003655 seconds (14.37 k allocations: 1.581 MiB)
0.108574 seconds (175.98 k allocations: 22.126 MiB)

So each time the neighbour search is run it allocates about 14k times. I thought my “.=” would make sure to replace “in place”, but I guess since that idxs is Array{Array{Int64,1},1}, and the subarrays don’t necesarily have the same number of elements each time, it has to reallocate.

Maybe there is no way around this?

Kind regards

.= wouldn’t help you here, since allocations happens inside inrange function: https://github.com/KristofferC/NearestNeighbors.jl/blob/master/src/inrange.jl#L16

So, if you do not want to allocate, one needs to rewrite inrange in mutable form, i.e. something like that (untested)

import NearestNeighbors: check_input, check_radius, inrange_point!

function inrange!(idxs, tree::NNTree,
                 points::Vector{T},
                 radius::Number,
                 sortres=false) where {T <: AbstractVector}
    check_input(tree, points)
    check_radius(radius)

    for i in 1:length(points)
        empty!(idxs[i])
    end

    for i in 1:length(points)
        inrange_point!(tree, points[i], radius, sortres, idxs[i])
    end
    return idxs
end

and in your function

@time for t = 1:100
        if mod(t,8) == 0
            pg_arr .= reshape(collect(Iterators.flatten(pg)),(3,nTot))
            balltree = BallTree(pg_arr,reorder = false)
            @time inrange!(idxs, balltree, pg_arr, R, false)
        end
        PackStep(pg,pg_tmp,u,u_tmp,nCoords,nTot,idxs)
    end

But it is not obvious whether it give performance boost.

1 Like