How should I approach GPU programming?

By the way, could you please post a complete MWE (minimal working example) that we can just copy and paste?


From my understanding:

p_j .= pg[:,j]

first creates a new 3-element vector (with its own allocation), which is then assigned in place into the existing p_j, while

p_j .= @view pg[:,j]

first creates a new view object (with its own small allocation, holding just a pointer and size), which is then assigned in place into the existing p_j.

For large matrices the view version is much more efficient, since it only carries the pointer and size, whereas the non-view version copies all the elements. Unfortunately, in this case the 3 elements and the view object are similar in size, and because the assignment sits in the inner loop, it becomes a problem.

I was hoping that, with some trick like explicitly specifying the size,

p_j[1:3] .= pg[1:3,j]

the compiler would be smart enough to realise it can unroll the loop and skip the interim allocation, similar to the tuple case. (Does anyone know of such a trick?)
Thus I was forced to either write the assignment loop manually with a for (where I have sometimes struggled with the loop variable becoming a new assignment) or write it out explicitly as I did, since you only worry about the first and last element.
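
For reference, here is a minimal sketch of the three variants discussed above (reusing the names p_j, pg and j from the snippets; the sizes are made up):

    p_j = zeros(3)
    pg  = rand(3, 10)
    j   = 4

    p_j .= pg[:, j]        # allocates a temporary 3-element vector for pg[:, j]
    p_j .= @view pg[:, j]  # only allocates a small view object (and inside a function,
                           # recent Julia versions often avoid even that)

    @inbounds for k in 1:3 # manual loop: no temporaries, and k stays a purely local loop variable
        p_j[k] = pg[k, j]
    end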

The 75.5k allocations still sound like your test case had some allocation happening every time. Or was that during the file reading and not allocations in PackStep? When I tested, I only had the 3 allocations:

    p_i = zeros(3)
    p_j = zeros(3)
    rij = zeros(3)

happening in each PackStep call. You can even get rid of them by passing them in, but I would only really consider that if the temp variables are big or if the function execution time is less than 10 ms or so, since passing them in makes using the function clumsy. I still wish there was a way to tell Julia that a function is only allowed to use the stack and no dynamic memory allocations.
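
A rough sketch of that pattern (step! here is a made-up stand-in for PackStep, just to show the shape of it):

    function step!(p_i, p_j, rij, pg, j)   # hypothetical stand-in for PackStep
        p_i .= @view pg[:, 1]
        p_j .= @view pg[:, j]
        rij .= p_i .- p_j                  # all writes go into the preallocated buffers
        return nothing
    end

    p_i = zeros(3); p_j = zeros(3); rij = zeros(3)   # allocated once, outside the loop
    pg  = rand(3, 100)
    for j in 2:100
        step!(p_i, p_j, rij, pg, j)        # no per-call allocations for the temporaries
    end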

In terms of a minimal example, the code in the other thread helps, but it doesn’t work as-is because it needs to load a file. I replaced that with:

    coords = randn(1000,2);
    clamp!(coords, -1.0, 1.0)

but it means that if you call it multiple times, you cannot compare the final outputs exactly.
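
If exact repeatability across runs matters, one workaround (not in the original code) is to seed the random number generator before generating the test data:

    using Random
    Random.seed!(1234)             # fixed seed, so randn produces the same coords every run
    coords = randn(1000, 2)
    clamp!(coords, -1.0, 1.0)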

I still don’t understand why the vector approach should be so much slower, especially if @inbounds is specified. I thought on my side I got them to within 10% (though now that I think about it, the vector approach used the conditional sqrt while the tuple used sqrt on every call, so maybe the vector version is significantly slower after all). It probably makes some sense: for the tuple or SVector the size is fixed and can be handled explicitly, while for the vector the operations need to cater for variable length. But the code is now mostly scalar operations, so I didn’t expect vectors to be so bad (15 times?).

@inbounds should not break the simulation, unless there is a big problem in the code.
@fastmath might give different results due to different rounding if the sequence of calculations is swapped. If that is a problem, then I would be careful about trying to go to a GPU, where the calculations are done by a different engine and where ArrayFire or OpenCL abstraction layers in between might also change the sequence of calculations.
@threads should ideally not break the simulation either; however, I can imagine that race conditions make it more likely to cause weirdness if the code isn’t structured carefully.
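
As a toy illustration of the race-condition point (not taken from the simulation code): an accumulator shared between threads can silently lose updates, while writes to disjoint indices are safe.

    using Base.Threads

    function racy_sum(n)
        s = 0.0
        @threads for i in 1:n
            s += 1.0          # read-modify-write on a shared variable: threads can overwrite each other
        end
        return s              # often < n when running with several threads
    end

    function safe_sum(n)
        acc = zeros(n)
        @threads for i in 1:n
            acc[i] = 1.0      # each iteration only touches its own slot, so there is no race
        end
        return sum(acc)
    end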

MINIMAL WORKING EXAMPLE:

EDIT: It is hard to make an MWE in my case, but I would specifically like help with optimizing “PackStep()”

On my Github repository I have put the files:

  1. “CircleUniformGrid.txt” - It holds the initial points for the circular distribution, and I then draw the square profile programmatically around it using “drawSquare”. It is important for me to use this file, since this is what keeps my benchmarks consistent. The text file is just two columns of points; I hope it is not too much.
  2. “particle_SA.jl” - My implementation using StaticArrays
  3. “particle_tuple.jl” - My implementation using Tuple

Note that since last time, a neighbour search algorithm has been included, which also speeds up performance. Here is the link:

Note: what I am timing now is ONLY this loop:

@time for t = 1:100
        if mod(t,8) == 0
            balltree = BallTree(pg,reorder = false)
            idxs .= inrange(balltree, pg, R, false)
        end
        PackStep(pg,pg_tmp,u,u_tmp,nCoords,nTot,idxs)
    end

Note that the way I am making the StaticArrays initially is still a bit bad; see read2DParticleGroup. My current timings are:

StaticArrays: 0.147541 seconds (366.36 k allocations: 29.606 MiB)
tuple: 0.124091 seconds (327.98 k allocations: 30.546 MiB)

For both I have taken the lowest time they would return.
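
(Side note, not something used above: for timings this short, BenchmarkTools usually gives more stable numbers than taking the best of a few @time runs. Assuming the variables from the loop above are in scope, something along these lines works, keeping in mind that PackStep mutates its arguments, so repeated samples act on evolving state:)

    using BenchmarkTools
    @btime PackStep($pg, $pg_tmp, $u, $u_tmp, $nCoords, $nTot, $idxs)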

@dpsanders

Thanks for showing me the setindex functionality - yes, one can do it with vector operations, but I have chosen the for-loop approach since, the way I am coding it, I get much better performance with a for loop compared to vector notation - obviously I might be doing some bad stuff with dot notation, since I only know the “Matlab way” in this regard. The good news is that I managed to get static arrays working, and they are only about 5-10% slower than tuples.

@lwabeke

Thanks for the explanation about views, your thought process behind it, and the extra information about the allocs. Yes, at the time when I used naive arrays in Julia I experienced much worse performance, since for some reason I couldn’t multithread it - I agree it should have worked; maybe I made some mistake there.

It is not important that the result is 100% the same all the time, it should just not vary too much.

Thanks for your help.

Kind regards

In my experience I’ve seen a huge number of allocations with @spawn, and I suppose the same goes for the @threads macro. It looks like they generate lots of throwaway structures. Can you try to run this code without @threads? It’ll be slower of course, but it will show you how much you really allocate.

Okay, testing that now - again I am only timing the for loop which runs the main algorithm, in the function “RunAlgorithm()”.

For tuple:

0.077524 seconds (175.61 k allocations: 22.101 MiB)

For static arrays:

0.127115 seconds (321.97 k allocations: 27.006 MiB)

And I kept everything else the same; I just removed @threads as you asked. For some reason tuple became even faster. I assume the number of particles is for now too small to benefit from multiple threads…

EDIT: But StaticArrays did not become faster, so maybe there is something quite wrong in my usage of StaticArrays.
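
In case it helps with the StaticArrays comparison, here is a minimal sketch of the usage pattern that normally matches tuple performance: keep the particles as a Vector{SVector{3,Float64}} and rebuild whole SVectors instead of mutating elements (generic advice, not a diagnosis of the actual code):

    using StaticArrays

    pg = [@SVector randn(3) for _ in 1:1000]   # a Vector of immutable SVectors

    function shift!(pg, d)
        @inbounds for i in eachindex(pg)
            pg[i] = pg[i] + d    # replace the whole SVector; no element-wise mutation, no allocation
        end
        return pg
    end

    shift!(pg, SVector(0.1, 0.0, 0.0))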

Kind regards


Another thing I have found is that most of the allocations happen at the calculation of idxs:

@time for t = 1:100
        if mod(t,8) == 0
            pg_arr .= reshape(collect(Iterators.flatten(pg)),(3,nTot))
            balltree = BallTree(pg_arr,reorder = false)
            @time idxs .= inrange(balltree, pg_arr, R, false)
        end
        PackStep(pg,pg_tmp,u,u_tmp,nCoords,nTot,idxs)
    end

Which results in;

0.003922 seconds (14.98 k allocations: 1.751 MiB)
0.003879 seconds (14.90 k allocations: 1.729 MiB)
0.003807 seconds (14.81 k allocations: 1.706 MiB)
0.003827 seconds (14.71 k allocations: 1.679 MiB)
0.003668 seconds (14.62 k allocations: 1.655 MiB)
0.003727 seconds (14.55 k allocations: 1.636 MiB)
0.003757 seconds (14.50 k allocations: 1.624 MiB)
0.003770 seconds (14.48 k allocations: 1.616 MiB)
0.003686 seconds (14.47 k allocations: 1.609 MiB)
0.003686 seconds (14.44 k allocations: 1.601 MiB)
0.003835 seconds (14.39 k allocations: 1.588 MiB)
0.003655 seconds (14.37 k allocations: 1.581 MiB)
0.108574 seconds (175.98 k allocations: 22.126 MiB)

So each time the neighbour search is run, it allocates about 14k times. I thought my “.=” would make sure to replace “in place”, but I guess since idxs is an Array{Array{Int64,1},1} and the subarrays don’t necessarily have the same number of elements each time, it has to reallocate.

Maybe there is no way around this?

Kind regards

.= wouldn’t help you here, since the allocations happen inside the inrange function: NearestNeighbors.jl/inrange.jl at master · KristofferC/NearestNeighbors.jl · GitHub

So, if you do not want to allocate, one needs to rewrite inrange in a mutating form, i.e. something like this (untested):

import NearestNeighbors: check_input, check_radius, inrange_point!, NNTree

function inrange!(idxs, tree::NNTree,
                 points::Vector{T},
                 radius::Number,
                 sortres=false) where {T <: AbstractVector}
    check_input(tree, points)
    check_radius(radius)

    # clear the previously stored neighbour lists
    for i in 1:length(points)
        empty!(idxs[i])
    end

    # refill them in place instead of allocating new vectors
    for i in 1:length(points)
        inrange_point!(tree, points[i], radius, sortres, idxs[i])
    end
    return idxs
end

and in your function

@time for t = 1:100
        if mod(t,8) == 0
            pg_arr .= reshape(collect(Iterators.flatten(pg)),(3,nTot))
            balltree = BallTree(pg_arr,reorder = false)
            @time inrange!(idxs, balltree, pg_arr, R, false)
        end
        PackStep(pg,pg_tmp,u,u_tmp,nCoords,nTot,idxs)
    end

But it is not obvious whether it will give a performance boost.


Ah true, thanks for making me aware of it! I forgot how easy it is to alter functions from packages in Julia. I stuck to my tuple workflow in this case and altered the functions as follows:

import NearestNeighbors: check_input, check_radius, inrange_point!, NNTree

function inrange(tree::NNTree,
                 points::Array{Tuple,2},
                 radius::Float64,
                 sortres=false)

    idxs = [Vector{Int}() for _ in 1:length(points)]

    for i in 1:length(points)
        inrange_point!(tree, collect(points[i]), radius, sortres, idxs[i])
    end
    return idxs
end

function inrange!(idxs::Array{Array{Int64,1}},
                 tree::NNTree,
                 points::Array{Tuple,2},
                 radius::Number,
                 sortres=false)

    for i in 1:length(points)
        empty!(idxs[i])
    end
    for i in 1:length(points)
        inrange_point!(tree, collect(points[i]), radius, sortres, idxs[i])
    end
    return nothing
end

Which results in:

0.078800 seconds (39.36 k allocations: 6.499 MiB)

So there is no apparent performance loss and a lot fewer allocations compared to post number 24.

I managed to change the functions so they work on my tuples directly, so I am pretty happy.

EDIT: Considering changing BallTree as well; that will probably take a bit.

Kind regards