How should I approach GPU programming?

One more thing, which may be useful (but it wouldn’t give much since it’s O(n) and your bottleneck is O(n^2)), you can remove

    @inbounds for i = 1:nCoords
        u[i]  = u_tmp[i];
        pg[i] = pg_tmp[i];
    end

and instead do binding outside of PackStep function, something like this:

for _ in 1:number_of_steps
  PackStep(pg, pg_tmp, u, u_tmp, nCoords, nTot)
  pg, pg_tmp = pg_tmp, pg
  u, u_tmp = u_tmp, u
end

In this case it will be binding, not copying, so you can squeeze few additional nanoseconds.

P.S.: for nearest neighbours may be this thread will be relevant Cell list algorithm is slower than double for loop