Ok. So I managed to run your example. Here are some comments and possible improvements:
1. Shape of the position and velocity arrays
As mentioned above, Julia is column-major. Thus, for the coordinates of each point to be contiguous in memory in a matrix, the matrix must have dimensions (3,N)
and not (N,3)
. Thus, consider changing your code to store the points in a matrix like created with:
points = rand(Float64,3,N)
instead of rand(Float64,N,3)
like you are doing. Another possibility (which is the one I use in the examples of my package) is to use a vector of static vectors. However, with the layout above, you con convert one into the other using:
new_points = reshape(reinterpret(SVector{3,Float64},points),size(points,2))
This will not copy the whole data to the vector of static vectors, as before.
By doing that, allocations are cut in half:
julia> @time test()
0.575401 seconds (1.34 M allocations: 153.026 MiB, 4.67% gc time)
5-element Vector{Float64}:
But that had nothing to do with the actual code, because the allocations were only associated to the conversion of the input vectors.
Here is the working code with that change only:
Code
using CellListMap
using StaticArrays
using LinearAlgebra
function get_periodic_difference_cell_lists( x, y, period)
delta = abs.(x - y)
return @. ifelse(delta > 0.5 * period, delta - period, delta )* sign(x-y)
end
function compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins,sides)
d = get_periodic_difference_cell_lists(x,y,sides)
r = LinearAlgebra.norm(d)
ibin = searchsortedfirst(rbins, r) - 1
hist[1][ibin] += 1
hist[2][ibin] += LinearAlgebra.dot(velocities[i] - velocities[j],d)/r
return hist
end
function get_pairwise_velocity_radial_mean_cell_lists(
positions, velocities,
rbins,
boxsize
)
n = size(positions)[2]
r_max = maximum(rbins)
lc = LinkedLists(n)
box = Box(boxsize, r_max)
positions = reshape(reinterpret(SVector{3,Float64},positions),n)
velocities = reshape(reinterpret(SVector{3,Float64},positions),n)
initlists!(positions,box,lc)
hist = (zeros(Int,length(rbins)-1), zeros(Float64,length(rbins)-1))
hist = map_pairwise!(
(x,y,i,j,d2,hist) -> compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins, boxsize),
hist, positions, box, lc,
)
n_pairs = hist[1]
mean_v_r = hist[2]
mean_v_r[n_pairs .> 0] = mean_v_r[n_pairs .> 0]./n_pairs[n_pairs .> 0]
return mean_v_r
end
function test()
n_halos = 100_000
boxsize = 250.
Lbox = [boxsize,boxsize,boxsize]
positions = boxsize .* rand(Float64, 3, n_halos)
velocities = rand(Float64, 3, n_halos)
rbins = [0.,2.,4.,6.,8.,10.]
r_max = maximum(rbins)
get_pairwise_velocity_radial_mean_cell_lists(
positions,
velocities,
rbins,
Lbox
)
end
2. Solve allocations in get_periodic_difference_cell_lists
That function is allocating, but not because abs.(x-y)
as I initially suspected, becuse x
and y
where already static arrays. The issue was the other line:
return @. ifelse(delta > 0.5 * period, delta - period, delta )* sign(x-y)
In your case this function is receiving two static vectors and one standard array (period
). Let us optimize it. It is now:
julia> using StaticArrays, BenchmarkTools
julia> function get_periodic_difference_cell_lists( x, y, period)
delta = abs.(x - y)
return @. ifelse(delta > 0.5 * period, delta - period, delta )* sign(x-y)
end
get_periodic_difference_cell_lists (generic function with 1 method)
julia> x = rand(SVector{3,Float64}); y = rand(SVector{3,Float64}); period = rand(3);
julia> @btime get_periodic_difference_cell_lists(a,b,c) setup=(a=$x;b=$y;c=$period) evals=1
73.000 ns (1 allocation: 112 bytes)
3-element Vector{Float64}:
-0.24223519703165342
0.13349194837831435
-0.1231010078244783
as you see, it allocates (see this). The most “trivial” way to do that non-allocating is just open that one-liner:
julia> function getdx(x,y,period)
delta = x - y
d = delta > 0.5*period ? delta - period : delta
return d*sign(x-y)
end
function get_periodic_difference_cell_lists( x, y, period)
return (getdx(x[1],y[1],period[1]),
getdx(x[2],y[2],period[2]),
getdx(x[3],y[3],period[3]))
end
get_periodic_difference_cell_lists (generic function with 1 method)
julia> @btime get_periodic_difference_cell_lists(a,b,c) setup=(a=$x;b=$y;c=$period) evals=1
32.000 ns (0 allocations: 0 bytes)
(0.05372893398902234, 0.17882655229010624, 0.8462357221293377)
Ok, good enough, but that looks ugly. You can use a more fancy syntax for that:
julia> function getdx(x,y,period)
delta = x - y
return (delta > 0.5*period ? delta - period : delta)*sign(x-y)
end
get_periodic_difference_cell_lists(x,y,period) =
SVector{3,Float64}(ntuple(i -> getdx(x[i],y[i],period[i]), 3))
julia> @btime get_periodic_difference_cell_lists(a,b,c) setup=(a=$x;b=$y;c=$period) evals=1
33.000 ns (0 allocations: 0 bytes)
I prefer returning a SVector
there, but that is optional (a tuple would do fine).
That will solve all important allocations, and the code becomes slightly faster:
julia> @time test()
0.451519 seconds (106 allocations: 9.925 MiB, 2.07% gc time)
5-element Vector{Float64}:
Code updated
using CellListMap
using StaticArrays
using LinearAlgebra
function getdx(x,y,period)
delta = x - y
return (delta > 0.5*period ? delta - period : delta)*sign(x-y)
end
get_periodic_difference_cell_lists(x,y,period) =
SVector{3,Float64}(ntuple(i -> getdx(x[i],y[i],period[i]), 3))
function compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins,sides)
d = get_periodic_difference_cell_lists(x,y,sides)
r = LinearAlgebra.norm(d)
ibin = searchsortedfirst(rbins, r) - 1
hist[1][ibin] += 1
hist[2][ibin] += LinearAlgebra.dot(velocities[i] - velocities[j],d)/r
return hist
end
function get_pairwise_velocity_radial_mean_cell_lists(
positions, velocities,
rbins,
boxsize
)
n = size(positions)[2]
r_max = maximum(rbins)
lc = LinkedLists(n)
box = Box(boxsize, r_max)
positions = reshape(reinterpret(SVector{3,Float64},positions),n)
velocities = reshape(reinterpret(SVector{3,Float64},positions),n)
initlists!(positions,box,lc)
hist = (zeros(Int,length(rbins)-1), zeros(Float64,length(rbins)-1))
hist = map_pairwise!(
(x,y,i,j,d2,hist) -> compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins, boxsize),
hist, positions, box, lc,
)
n_pairs = hist[1]
mean_v_r = hist[2]
mean_v_r[n_pairs .> 0] = mean_v_r[n_pairs .> 0]./n_pairs[n_pairs .> 0]
return mean_v_r
end
function test()
n_halos = 100_000
boxsize = 250.
Lbox = [boxsize,boxsize,boxsize]
positions = boxsize .* rand(Float64, 3, n_halos)
velocities = rand(Float64, 3, n_halos)
rbins = [0.,2.,4.,6.,8.,10.]
r_max = maximum(rbins)
get_pairwise_velocity_radial_mean_cell_lists(
positions,
velocities,
rbins,
Lbox
)
end
3. My fault
Besides that, there are some difficulties in for the compiler to infer the types of the Box
and LinkedLists
structures which are my fault. They are not affecting performance here, but they create some few more allocations. While I do not solve them in the package (I hope to do that soon), you can solve them by putting a function barrier, that is, creating box
and lc
outside the main function and passing them to the hard-core calculation as parameters. This results in:
julia> @time test()
0.450560 seconds (91 allocations: 9.925 MiB)
Code
using CellListMap
using StaticArrays
using LinearAlgebra
function getdx(x,y,period)
delta = x - y
return (delta > 0.5*period ? delta - period : delta)*sign(x-y)
end
get_periodic_difference_cell_lists(x,y,period) =
SVector{3,Float64}(ntuple(i -> getdx(x[i],y[i],period[i]), 3))
function compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins,sides)
d = get_periodic_difference_cell_lists(x,y,sides)
r = LinearAlgebra.norm(d)
ibin = searchsortedfirst(rbins, r) - 1
hist[1][ibin] += 1
hist[2][ibin] += LinearAlgebra.dot(velocities[i] - velocities[j],d)/r
return hist
end
function get_pairwise_velocity_radial_mean_cell_lists(
positions, velocities,
rbins,
boxsize,
lc, box, n
)
positions = reshape(reinterpret(SVector{3,Float64},positions),n)
velocities = reshape(reinterpret(SVector{3,Float64},positions),n)
initlists!(positions,box,lc)
hist = (zeros(Int,length(rbins)-1), zeros(Float64,length(rbins)-1))
hist = map_pairwise!(
(x,y,i,j,d2,hist) -> compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins, boxsize),
hist, positions, box, lc,
)
n_pairs = hist[1]
mean_v_r = hist[2]
mean_v_r[n_pairs .> 0] = mean_v_r[n_pairs .> 0]./n_pairs[n_pairs .> 0]
return mean_v_r
end
function test()
n_halos = 100_000
boxsize = 250.
Lbox = [boxsize,boxsize,boxsize]
positions = boxsize .* rand(Float64, 3, n_halos)
velocities = rand(Float64, 3, n_halos)
rbins = [0.,2.,4.,6.,8.,10.]
n = size(positions)[2]
r_max = maximum(rbins)
lc = LinkedLists(n)
box = Box(Lbox, r_max)
get_pairwise_velocity_radial_mean_cell_lists(
positions,
velocities,
rbins,
Lbox,
lc, box, n
)
end
4. Parallelization
Now, if you provide the proper reduce function, which in your case is:
function reduce_hist(hist,hist_threaded)
hist = hist_threaded[1]
for i in 2:Threads.nthreads()
│ hist[1] .+= hist_threaded[i][1]
│ hist[2] .+= hist_threaded[i][2]
end
return hist
end
You can run that in parallel. Here I have 4 cores (8 threads). And I get:
julia> @time test()
0.126351 seconds (283 allocations: 9.942 MiB)
Code
using CellListMap
using StaticArrays
using LinearAlgebra
function getdx(x,y,period)
delta = x - y
return (delta > 0.5*period ? delta - period : delta)*sign(x-y)
end
get_periodic_difference_cell_lists(x,y,period) =
SVector{3,Float64}(ntuple(i -> getdx(x[i],y[i],period[i]), 3))
function compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins,sides)
d = get_periodic_difference_cell_lists(x,y,sides)
r = LinearAlgebra.norm(d)
ibin = searchsortedfirst(rbins, r) - 1
hist[1][ibin] += 1
hist[2][ibin] += LinearAlgebra.dot(velocities[i] - velocities[j],d)/r
return hist
end
function reduce_hist(hist,hist_threaded)
hist = hist_threaded[1]
for i in 2:Threads.nthreads()
hist[1] .+= hist_threaded[i][1]
hist[2] .+= hist_threaded[i][2]
end
return hist
end
function get_pairwise_velocity_radial_mean_cell_lists(
positions, velocities,
rbins,
boxsize,
lc, box, n
)
positions = reshape(reinterpret(SVector{3,Float64},positions),n)
velocities = reshape(reinterpret(SVector{3,Float64},positions),n)
initlists!(positions,box,lc)
hist = (zeros(Int,length(rbins)-1), zeros(Float64,length(rbins)-1))
hist = map_pairwise!(
(x,y,i,j,d2,hist) ->
compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins, boxsize),
hist, positions, box, lc,
reduce=reduce_hist
)
n_pairs = hist[1]
mean_v_r = hist[2]
mean_v_r[n_pairs .> 0] = mean_v_r[n_pairs .> 0]./n_pairs[n_pairs .> 0]
return mean_v_r
end
function test()
n_halos = 100_000
boxsize = 250.
Lbox = [boxsize,boxsize,boxsize]
positions = boxsize .* rand(Float64, 3, n_halos)
velocities = rand(Float64, 3, n_halos)
rbins = [0.,2.,4.,6.,8.,10.]
n = size(positions)[2]
r_max = maximum(rbins)
lc = LinkedLists(n)
box = Box(Lbox, r_max)
get_pairwise_velocity_radial_mean_cell_lists(
positions,
velocities,
rbins,
Lbox,
lc, box, n
)
end
Note: all times reported are running @time test()
twice, as the first run will count the compilation time and allocations.
@florpi and @etienne_dg I tested the python/cython code here (hopefully correctly). Apparently it is still somewhat faster (single threaded) than my code for 100k atoms, and it is much faster (10s vs 2 minutes) for 1 million atoms. I am mostly interested to know what is the difference. Are you computing the same thing in both cases? (florpi is building an histogram of the velocities and returning it), to compute some average over the histgoram at the end. Even if there is some difference there, probably that is not the main cause of difference. Are you using more than one mesh to classify the points in space? I will try to study the code, but any insight will be helpful.
edit: I figured out what is the algorithmic difference. @florpi I will be improving that in the next days, probably keeping keeping the interface unchanged. This is an important part of one package of mine, so I expect to keep it close to the best it can be. I hope it helps.