Pairwise computation slower than Python (Cython) code (BallTree very slow!)

Ok. So I managed to run your example. Here are some comments and possible improvements:

1. Shape of the position and velocity arrays

As mentioned above, Julia is column-major. Thus, for the coordinates of each point to be contiguous in memory, the matrix must have dimensions (3,N), not (N,3). Consider changing your code to store the points in a matrix created with:

points = rand(Float64,3,N)

instead of rand(Float64,N,3) as you are doing. Another possibility (which is the one I use in the examples of my package) is to use a vector of static vectors. With the layout above, you can convert one into the other using:

new_points = reshape(reinterpret(SVector{3,Float64},points),size(points,2))

Unlike the conversion that was happening before, this does not copy the data into a new vector of static vectors.
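A quick way to convince yourself that no copy is made is to mutate the matrix and look at the reinterpreted view (a minimal check with throwaway data):

julia> using StaticArrays

julia> points = rand(3, 4);

julia> ps = reshape(reinterpret(SVector{3,Float64}, points), size(points, 2));

julia> points[1, 1] = -1.0;

julia> ps[1][1]   # the view sees the change: same memory, no copy
-1.0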

By doing that, allocations are cut in half:

julia> @time test()
  0.575401 seconds (1.34 M allocations: 153.026 MiB, 4.67% gc time)
5-element Vector{Float64}:

But that had nothing to do with the actual code: the allocations were only associated with the conversion of the input vectors.

Here is the working code with that change only:

Code
using CellListMap
using StaticArrays
using LinearAlgebra

function get_periodic_difference_cell_lists( x, y, period)
    delta = abs.(x - y)
    return @. ifelse(delta > 0.5 * period, delta - period, delta )* sign(x-y)
end

function compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins,sides)
    d = get_periodic_difference_cell_lists(x,y,sides)
    r = LinearAlgebra.norm(d)
    ibin = searchsortedfirst(rbins, r) - 1
    hist[1][ibin] += 1
    hist[2][ibin] += LinearAlgebra.dot(velocities[i] - velocities[j],d)/r
    return hist
end

function get_pairwise_velocity_radial_mean_cell_lists(
        positions, velocities,
        rbins,
        boxsize
        )
    n = size(positions)[2]
    r_max = maximum(rbins)
    lc = LinkedLists(n)
    box = Box(boxsize, r_max)
    positions = reshape(reinterpret(SVector{3,Float64},positions),n)
    velocities = reshape(reinterpret(SVector{3,Float64},velocities),n)
    initlists!(positions,box,lc)
    hist = (zeros(Int,length(rbins)-1), zeros(Float64,length(rbins)-1))
    hist = map_pairwise!(
            (x,y,i,j,d2,hist) -> compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins, boxsize),
            hist, positions, box, lc,
    )
    n_pairs = hist[1]
    mean_v_r = hist[2]
    mean_v_r[n_pairs .> 0] = mean_v_r[n_pairs .> 0]./n_pairs[n_pairs .> 0]
    return mean_v_r
end

function test()
  n_halos = 100_000
  boxsize = 250.
  Lbox = [boxsize,boxsize,boxsize]
  positions = boxsize .* rand(Float64, 3, n_halos)
  velocities = rand(Float64, 3, n_halos)
  rbins = [0.,2.,4.,6.,8.,10.]
  r_max = maximum(rbins)
  
  get_pairwise_velocity_radial_mean_cell_lists(
    positions,
    velocities,
    rbins,
    Lbox
  )

end
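An aside on the binning inside compute_pairwise_mean_cell_lists!: since rbins is sorted, searchsortedfirst returns the index of the first edge that is greater than or equal to r, so subtracting 1 gives the histogram bin. For example:

julia> rbins = [0.,2.,4.,6.,8.,10.];

julia> searchsortedfirst(rbins, 3.5) - 1   # 3.5 falls between the edges 2.0 and 4.0
2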


2. Solve allocations in get_periodic_difference_cell_lists

That function is allocating, but not because of abs.(x - y), as I initially suspected: x and y were already static arrays. The issue was the other line:

 return @. ifelse(delta > 0.5 * period, delta - period, delta )* sign(x-y)

In your case this function is receiving two static vectors and one standard array (period). Let us optimize it. As it stands, it benchmarks like this:

julia> using StaticArrays, BenchmarkTools

julia> function get_periodic_difference_cell_lists( x, y, period)
           delta = abs.(x - y)
           return @. ifelse(delta > 0.5 * period, delta - period, delta )* sign(x-y)
       end
get_periodic_difference_cell_lists (generic function with 1 method)

julia> x = rand(SVector{3,Float64}); y = rand(SVector{3,Float64}); period = rand(3);

julia> @btime get_periodic_difference_cell_lists(a,b,c) setup=(a=$x;b=$y;c=$period) evals=1
  73.000 ns (1 allocation: 112 bytes)
3-element Vector{Float64}:
 -0.24223519703165342
  0.13349194837831435
 -0.1231010078244783

As you see, it allocates. The most “trivial” way to make it non-allocating is just to unroll that one-liner:

julia> function getdx(x,y,period)
           delta = abs(x - y)
           d = delta > 0.5*period ? delta - period : delta
           return d*sign(x-y)
       end
       function get_periodic_difference_cell_lists( x, y, period)
          return (getdx(x[1],y[1],period[1]),
                  getdx(x[2],y[2],period[2]),
                  getdx(x[3],y[3],period[3]))
       end
get_periodic_difference_cell_lists (generic function with 1 method)

julia> @btime get_periodic_difference_cell_lists(a,b,c) setup=(a=$x;b=$y;c=$period) evals=1
  32.000 ns (0 allocations: 0 bytes)
(0.05372893398902234, 0.17882655229010624, 0.8462357221293377)

Ok, good enough, but that looks ugly. You can use fancier syntax for that:

julia> function getdx(x,y,period)
           delta = abs(x - y)
           return (delta > 0.5*period ? delta - period : delta)*sign(x-y)
       end
       get_periodic_difference_cell_lists(x,y,period) = 
           SVector{3,Float64}(ntuple(i -> getdx(x[i],y[i],period[i]), 3))
          
julia> @btime get_periodic_difference_cell_lists(a,b,c) setup=(a=$x;b=$y;c=$period) evals=1
  33.000 ns (0 allocations: 0 bytes)

I prefer returning an SVector there, but that is optional (a tuple would do fine).

That will solve all important allocations, and the code becomes slightly faster:

julia> @time test()
  0.451519 seconds (106 allocations: 9.925 MiB, 2.07% gc time)
5-element Vector{Float64}:
Code updated
using CellListMap
using StaticArrays
using LinearAlgebra

function getdx(x,y,period)
    delta = abs(x - y)
    return (delta > 0.5*period ? delta - period : delta)*sign(x-y)
end
get_periodic_difference_cell_lists(x,y,period) = 
    SVector{3,Float64}(ntuple(i -> getdx(x[i],y[i],period[i]), 3))

function compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins,sides)
    d = get_periodic_difference_cell_lists(x,y,sides)
    r = LinearAlgebra.norm(d)
    ibin = searchsortedfirst(rbins, r) - 1
    hist[1][ibin] += 1
    hist[2][ibin] += LinearAlgebra.dot(velocities[i] - velocities[j],d)/r
    return hist
end

function get_pairwise_velocity_radial_mean_cell_lists(
        positions, velocities,
        rbins,
        boxsize
        )
    n = size(positions)[2]
    r_max = maximum(rbins)
    lc = LinkedLists(n)
    box = Box(boxsize, r_max)
    positions = reshape(reinterpret(SVector{3,Float64},positions),n)
    velocities = reshape(reinterpret(SVector{3,Float64},velocities),n)
    initlists!(positions,box,lc)
    hist = (zeros(Int,length(rbins)-1), zeros(Float64,length(rbins)-1))
    hist = map_pairwise!(
            (x,y,i,j,d2,hist) -> compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins, boxsize),
            hist, positions, box, lc,
    )
    n_pairs = hist[1]
    mean_v_r = hist[2]
    mean_v_r[n_pairs .> 0] = mean_v_r[n_pairs .> 0]./n_pairs[n_pairs .> 0]
    return mean_v_r
end

function test()
  n_halos = 100_000
  boxsize = 250.
  Lbox = [boxsize,boxsize,boxsize]
  positions = boxsize .* rand(Float64, 3, n_halos)
  velocities = rand(Float64, 3, n_halos)
  rbins = [0.,2.,4.,6.,8.,10.]
  r_max = maximum(rbins)
  
  get_pairwise_velocity_radial_mean_cell_lists(
    positions,
    velocities,
    rbins,
    Lbox
  )

end


3. My fault

Besides that, the compiler has some difficulties inferring the types of the Box and LinkedLists structures, which are my fault. They are not affecting performance here, but they create a few more allocations. Until I solve them in the package (I hope to do that soon), you can work around them with a function barrier, that is, by creating box and lc outside the main function and passing them to the core calculation as parameters. This results in:

julia> @time test()
  0.450560 seconds (91 allocations: 9.925 MiB)
Code
using CellListMap
using StaticArrays
using LinearAlgebra

function getdx(x,y,period)
    delta = abs(x - y)
    return (delta > 0.5*period ? delta - period : delta)*sign(x-y)
end
get_periodic_difference_cell_lists(x,y,period) = 
    SVector{3,Float64}(ntuple(i -> getdx(x[i],y[i],period[i]), 3))

function compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins,sides)
    d = get_periodic_difference_cell_lists(x,y,sides)
    r = LinearAlgebra.norm(d)
    ibin = searchsortedfirst(rbins, r) - 1
    hist[1][ibin] += 1
    hist[2][ibin] += LinearAlgebra.dot(velocities[i] - velocities[j],d)/r
    return hist
end

function get_pairwise_velocity_radial_mean_cell_lists(
        positions, velocities,
        rbins,
        boxsize, 
        lc, box, n
        )
    positions = reshape(reinterpret(SVector{3,Float64},positions),n)
    velocities = reshape(reinterpret(SVector{3,Float64},velocities),n)
    initlists!(positions,box,lc)
    hist = (zeros(Int,length(rbins)-1), zeros(Float64,length(rbins)-1))
    hist = map_pairwise!(
            (x,y,i,j,d2,hist) -> compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins, boxsize),
            hist, positions, box, lc,
    )
    n_pairs = hist[1]
    mean_v_r = hist[2]
    mean_v_r[n_pairs .> 0] = mean_v_r[n_pairs .> 0]./n_pairs[n_pairs .> 0]
    return mean_v_r
end

function test()
  n_halos = 100_000
  boxsize = 250.
  Lbox = [boxsize,boxsize,boxsize]
  positions = boxsize .* rand(Float64, 3, n_halos)
  velocities = rand(Float64, 3, n_halos)
  rbins = [0.,2.,4.,6.,8.,10.]
  
  n = size(positions)[2]
  r_max = maximum(rbins)
  lc = LinkedLists(n)
  box = Box(Lbox, r_max)

  get_pairwise_velocity_radial_mean_cell_lists(
    positions,
    velocities,
    rbins,
    Lbox,
    lc, box, n
  )

end
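For the record, the function-barrier idea in isolation looks like this (a generic sketch with made-up names, not code from the package):

# The return type of this function cannot be inferred by the compiler:
unstable() = rand(Bool) ? Int[1, 2, 3] : Float64[1, 2, 3]

# The "barrier": compiled separately for each concrete argument type.
kernel(v) = sum(v)

function run_barrier()
    v = unstable()    # here the type of v is uncertain
    return kernel(v)  # one dynamic dispatch; inside kernel everything is typed
end

Creating lc and box in test() and passing them as arguments plays exactly the role of the call to kernel here.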


4. Parallelization

Now, if you provide the proper reduce function, which in your case is:

function reduce_hist(hist,hist_threaded)
  hist = hist_threaded[1]
  for i in 2:Threads.nthreads()
    hist[1] .+= hist_threaded[i][1]
    hist[2] .+= hist_threaded[i][2]
  end
  return hist
end
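Note that for this to actually run in parallel, Julia must be started with threads enabled, e.g. with julia -t 8 or by setting the JULIA_NUM_THREADS environment variable; you can check with:

julia> Threads.nthreads()
8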

You can then run that in parallel. Here I have 4 cores (8 threads), and I get:

julia> @time test()
  0.126351 seconds (283 allocations: 9.942 MiB)
Code
using CellListMap
using StaticArrays
using LinearAlgebra

function getdx(x,y,period)
    delta = abs(x - y)
    return (delta > 0.5*period ? delta - period : delta)*sign(x-y)
end
get_periodic_difference_cell_lists(x,y,period) = 
    SVector{3,Float64}(ntuple(i -> getdx(x[i],y[i],period[i]), 3))

function compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins,sides)
    d = get_periodic_difference_cell_lists(x,y,sides)
    r = LinearAlgebra.norm(d)
    ibin = searchsortedfirst(rbins, r) - 1
    hist[1][ibin] += 1
    hist[2][ibin] += LinearAlgebra.dot(velocities[i] - velocities[j],d)/r
    return hist
end

function reduce_hist(hist,hist_threaded)
  hist = hist_threaded[1]
  for i in 2:Threads.nthreads()
    hist[1] .+= hist_threaded[i][1]
    hist[2] .+= hist_threaded[i][2]
  end
  return hist
end

function get_pairwise_velocity_radial_mean_cell_lists(
  positions, velocities,
  rbins,
  boxsize, 
  lc, box, n
)
    positions = reshape(reinterpret(SVector{3,Float64},positions),n)
    velocities = reshape(reinterpret(SVector{3,Float64},velocities),n)
    initlists!(positions,box,lc)
    hist = (zeros(Int,length(rbins)-1), zeros(Float64,length(rbins)-1))
    hist = map_pairwise!(
      (x,y,i,j,d2,hist) -> 
      compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins, boxsize),
      hist, positions, box, lc,
      reduce=reduce_hist
    )
    n_pairs = hist[1]
    mean_v_r = hist[2]
    mean_v_r[n_pairs .> 0] = mean_v_r[n_pairs .> 0]./n_pairs[n_pairs .> 0]
    return mean_v_r
end

function test()
    n_halos = 100_000
    boxsize = 250.
    Lbox = [boxsize,boxsize,boxsize]
    positions = boxsize .* rand(Float64, 3, n_halos)
    velocities = rand(Float64, 3, n_halos)
    rbins = [0.,2.,4.,6.,8.,10.]
      
    n = size(positions)[2]
    r_max = maximum(rbins)
    lc = LinkedLists(n)
    box = Box(Lbox, r_max)

    get_pairwise_velocity_radial_mean_cell_lists(
      positions,
      velocities,
      rbins,
      Lbox,
      lc, box, n
    )

end

Note: all times reported are from the second run of @time test(), since the first run also counts compilation time and allocations.

@florpi and @etienne_dg I tested the python/cython code here (hopefully correctly). Apparently it is still somewhat faster (single-threaded) than my code for 100k atoms, and it is much faster (10 s vs. 2 minutes) for 1 million atoms. I am mostly interested in knowing where the difference comes from. Are you computing the same thing in both cases? (florpi is building a histogram of the velocities and returning it, to compute some average over the histogram at the end.) Even if there is some difference there, it is probably not the main cause. Are you using more than one mesh to classify the points in space? I will try to study the code, but any insight will be helpful.

edit: I figured out what the algorithmic difference is. @florpi I will be improving that in the next days, probably keeping the interface unchanged. This is an important part of one of my packages, so I expect to keep it close to the best it can be. I hope it helps.


Your allocating inner function can also be written like this:

julia> function get_periodic_difference_cell_lists( x, y, period)
                  delta = abs.(x - y)
                  return ifelse.(delta .> 0.5 * period, delta - period, delta ) .* sign.(x-y)
              end
get_periodic_difference_cell_lists (generic function with 1 method)

julia> @btime get_periodic_difference_cell_lists(a,b,c) setup=(a=$x;b=$y;c=$period) evals=1
  31.000 ns (0 allocations: 0 bytes)
3-element SVector{3, Float64} with indices SOneTo(3):
 0.29932718543852777
 0.213959988559693
 0.14622802875517515

which is pretty much the same as what you had, except that there is a dot in .>, another one in .*, and another one in sign.(x-y), so the whole expression fuses into a single broadcast.
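For what it is worth, if period is also converted to a static vector (an assumption; in the setup above it was a plain Vector), the dotted function stays allocation-free and the result is easy to check by hand:

julia> x = SVector(1.0, 2.0, 3.0); y = SVector(0.0, 0.0, 0.0); period = SVector(2.0, 2.0, 2.0);

julia> get_periodic_difference_cell_lists(x, y, period)  # all operands static: fused broadcast, no heap allocation
3-element SVector{3, Float64} with indices SOneTo(3):
 1.0
 0.0
 1.0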

Thanks so much! May I ask what the algorithmic difference you found is? I was finding the same thing: your code is faster for fewer than 100k particles and scales badly as the number of particles approaches 1 million.

Also, this is odd: when I run your code (exactly as it is, with 100_000 particles) I still get lots of allocations,

julia> @time test()
  0.876833 seconds (1.54 M allocations: 56.900 MiB, 0.99% gc time)

Here is the exact code,

using CellListMap
using StaticArrays
using LinearAlgebra

export test

function getdx(x,y,period)
    delta = abs(x - y)
    return (delta > 0.5*period ? delta - period : delta)*sign(x-y)
end
get_periodic_difference_cell_lists(x,y,period) = 
    SVector{3,Float64}(ntuple(i -> getdx(x[i],y[i],period[i]), 3))

function compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins,sides)
    d = get_periodic_difference_cell_lists(x,y,sides)
    r = LinearAlgebra.norm(d)
    ibin = searchsortedfirst(rbins, r) - 1
    hist[1][ibin] += 1
    hist[2][ibin] += LinearAlgebra.dot(velocities[i] - velocities[j],d)/r
    return hist
end

function get_pairwise_velocity_radial_mean_cell_lists(
        positions, velocities,
        rbins,
        boxsize, 
        lc, box, n
        )
    positions = reshape(reinterpret(SVector{3,Float64},positions),n)
    velocities = reshape(reinterpret(SVector{3,Float64},velocities),n)
    initlists!(positions,box,lc)
    hist = (zeros(Int,length(rbins)-1), zeros(Float64,length(rbins)-1))
    hist = map_pairwise!(
            (x,y,i,j,d2,hist) -> compute_pairwise_mean_cell_lists!(x,y,i,j,d2,hist,velocities, rbins, boxsize),
            hist, positions, box, lc,
    )
    n_pairs = hist[1]
    mean_v_r = hist[2]
    mean_v_r[n_pairs .> 0] = mean_v_r[n_pairs .> 0]./n_pairs[n_pairs .> 0]
    return mean_v_r
end

function test()
  n_halos = 100_000
  boxsize = 250.
  Lbox = [boxsize,boxsize,boxsize]
  positions = boxsize .* rand(Float64, 3, n_halos)
  velocities = rand(Float64, 3, n_halos)
  rbins = [0.,2.,4.,6.,8.,10.]
  
  n = size(positions)[2]
  r_max = maximum(rbins)
  lc = LinkedLists(n)
  box = Box(Lbox, r_max)

  get_pairwise_velocity_radial_mean_cell_lists(
    positions,
    velocities,
    rbins,
    Lbox,
    lc, box, n
  )

end

If using @time (not @btime) you need to run it at least twice, because the first time it will compile the code and count the compilation:

julia> @time test()
  1.808631 seconds (3.15 M allocations: 186.783 MiB, 3.16% gc time, 75.71% compilation time)
5-element Vector{Float64}:
 0.023388565623097612
 0.0711147862768923
 0.07047094230633603
 0.12305782030896224
 0.2287768107932661

julia> @time test()
  0.443453 seconds (91 allocations: 9.925 MiB)
5-element Vector{Float64}:
 0.20470298373075935
 0.10981588612143005
 0.07945765890176218
 0.03251808023282381
 0.07786507040700635

The best way to benchmark is with @btime or @benchmark, as you were doing; I was being lazy (@btime takes longer to evaluate, because it runs the function several times).
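For completeness, that is just:

julia> using BenchmarkTools

julia> @btime test();   # runs test() many times and reports the minimum time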

Concerning the algorithm: the difference is between looping over the cells and computing the interactions between particles of neighbouring cells, or looping over the particles and computing the interactions with the particles in the neighbouring cells of each particle. If the density is constant, the second approach might be better (that is what I am doing), because the main loop does not depend on the number of cells. If the density increases (which was happening in the examples here), however, the scaling is worse: the loop over the particles is O(n), and for each particle one loops over a fixed volume whose population also grows with n, so the total cost approaches O(n²). For 10 million particles at constant density, what I was doing is faster than what is implemented in the python version, for example. You can play with that with input options like this:

  density = 10^4/250^3  # density of the original problem
  n_halos = 10_000_000
  boxsize = (n_halos / density)^(1/3)

In that case my code takes 50s and the other 72 seconds, for example.

I will try to implement something that is good in both cases, or maybe estimate which is best and fall back to one or the other alternative. Another case is that of very small systems, for which computing all distances is just faster. Probably I will keep that as a fallback as well.
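For very small systems, that fallback could be as simple as a naive double loop over all pairs. A sketch with made-up names, ignoring periodic boundaries, and not the package's actual API:

# Naive O(N^2) pair mapping, with the same inner-function signature as map_pairwise!:
function map_pairwise_naive!(f, output, positions, cutoff)
    cutoff2 = cutoff^2
    for i in eachindex(positions)
        for j in i+1:lastindex(positions)
            d2 = sum(abs2, positions[i] - positions[j])
            d2 > cutoff2 && continue
            output = f(positions[i], positions[j], i, j, d2, output)
        end
    end
    return output
end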

@florpi : Those two runs I obtained by running exactly what you sent me, copying and pasting from there. If you are not seeing the same thing, we may have to check whether some package version or something else is affecting the outcome.


Sorry, I should have specified that I ran @time twice,

julia> @time test()
  3.936829 seconds (5.13 M allocations: 231.994 MiB, 7.03% gc time)
5-element Array{Float64,1}:
 -0.2990130399226635
 -0.3012712366702908
 -0.19061336525482417
 -0.09103851221773775
 -0.03399958777669336

julia> @time test()
  0.825883 seconds (1.54 M allocations: 56.898 MiB, 0.96% gc time)
5-element Array{Float64,1}:
  0.008097380143777392
 -0.001806040457768022
 -0.07067025952642321
 -0.05817722604771049
  0.00846907227974683

Could it be the Julia version I’m using? It’s 1.5.3.

Also @lmiq, I found a strange bug: when one of the coordinates is exactly half the boxsize, I get a bounds error when running initlists!

ERROR: LoadError: TaskFailedException:
BoundsError: attempt to access 20×20×20 LinearIndices{3,Tuple{Base.OneTo{Int64},Base.OneTo{Int64},Base.OneTo{Int64}}} at index [10, 21, 8]
Stacktrace:
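If it helps, my guess at the mechanism (not the actual package code) is that a coordinate sitting exactly on a cell boundary produces a floor-based cell index one past the last cell, e.g., assuming coordinates are internally shifted by half the box:

julia> boxsize = 250.0; ncells = 20; cell_size = boxsize / ncells;

julia> x = boxsize / 2;   # coordinate exactly at half the box size

julia> floor(Int, (x + boxsize / 2) / cell_size) + 1   # 21, but there are only 20 cells
21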

Probably… I am using 1.6.1.

Anyway, thanks for reporting both problems. I will solve the second one while I improve the implementation of the algorithm, and I will also check whether those allocations are “my fault” in 1.5.3; even if they do not appear in 1.6.1, it might be a point for improvement of the code.


Just an update here, for the record. In recent days I have improved the implementation of CellListMap.jl, which is an alternative for computing exactly the quantity the OP wanted. It is also general, in the sense that it can compute other properties. With the current version (0.4.3) I obtain performance similar to that of halotools for dense systems, and for constant density and an increasing number of particles the package scales really well (linearly) and is quite efficient:

[Benchmark plots comparing CellListMap.jl (v0.4.3) with halotools: execution time as a function of the number of particles.]

I think we can now say that we have a good alternative for the OP's original question, for sure.
