[ANN] SpatialHashTables.jl

SpatialHashTables.jl (v0.2.0)

This package implements a cell list method via spatial hash tables in Julia.

The primary aim is to speed up computations of short-range interactions between particles in large particle systems, e.g. for terms like F_i = \sum_{j} F(X_i, X_j) for given X_1,\dots,X_N \in \mathbb{R}^d, provided that F(X_i, X_j) \approx 0 whenever \Vert X_i - X_j \Vert > R for a cutoff radius R.
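For reference, the brute-force version of such a sum is a quadratic double loop over all pairs; here is a minimal sketch (with a generic pairwise force F as a placeholder):

using StaticArrays, LinearAlgebra

# Brute-force reference: O(N^2) loop over all pairs. Cell lists avoid this by
# only visiting pairs with distance below the cutoff R.
function brute_force!(Fout, X, F, R)
    for i in eachindex(X), j in eachindex(X)
        if i != j && norm(X[i] - X[j]) <= R
            Fout[i] += F(X[i], X[j])
        end
    end
    return Fout
end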

The aim is similar to that of the excellent CellListMap.jl package. (@lmiq I’m also open to merging, it was just easier to try the idea in a new package first.)

As of now, the main features of SpatialHashTables.jl are:

  • Support for unbounded domains and also bounded domains,
  • No allocations, neither during the update of positions nor during the iteration over all neighbours (for single-threaded code).
  • Can be used on GPUs (but still a bit slow),
  • Minimal but flexible interface, arbitrary dimension, fast CPU multi-threading.
  • Core implementation is less than 200 lines of code (without docs). :sunglasses:

Is it fast?

  • On bounded domains, the implementation has comparable performance to CellListMap.jl for single-threaded calls, but is 2x or more slower in the multi-threaded case [*], see benchmarks and lmiq’s example.

  • For unbounded domains, I lack a reference to compare to, but it seems to be reasonably fast.

  • GPU performance is not optimal yet; on my system it is about 10x slower than the CPU. I’m working on it :wink:

[*] Note that my current implementation visits both (i,j) and (j,i), which might roughly double the runtime. This might be fixed in the future, but currently, GPU support is the main focus.

Example usage

Please check out the readme and the docs for more details.

Below, we compute F_i = \sum_{j} V(\Vert X_i - X_j \Vert) (X_i - X_j) for given points X_1,\dots,X_N \in \mathbb{R}^3 and a function V(d) = 1/d^3. Note that we only need to provide a tablesize (the number of hashes) and a cellsize (which defines the binning of particles into cells). No domain bounds are needed, which is often convenient.

using StaticArrays, SpatialHashTables

N = 1_000_000
X = randn(SVector{3, Float64}, N)
F = zeros(SVector{3, Float64}, N)
V(d) = 1/d^3
r = 1/N^(1/3)

cellsize = 1/N^(1/3)
tablesize = 1_000_000
ht = SpatialHashTable(X, cellsize, tablesize)

Threads.@threads for i in eachindex(X)
    for j in neighbours(ht, X[i], r)
        d² = sum(x -> x^2, X[i] - X[j])
        if 0 < d² < r^2
            d = sqrt(d²)
            F[i] += V(d) * (X[i] - X[j])
        end
    end
end

# Timing: 0.230324 seconds (190 allocations: 20.023 KiB)

For now, I decided to leave the parallelization mostly up to the user, to give maximal control and to allow uses beyond energy/force computations, e.g. collision detection (see the sketch below). There are no helper functions such as CellListMap.map_pairwise and also no support for periodic domains.
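For illustration, a rough single-threaded collision-detection sketch, reusing ht, X and r from the example above and treating particles as spheres of radius r/2, could look like this:

# Sketch: collect all overlapping pairs (i, j) with i < j.
collisions = Tuple{Int, Int}[]
for i in eachindex(X)
    for j in neighbours(ht, X[i], r)
        if i < j && sum(x -> x^2, X[i] - X[j]) < r^2
            push!(collisions, (i, j))
        end
    end
end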


I hope this package is useful for the Julia ecosystem. If you have any questions or comments, please open an issue or contact me.

12 Likes

Very nice! I like the idea of providing an iterator as the interface. I played a little with that idea in CellListMap.jl to improve the interoperability with Agents.jl, but didn’t have the time to develop it further.

Most of the headaches I have with CellListMap.jl were to support general (triclinic) periodic boundary conditions, is that in your plans?

Concerning the parallelization, the issue I have is that at some point (many cores) constructing the lists becomes the bottleneck, because that is hard to parallelize. Is the construction of the tables in your method easy to parallelize?

1 Like

Update: Improved GPU support, but still a bit slow… :wink:

Since the previous implementation had super slow GPU performance, I rewrote the entire package.
Now the GPU performance is at least on par with parallel CPU performance via KernelAbstractions.jl.

The package is still a work in progress and I hope to be able to make it significantly faster on GPUs. Right now, you should probably use CellListMap.jl instead…

GPU Example

The design of the package is to be as hands-off as possible. Essentially, it provides the HashGrid type and the HashGridQuery iterator. Only the function updatecells!(grid, X) is internally implemented in parallel; the user is expected/allowed to implement the energy/force computation in parallel themselves, e.g.

using SpatialHashTables, StaticArrays, LinearAlgebra
using CUDA, KernelAbstractions

const SVec3f = SVector{3,Float32}
IndexType = CUDA.CuVector{Int32}           # backend-friendly index type

N = 10_000
X = cu(rand(SVec3f, N))                # points
r = Float32(1/N^(1/3))                 # cutoff
bnds = (SVec3f(0,0,0), SVec3f(1,1,1))  # domain bounds

grid = HashGrid{IndexType}(X, bnds..., r)  # automatically creates cells

# implement kernel for force computation
@kernel function force_kernel!(F, @Const(X), r, grid)
    i = @index(Global)
    
    Xi = X[i]
    Fi = zero(SVec3f)

    for j in HashGridQuery(grid, Xi, r)
        Xij = Xi - X[j]
        d² = sum(z -> z^2, Xij)
        if 0 < d² < r^2
            Fi += Xij / sqrt(d²)
        end 
    end
    F[i] = Fi    
end

# call kernel
F = similar(X)

force_kernel!(grid.backend, numthreads(grid))(
        F, X, r, grid, ndrange = length(X)
    )
synchronize(grid.backend)   # wait for the GPU kernel to finish

Performance

Currently, CellListMap.jl is faster in the atomic and galactic benchmarks. These are cases with many particles per computational cell.

This package might be faster for cases with fewer particles per computational cell, which is more common in applications like collision detection (my core motivation). I uploaded some graphs here:
current benchmarks.

I’m currently trying to figure out why exactly the packages behave so differently.

Next steps

I still believe the GPU performance can be much better. That’s the main focus.

Once I’m satisfied with the performance, I’ll update the documentation.

1 Like

A very late reply (sorry for that).

Currently, I don’t aim at general triclinic boundary conditions, as I’m still behind in terms of performance and try to minimize possible performance traps.

For the parallelization of the cell lists: the old version was serial, but I now changed it to parallel updates of the cells via updatecells!(grid, X), see also the implementation.
The strategy was taken from GPU Gems 3 and the HashGrid type in NVIDIA’s Warp (Python); a serial sketch of the steps is shown after the list:

  • first, compute cell indices in parallel and write two vectors:
    • pointidxs = 1:N and cellidxs = 'cellidx of point i'
  • then permute both lists such that the cellidxs are sorted, which gives something like
    • cellidxs  = [1,   1, 1, 1,  2,  2, 3, 4, ...]
      pointidxs = [52, 21, 2, 3, 11, 13, 5, 7, ...] 
      
    • this can be done in parallel, ideally with Radix sort, but Julia only has QuickSort and Bitonic sort on GPUs.
  • now, one can find the starts and ends of each cell by checking if cellidxs[i] != cellidxs[i-1] with a parallel computation over all points.
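A serial sketch of these three steps (with hypothetical helper names, not the actual SpatialHashTables.jl internals) could look like this:

using StaticArrays

# Sketch of the construction: cell index per point, sort by cell index,
# then detect the start of each cell where the sorted index changes.
function buildcells(X, cellsize, gridsize)
    N = length(X)
    # step 1: linear cell index of each point on a gridsize^3 grid
    cellidx(x) = begin
        c = clamp.(floor.(Int, x ./ cellsize) .+ 1, 1, gridsize)
        c[1] + (c[2] - 1) * gridsize + (c[3] - 1) * gridsize^2
    end
    cellidxs  = map(cellidx, X)
    pointidxs = collect(1:N)
    # step 2: sort both vectors by cell index (on GPUs ideally a radix sort)
    p = sortperm(cellidxs)
    cellidxs, pointidxs = cellidxs[p], pointidxs[p]
    # step 3: a cell starts wherever the sorted cell index changes
    cellstarts = [i for i in 1:N if i == 1 || cellidxs[i] != cellidxs[i-1]]
    return pointidxs, cellidxs, cellstarts
end
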
1 Like

I tried the original code and compared it to NearestNeighbors.jl:

using StaticArrays

const N = 1_000_000
const X = randn(SVector{3, Float64}, N)
const V(d) = 1/d^3
const r = 1/N^(1/3)

function compute_force(Xi, Xj, V, r²)
    Xdiff = Xi - Xj
    d² = sum(x -> x^2, Xdiff)
    if 0 < d² < r²
        d = sqrt(d²)
        return V(d) * Xdiff
    end
    return zero(Xi)
end

using SpatialHashTables
function compute_forces_sht(X, r)
    F = zeros(SVector{3, Float64}, N)
    cellsize = 1/N^(1/3)
    tablesize = 1_000_000
    @time ht = SpatialHashTable(X, cellsize, tablesize)
    r² = r^2
    @time for i in eachindex(X)
        for j in neighbours(ht, X[i], r)
           F_ij = compute_force(X[i], X[j], V, r²)
           F[i] += F_ij
        end
    end
    return F
end

using NearestNeighbors
function compute_forces_kdtree(X)
    F = zeros(SVector{3, Float64}, N)
    r² = r^2
    @time kdtree = KDTree(X)
    @time for i in eachindex(X)
        for j in inrange(kdtree, X[i], r)
            F_ij = compute_force(X[i], X[j], V, r²)
            F[i] += F_ij
        end
    end
    return F
end


@info "SHT"
F_sht = compute_forces_sht(X, r)
@info "KD"
F_KD = compute_forces_kdtree(X)

using LinearAlgebra
@show sum(norm, F_sht)
@show sum(norm, F_KD)

I got surprisingly slow results from SpatialHashTables:

[ Info: SHT
  0.029648 seconds (30 allocations: 15.260 MiB)
  9.438751 seconds (303.23 M allocations: 14.228 GiB, 6.47% gc time, 0.05% compilation time)
  9.470024 seconds (303.23 M allocations: 14.265 GiB, 6.45% gc time, 0.05% compilation time)
[ Info: KD
  0.259527 seconds (9 allocations: 39.673 MiB)
  0.565318 seconds (2.00 M allocations: 144.959 MiB, 3.59% gc time)
  0.829747 seconds (2.00 M allocations: 207.524 MiB, 2.82% gc time)
sum(norm, F_sht) = 2.7183271305836573e9
sum(norm, F_KD) = 2.7183271305836573e9

I realized this is because the constructor for a SpatialHashTable is type unstable, so using it in the same function it is created in is slow:

@code_warntype compute_forces_sht(X, r)
...
  ht::SpatialHashTable{_A, Vector{Int64}} where _A

Fixing that issue by splitting out the loop into another function:

using SpatialHashTables
function compute_forces_sht(X, r)
    F = zeros(SVector{3, Float64}, N)
    cellsize = 1/N^(1/3)
    tablesize = 1_000_000
    @time ht = SpatialHashTable(X, cellsize, tablesize)
    _compute_forces_sht!(F, ht, X, r)
    return F
end

function _compute_forces_sht!(F, ht, X, r)
    r² = r^2
    @time for i in eachindex(X)
        for j in neighbours(ht, X[i], r)
           F_ij = compute_force(X[i], X[j], V, r²)
           F[i] += F_ij
        end
    end
end
julia> include("tree.jl")
[ Info: SHT
  0.024163 seconds (30 allocations: 15.260 MiB)
  2.215865 seconds (6 allocations: 3.234 KiB)
  2.316689 seconds (55.75 k allocations: 41.614 MiB, 3.08% compilation time)
[ Info: KD
  0.259127 seconds (9 allocations: 39.673 MiB)
  0.534359 seconds (2.00 M allocations: 137.329 MiB, 0.66% gc time)
  0.799294 seconds (2.00 M allocations: 199.894 MiB, 0.44% gc time)
sum(norm, F_sht) = 2.747263313187687e9
sum(norm, F_KD) = 2.747263313187687e9

showing that it might be possible to squeeze out some more performance from SpatialHashTables. It might also be good to show examples of use cases where a straight-up KDTree is not the best data structure and where SpatialHashTables beats it.

2 Likes

Thanks for trying it and also for the tip with the inner loop!

A quick note first: I haven’t updated the registry yet, so the newest changes are on #main.

So the comparison above is not against the quickest version I came up with, but yes: theoretically, spatial hash tables could be great for NND when one has a concentrated cluster within a large but otherwise empty domain.

I’ll try your benchmark tomorrow to see if there is some performance to squeeze out! Looking at type instabilities of the inner loop indeed seems important (even though I had hoped that the new version on main had fixed it).

I worked on some updates. Now, the Julia implementation is close to the NVIDIA Warp implementation (only ~25% slower) and, in particular, it is sometimes faster than the parallel CPU version. (The code is still only on #master.)

Here is an example with few particles per cell:

(benchmark plot: uniform case)

As noted before, CellListMap.jl is faster when the density is large compared to the cutoff. But I think the difference now is that CellListMap.jl uses the symmetry to compute only half of the pairs, plus maybe a better distribution of the tasks:

(benchmark plot: xatomic case)

(Sorry for the mixed colors.)


@kristoffer.carlsson I updated your comparison code with NearestNeighbors.jl to the current SpatialHashTables#master. Thanks for the pointer to the type instability! Interestingly, even the new version had another instability issue which I found thanks to @kristoffer.carlsson’s example :wink:

The following runtimes are from a fresh Julia session, showing the first and the second run. Of course, KernelAbstractions.jl probably hides some allocations here…

[ Info: SHT
  3.943116 seconds (614.19 k allocations: 110.804 MiB, 0.42% gc time, 93.32% compilation time)
  3.350001 seconds (926 allocations: 68.302 MiB, 1.74% gc time)
[ Info: SHT (CPU parallel)
  0.608945 seconds (531.87 k allocations: 104.755 MiB, 4.18% gc time, 1553.41% compilation time)
  0.186577 seconds (1.38 k allocations: 68.351 MiB, 11.05% gc time)
[ Info: KD
  1.023193 seconds (2.00 M allocations: 199.890 MiB, 3.07% gc time)
  1.005817 seconds (2.00 M allocations: 199.890 MiB, 2.77% gc time)
Source code
# source code copied from kristoffer.carlsson
# https://discourse.julialang.org/t/ann-spatialhashtables-jl/106553/5

using StaticArrays

const N = 1_000_000
const X = randn(SVector{3, Float64}, N)
const V(d) = 1/d^3
const r = 1/N^(1/3)

function compute_force(Xi, Xj, V, r²)
    Xdiff = Xi - Xj
    d² = sum(x -> x^2, Xdiff)
    if 0 < d² < r²
        d = sqrt(d²)
        return V(d) * Xdiff
    end
    return zero(Xi)
end

function inner_sht(F, X, grid, i, r, r²)
    Xi = X[i]
    Fi = zero(Xi)
    for j in HashGridQuery(grid, Xi, r)
       F_ij = compute_force(Xi, X[j], V, r²)
       Fi += F_ij
    end
    F[i] = Fi
end

using SpatialHashTables
function compute_forces_sht(X, r)
    F = zeros(SVector{3, Float64}, N)
    cellsize = 1/N^(1/3)
    grid = HashGrid(X, @SVector[0.0,0.0,0.0], @SVector[1.0,1.0,1.0], cellsize; nthreads = Val(1))

    r² = r^2
    for i in eachindex(X)
        inner_sht(F, X, grid, i, r, r²)
    end
    return F
end

using KernelAbstractions
@kernel function inner_sht_parallel!(F, X, grid, r, r²)
    i = @index(Global)
    Xi = X[i]
    Fi = zero(Xi)
    for j in HashGridQuery(grid, Xi, r)
       F_ij = compute_force(Xi, X[j], V, r²)
       Fi += F_ij
    end
    F[i] = Fi
end


function compute_forces_sht_parallel(X, r)
    F = zeros(SVector{3, Float64}, N)
    cellsize = 1/N^(1/3)
    grid = HashGrid(X, @SVector[0.0,0.0,0.0], @SVector[1.0,1.0,1.0], cellsize; nthreads = Val(24))

    r² = r^2
    inner_sht_parallel!(grid.backend)(F, X, grid, r, r², ndrange = length(F))
    synchronize(grid.backend)
    return F
end



function inner_kdtree(F, X, kdtree, i, r, r²)
    Xi = X[i]
    Fi = zero(Xi) 
    for j in inrange(kdtree, Xi, r)
        F_ij = compute_force(Xi, X[j], V, r²)
        Fi += F_ij
    end
    F[i] = Fi
end

using NearestNeighbors
function compute_forces_kdtree(X)
    F = zeros(SVector{3, Float64}, N)
    r² = r^2
    kdtree = KDTree(X)
    for i in eachindex(X) 
        inner_kdtree(F, X, kdtree, i, r, r²)
    end
    return F
end


@info "SHT"
@time F_sht = compute_forces_sht(X, r)
@time F_sht = compute_forces_sht(X, r)

@info "SHT (parallel)"
@time F_sht_p = compute_forces_sht_parallel(X, r)
@time F_sht_p = compute_forces_sht_parallel(X, r)

@info "KD"
@time F_KD = compute_forces_kdtree(X)
@time F_KD = compute_forces_kdtree(X)

using LinearAlgebra
@assert sum(norm, F_sht_p) ≈ sum(norm, F_KD)

I think the runtime on GPUs is not too far off anymore. Next, I will implement proper spatial hash tables, as the current version just uses a fixed periodic grid.
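For reference, a common choice for such a hash (from Teschner et al., “Optimized Spatial Hashing for Collision Detection of Deformable Objects”, 2003) maps an integer cell coordinate to one of tablesize buckets. This is a generic sketch, not necessarily what the package will end up using:

# Map a 3D cell coordinate on an unbounded grid to a bucket in 1:tablesize.
function spatialhash(cell::NTuple{3,Int}, tablesize::Int)
    h = (cell[1] * 73856093) ⊻ (cell[2] * 19349663) ⊻ (cell[3] * 83492791)
    return mod(h, tablesize) + 1   # 1-based bucket index
end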

3 Likes

CellListMap.jl (without the s), otherwise the link is broken.

PS: You can avoid using the symmetry of the lists in CellListMap.jl by using a triclinic unit cell (providing the cell as a matrix). It takes almost exactly twice the time relative to orthorhombic cells. The parameters of CellListMap.jl are tuned for typical atomic or galaxy densities and the usual cutoffs used in these applications.
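For example, a minimal sketch (variable names are just for illustration, and the exact behaviour may differ between versions): the same cubic cell given as side lengths vs. given as a matrix, the latter being treated as a general (triclinic) cell as described above:

using CellListMap

cutoff = 0.05
# Side lengths -> orthorhombic cell: the (i,j)/(j,i) symmetry is exploited.
box_ortho = Box([1.0, 1.0, 1.0], cutoff)
# The same cell passed as a matrix -> treated as a general (triclinic) cell,
# so both (i,j) and (j,i) are visited, roughly doubling the work.
box_general = Box([1.0 0.0 0.0;
                   0.0 1.0 0.0;
                   0.0 0.0 1.0], cutoff)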

1 Like

Thanks for the hints. And also sorry for the typo (I edited it now)!