Fast Distance Histograms

I’m writing distance histograms in Julia, and I think I’ve hit a limit in the performance of my serial code. The problem is pretty simple: given a list of vectors (positions), compute the Euclidean distance between all unique pairs of those vectors, then bin these distances into linear ranges. There are a lot of fast ways of solving the problem, like:

using Tullio, StatsBase, LinearAlgebra   # for @tullio, fit(Histogram, ...) and LowerTriangular

const arr = rand(1:1.0:100, 2, 10000)
out = zeros(10^4, 10^4)
@tullio out[i,j] = sqrt((arr[1,i] - arr[1,j])^2 + (arr[2,i] - arr[2,j])^2)
fit(Histogram, vec(LowerTriangular(out)), bins)   # `bins` holds your linear bin edges

But these are ultimately memory hungry (the 10,000 × 10,000 Float64 distance matrix alone is ~800 MB), and fit(Histogram) is the limiting step.

Binning on the fly is possible though, as:

function dist1(r, histo, rmax, nbins)
    # r is an N×2 matrix of positions; distances below rmax are binned into nbins linear bins
    @inbounds @fastmath for i = 1:(size(r)[1]-1)
        for j = i+1:size(r)[1]
            d=0.
            @simd for k in 1:2
                tmp=r[i,k]-r[j,k]
                d+=tmp*tmp
            end
            d=sqrt(d)
            rIndex=ceil(Int64, d/rmax*nbins)  # linear bin index
            if rIndex>=nbins                  # distances falling in or beyond the last bin are discarded
                continue
            end
            histo[rIndex]+=1
        end
    end
    return histo
end

points = rand(1000, 2)
hist = zeros(Int, 100)
dist1(points, hist, 5, 100)

You can also write a comparable cross-correlation distance histogram which finds and bins the distances between two lists of vectors:

function dist2(r1, r2, histo, rmax::Int64, nbins::Int64)
    @inbounds @fastmath for i in 1:size(r1)[1]
        for j in 1:size(r2)[1]
            d::Float64 = hypot(r1[i,1]-r2[j,1], r1[i,2]-r2[j,2])
            rIndex=ceil(Int64, d/rmax*nbins)   # same linear binning as dist1
            if rIndex>=nbins
                continue
            end
            histo[rIndex]+=1
        end
    end
    return histo
end

And with these you can tile a large dist1 problem into a series of sub-problems split between dist1 and dist2. I’m not too concerned with this tiling (as this post is long enough), but I’d love to know if anybody knows a way of computing the results of dist1 and dist2 faster.
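
As a rough illustration of what I mean (the helper name tiled_hist and the block size b are just placeholders, not my actual blocked code): the diagonal blocks are self-comparisons handled by dist1, the off-diagonal blocks are cross-comparisons handled by dist2, and together they cover every unique pair exactly once.

function tiled_hist(points, histo, rmax, nbins; b = 100)
    N = size(points, 1)
    blocks = [i:min(i + b - 1, N) for i in 1:b:N]   # contiguous row ranges of (at most) b points
    for (bi, I) in enumerate(blocks)
        dist1(view(points, I, :), histo, rmax, nbins)              # pairs within block I
        for J in blocks[bi+1:end]                                  # pairs between blocks I and J
            dist2(view(points, I, :), view(points, J, :), histo, rmax, nbins)
        end
    end
    return histo
end

The threaded version further down is essentially this same decomposition, just with per-thread histograms.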

For reference on my PC, with 1000 2d vectors:

@btime dist1(points, hist, 5, 100)
2.392 ms (0 allocations: 0 bytes)

points2 = points.+1
@btime dist2(points, points2, hist, 5, 100)
4.660 ms (0 allocations: 0 bytes)

Worth noting that comparing a pair of identical points will crash this program (a zero distance gives rIndex = 0, which indexes histo out of bounds), but my problem doesn’t generate these points. Also happy to share my blocking approach to parallelise this.
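
For completeness: if coincident points could ever appear, a cheap guard is to skip zero distances before computing the bin index. A minimal sketch of that tweak (dist1_safe is not part of my actual code):

function dist1_safe(r, histo, rmax, nbins)
    @inbounds for i in 1:size(r, 1)-1, j in i+1:size(r, 1)
        d = hypot(r[i,1]-r[j,1], r[i,2]-r[j,2])
        d == 0 && continue                      # identical points would give rIndex = 0
        rIndex = ceil(Int, d/rmax*nbins)
        (rIndex < nbins) && (histo[rIndex] += 1)
    end
    return histo
end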


If rmax is significantly smaller than the average distance, you should probably use cell lists: Cell lists - Wikipedia

Spatially sorting the data and using a dual-tree counting algorithm is an excellent way of counting pairs in large collections. However, for smaller collections (1,000-10,000 points), the cost of building the tree and of the extra function calls exceeds the speedup. In either case, you need fast ways of computing dist1 and dist2.

Certain spatial distributions (like star clusters) work great with dual tree, but for approximately isotropically distributed points (my case), dual tree mostly falls through to its edge case (a cluster overlapping multiple bins), which is slow.


If I understand correctly, you need only the distances smaller than rmax. For that, cell lists are particularly useful. But of course there is an overhead, which may or may not be worthwhile. For roughly homogeneous distributions of 10k points I am pretty sure it is worthwhile if rmax is about 1/10 of the maximum distance (the difference between running a loop over ~100m pairs vs ~1m pairs).
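
To make the idea concrete, here is a hand-rolled 2D sketch (not the implementation from any package; I assume coordinates in [0,1) and an rmax that is small relative to the box, say 0.05, and the names are made up):

function celllist_hist(points, histo, rmax, nbins)
    N = size(points, 1)
    ncell = max(1, floor(Int, 1/rmax))                 # ncell × ncell grid, each cell at least rmax wide
    cells = [Int[] for _ in 1:ncell, _ in 1:ncell]
    for i in 1:N                                       # assign every point to its cell
        cx = min(ncell, floor(Int, points[i,1]*ncell) + 1)
        cy = min(ncell, floor(Int, points[i,2]*ncell) + 1)
        push!(cells[cx, cy], i)
    end
    rb = nbins/rmax
    for cx in 1:ncell, cy in 1:ncell, p in cells[cx, cy]
        for dx in -1:1, dy in -1:1                     # pairs closer than rmax must sit in neighbouring cells
            nx, ny = cx + dx, cy + dy
            (1 <= nx <= ncell && 1 <= ny <= ncell) || continue
            for q in cells[nx, ny]
                q > p || continue                      # count each unique pair exactly once
                d = hypot(points[p,1] - points[q,1], points[p,2] - points[q,2])
                rIndex = ceil(Int, d*rb)
                (0 < rIndex < nbins) && (histo[rIndex] += 1)
            end
        end
    end
    return histo
end

Each point is then compared only against points in its own and the 8 neighbouring cells, which is where the ~100m-pairs vs ~1m-pairs difference comes from.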

Anyway, to parallelize those loops I would probably try: GitHub - JuliaFolds/FLoops.jl: Fast sequential, threaded, and distributed for-loops for Julia—fold for humans™

I could not accelerate that by parallelizing (with everything I tried, the overhead was too high). But I found a slightly faster (and shorter) serial version at least:

function dist1_2(r, histo, rmax, nbins)
    rb = nbins/rmax
    @inbounds for i = 1:size(r)[1]-1
        for j = i+1:size(r)[1]
            d=sqrt((r[i,1]-r[j,1])^2+(r[i,2]-r[j,2])^2)
            rIndex=ceil(Int, d*rb)
            (rIndex < nbins) && (histo[rIndex] += 1)
        end
    end
    return histo
end
julia> hist0 = zeros(Int, 100);

julia> @btime dist1($points, hist, 5, 100) setup=(hist=copy(hist0));
  1.863 ms (0 allocations: 0 bytes)

julia> @btime dist1_2($points, hist, 5, 100) setup=(hist=copy(hist0));
  1.530 ms (0 allocations: 0 bytes)

julia> dist1(points, copy(hist0), 5, 100) == dist1_2(points, copy(hist0), 5, 100)
true




Thank you so much! Even a little bit of performance here is a huge help, and the general idea (hoist the nbins/rmax calculation out of the loops and eliminate the branch) is applicable to both functions.

I tried FLoops but the overhead is indeed too high. For posterity, parallelisation for me looks like:

function tile_dist(points::Array{Float64,2}, histo::Array{Int64,1}, rmax::Int64, nbins::Int64)
    bsize::Int64 = 100
    N = size(points)[1]
    histo .= 0
    if N <= bsize
        return dist1(points, histo, rmax, nbins)
    else
        histos = [zeros(Int64, length(histo)) for i in 1:Threads.nthreads()]
        Threads.@threads for i in 0:(div(N, bsize)-1)
            dist1(view(points, (i*bsize)+1:((i+1)*bsize), :),
                  view(histos, Threads.threadid())[1],
                  rmax,
                  nbins)
        end
        @sync for il in 1:bsize:N+1
            for jl in il+bsize:bsize:N-bsize+1
                Threads.@spawn dist2(view(points, il:il+bsize-1, :),
                        view(points, jl:jl+bsize-1, :),
                        view(histos, Threads.threadid())[1],
                        rmax,
                        nbins)
            end
        end
        for i in 1:Threads.nthreads() histo .+= view(histos, i)[1] end
    end
    return histo
end

points = rand(1000, 2)
hist = zeros(Int, 100)
@btime tile_dist(points, hist, 5, 100)
454.900 μs (360 allocations: 62.41 KiB)

Blocking the lower triangular matrix into a set of triangular blocks and a set of square blocks is a good all-purpose way of splitting most pairwise comparisons into more tractable problems. The computation here is missing the code to handle the residual part (the leftover points when N is not a multiple of the block size), so it’s a bit of an unfair comparison. I’m not sure where the extra allocations are coming from, though I presume it’s the Futures from @spawn. Not super impressed with the speedup; it might be limited by the overhead of spinning up threads.


Probably Chris Elrod can accelerate that 100 times with CheapThreads. But for mortals like me it is still not documented (the example tests are simple, though).


Well, after a bit of messing around, I have the quickest version yet, based on the code from @lmiq. I think with less threading overhead, this could be quicker still. I looked into CheapThreads, but it’s not too clear how to use it (yet). Hopefully more on this soon. My final version is:

function dist1_2(r, histo, rmax, nbins)
    rb = nbins/rmax
    @inbounds for i = 1:size(r)[1]-1
        for j = i+1:size(r)[1]
            d=sqrt((r[i,1]-r[j,1])^2+(r[i,2]-r[j,2])^2)
            rIndex=ceil(Int, d*rb)
            (rIndex < nbins) && (histo[rIndex] += 1)
        end
    end
    return histo
end

function dist2_2(r1, r2, histo, rmax::Int64, nbins::Int64)
    rb = nbins/rmax
    @inbounds for i in 1:size(r1)[1]
        for j in 1:size(r2)[1]
            d=sqrt((r1[i,1]-r2[j,1])^2+(r1[i,2]-r2[j,2])^2)
            rIndex=ceil(Int, d*rb)
            (rIndex < nbins) && (histo[rIndex] += 1)
        end
    end
    return histo
end

function tile_dist(points::Array{Float64,2}, histo::Array{Int64,1}, rmax::Int64, nbins::Int64)
    bsize::Int64 = 100
    N = size(points)[1]
    histo .= 0
    if N <= bsize
        return dist1_2(points, histo, rmax, nbins)
    else
        histos = [zeros(Int64, length(histo)) for i in 1:Threads.nthreads()]
        Threads.@threads for i in 0:(div(N, bsize)-1)
            dist1_2(view(points, (i*bsize)+1:((i+1)*bsize), :),
                    view(histos, Threads.threadid())[1],
                    rmax,
                    nbins)
        end
        @sync for il in 1:bsize:N+1
            for jl in il+bsize:bsize:N-bsize+1
                Threads.@spawn dist2_2(view(points, il:il+bsize-1, :),
                        view(points, jl:jl+bsize-1, :),
                        view(histos, Threads.threadid())[1],
                        rmax,
                        nbins)
            end
        end
        # residual: the trailing points that don't fill a complete block
        epoint = (N ÷ bsize)*bsize
        dist1_2(view(points, epoint+1:N, :),
                view(histos, 1)[1],
                rmax,
                nbins)
        dist2_2(view(points, epoint+1:N, :),
                view(points, 1:epoint, :),
                view(histos, Threads.threadid())[1],
                rmax,
                nbins)
        for i in 1:Threads.nthreads()
            histo .+= view(histos, i)[1]
        end
    end
    return histo
end

points = rand(1000, 2)
hist = zeros(Int, 100)

@btime tile_dist(points, hist, 5, 100)
345.900 μs (360 allocations: 62.41 KiB)

s = dist1(points, hist, 5, 100)
w = copy(tile_dist(points, hist, 5, 100))
w == s
true

Hope it’s useful for someone!
