I’m writing distance histograms in Julia, and I think I’ve hit a limit in the performance of my serial code. The problem is pretty simple - given a list of vectors (positions), compute the distance (euclidian) between all unique pairs that can be made between those vectors, then bin these distances into linear ranges. There’s a lot of fast ways of solving the problem, like:

```
const arr = rand(1:1.0:100, 2, 10000)
out = zeros(10^4, 10^4)
@tullio out[i,j] = sqrt((arr[1,i] - arr[1,j])^2 + (arr[2,i] - arr[2,j])^2)
fit(Histogram, vec(LowerTriangular(out)), bins)
```

But these ultimately are memory hungry, and fit(Histogram) is the limiting step.

Binning on the fly is possible though, as:

```
function dist1(r, histo, rmax, nbins)
@inbounds @fastmath for i = 1:(size(r)[1]-1)
for j = i+1:size(r)[1]
d=0.
@simd for k in 1:2
tmp=r[i,k]-r[j,k]
d+=tmp*tmp
end
d=sqrt(d)
rIndex=ceil(Int64, d/rmax*nbins)
if rIndex>=nbins
continue
end
histo[rIndex]+=1
end
end
return histo
end
points = rand(1000, 2)
hist = zeros(Int, 100)
dist1(points, hist, 5, 100)
```

You can also write a comparable cross-correlation distance histogram which finds and bins the distances between two lists of vectors:

```
function dist2(r1, r2, histo, rmax::Int64, nbins::Int64)
@inbounds @fastmath for i in 1:size(r1)[1]
for j in 1:size(r2)[1]
d::Float64 = hypot(r1[i,1]-r2[j,1], r1[i,2]-r2[j,2])
rIndex=ceil(Int64, d/(rmax*nbins))
if rIndex>=nbins
continue
end
histo[rIndex]+=1
end
end
return histo
end
```

And with these you can tile a large dist1 problem into a series of problems split between dist1 and dist2. I’m not too concerned with this tiling (as this post is long enough), but I’d love to know if anybody knows a way computing the result from dist1 and dist2 faster

For reference on my PC, with 1000 2d vectors:

```
@btime dist1(points, hist, 5, 100)
2.392 ms (0 allocations: 0 bytes)
points2 = points.+1
@btime dist2(points, points2, hist, 5, 100)
4.660 ms (0 allocations: 0 bytes)
```

Worth noting comparing a pair of identical points will crash this program, but my problem doesn’t generate these points. Also happy to share my blocking approach to parallelise this.