Getting the benchmarked speedup from Distances.jl

I am studying the possibility of using Distances.jl to accelerate the computation of Euclidean distances between atom coordinates in a package, but I am not getting something right, because my simple loops perform better than the implementation in Distances, and from the benchmarks I was expecting a significant speedup in this example:

using BenchmarkTools
using Distances

x = rand(3,10000)
y = rand(3,10000)

@inbounds function sumd(x,y)
  sumd = 0.
  for i in 1:size(x,2)
    sumd += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2+(x[3,i]-y[3,i])^2)
  end
  return sumd
end

sumd(x,y)
println(" simple loop: ")
@btime s1 = sumd($x,$y)

sumd2(x,y) = sum(colwise(Euclidean(),x,y))

sumd2(x,y)
println(" using colwise: ")
@btime s2 = sumd2($x,$y)

println("Results are the same:", s1 ≈ s2)

Results:

julia> include("./teste.jl")
 simple loop: 
  20.678 μs (0 allocations: 0 bytes)
 using colwise: 
  30.844 μs (2 allocations: 78.20 KiB)
Results are the same:true

Any ideas? Thank you.

edit: The inbounds flags does not make much of a difference there.

colwise allocates a bunch of memory, which slows things down. You can get a minor improvement using Tullio, e.g.

using Tullio
function sumd3(x,y)
    @tullio threads=false rv = sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2+(x[3,i]-y[3,i])^2)
end

On my machine the gain is about 10%.

1 Like

Can you change the data layout? Right now, you’re basically using an array of structs
(colum-major storage in Julia). If you change that to a struct of arrays style by
transposing x and y, you can really make use of SIMD and gain a significant speedup.
For example, I get

julia> using BenchmarkTools, Test, Tullio

julia> N = 10_000; x = randn(3, N); y = randn(3, N); xt = Array(x'); yt = Array(y');

julia> @inbounds function sumd(x,y)
         sumd = 0.
         for i in 1:size(x,2)
           sumd += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2+(x[3,i]-y[3,i])^2)
         end
         return sumd
       end
sumd (generic function with 1 method)

julia> function sumd4(xt, yt)
           @tullio threads=false rv = sqrt((xt[i,1]-yt[i,1])^2 + (xt[i,2]-yt[i,2])^2 + (xt[i,3]-yt[i,3])^2)
       end
sumd4 (generic function with 1 method)

julia> @test sumd(x, y) ≈ sumd4(xt, yt)
Test Passed

julia> @benchmark sumd($x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.193 μs (0.00% GC)
  median time:      17.236 μs (0.00% GC)
  mean time:        17.497 μs (0.00% GC)
  maximum time:     48.158 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark sumd4($xt, $yt)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     6.335 μs (0.00% GC)
  median time:      6.345 μs (0.00% GC)
  mean time:        6.416 μs (0.00% GC)
  maximum time:     17.022 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5
2 Likes

The actual code is a somewhat more complicated (I have to compute all distances withing a cutoff of every other atom - typical stuff of atom simulations - and I am using a linked cell method avoid looping over all atoms). Since I have to annotate which atoms belong to each region in space and then pass the coordinates of these atoms to the routines to compute distances, I use a lot of views of the 3-coordinates of each atom, and the column-based layout is better for that, because the atom coordinates are contiguous in memory in the views. Thus, from a global point of view of the code, keeping the current layout is better. I actually changed the layout one time (on the opposite direction), and got 40% speedup from using a column-based storage.

However, the “Distances” package (and the performance tips of the docs) suggest that with the appropriate data structure one could get 100-fold speedup on the computation of distances between sets, by the use of BLAS optimized routines.

If I manage to get such a speedup with a specific data structure, I will consider rethinking the layout of the code to adapt to that. The example above seemed to me the simplest case to start with, and I do not get any speedup from using the Distances routines, so I am probably not figuring out how to use these functions to achieve the speedup they get.

1 Like

ps. Although I do not think that it will adapt to my case, thank you both to let me know about the Tullio stuff.

1 Like

Your original loop version would benefit from transposing x and y, and using a SIMD-able sqrt, e.g. Base.FastMath.squrt_fast or Base.sqrt_llvm. This is basically the same code as the Tullio version:

function sumd5(x,y)
  sumd = 0.
  @inbounds @fastmath for i in 1:size(x,2)
    sumd += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2+(x[i,3]-y[i,3])^2)
  end
  return sumd
end
@benchmark sumd5($xt,$yt)

Benchmarks:

julia> N = 10_000; x = randn(3, N); y = randn(3, N); xt = Array(x'); yt = Array(y');

julia> @btime sumd($x, $y)
  17.430 μs (0 allocations: 0 bytes)
22558.144176093472

julia> @btime sumd2($x, $y)
  23.209 μs (2 allocations: 78.20 KiB)
22558.14417609351

julia> @btime sumd4($xt, $yt)
  7.397 μs (1 allocation: 96 bytes)
22558.144176093512

julia> @btime sumd5($xt, $yt)
  7.356 μs (0 allocations: 0 bytes)
22558.144176093512
1 Like

An array of SVector{3} would probably be easier and faster to work with.

3 Likes

Thank you all again. I will specify a little bit further the problem, with another example. The sets of atomic coordinates for which the distances have to be computed are subsets of the total set of atomic coordinates. Essentially random sets (because I will compute the distances between atoms which are closer to each other than a specified cutoff, and this is essentially random given the output of a simulation). Thus, the sets of coordinates that are actually being considered for the distance calculations are non-contiguous subsets of the total number of atoms. For this reason, the column-based approach is faster than the row-based approach, we get more continuity by being sure that at least the coordinates of each atom are contiguous in memory (in the example now I do not show the benchmark of the row-based approach, which is worse than that of the column-based one).

I have rewritten the example in a more realistic way. Both fastmath and SVectors accelerate the calculations (thank you for the suggestions). Restructuring the code to use StaticArrays wherever possible will take some time, but I will try that.

I still wander if using BLAS within the Distances packages does not have any role here. I couldn’t get any improvement in performance from using its routines yet.

The current set of tests is the following.

EDIT: Here I just copy the subsets to get contiguous sets
in memory: I will post the results without copying in a minute.


using Test
using BenchmarkTools
using StaticArrays


# Data and size of sets

n = 10_000
xall = rand(3,n)
yall = rand(3,n)

# random subsets of 500 x points and 1000 y points will be considered
nset_x = 500
nset_y = 1000

println("----------------\n COLUMN BASED:\n ---------------- ")

x = copy(xall[:,collect(rand(1:n,nset_x))]) # copying here to be "fair" with the next text
y = copy(yall[:,collect(rand(1:n,nset_y))])

function sumd(x,y)
  sumd = 0.
  for i in 1:size(x,2)
    for j in 1:size(y,2)
      sumd += sqrt((x[1,i]-y[1,j])^2+(x[2,i]-y[2,j])^2+(x[3,i]-y[3,j])^2)
    end
  end
  return sumd
end

function sumd2(x,y)
  sumd = 0.
  @inbounds @fastmath for i in 1:size(x,2)
    for j in 1:size(y,2)
      sumd += sqrt((x[1,i]-y[1,j])^2+(x[2,i]-y[2,j])^2+(x[3,i]-y[3,j])^2)
    end
  end
  return sumd
end

s1 = sumd(x,y)
s2 = sumd2(x,y)

println(" simple loop: ")
@btime sumd($x,$y)

println(" inbounds, fast-math: ")
@btime sumd2($x,$y)

println("----------------\n Using StaticArrays:\n ---------------- ")

sx = [ SVector{3,Float64}(x[1:3,i]...) for i in 1:size(x,2) ]
sy = [ SVector{3,Float64}(y[1:3,i]...) for i in 1:size(y,2) ]

function sumd_SA(x,y)
  sumd = 0.
  for i in 1:length(x)
    for j in 1:length(y)
      sumd += sqrt((x[i][1]-y[j][1])^2+(x[i][2]-y[j][2])^2+(x[i][3]-y[j][3])^2)
    end
  end
  return sumd
end

function sumd2_SA(x,y)
  sumd = 0.
  @inbounds @fastmath for i in 1:length(x)
    for j in 1:length(y)
      sumd += sqrt((x[i][1]-y[j][1])^2+(x[i][2]-y[j][2])^2+(x[i][3]-y[j][3])^2)
    end
  end
  return sumd
end

s3 = sumd_SA(sx,sy)
s4 = sumd2_SA(sx,sy)
@test s1 ≈ s2 ≈ s3 ≈ s4

println(" simple loop: ")
@btime sumd_SA($sx,$sy)

println(" inbounds, fast-math: ")
@btime sumd2_SA($sx,$sy)

 

Results:

 COLUMN BASED:
 ---------------- 
 simple loop: 
  1.006 ms (0 allocations: 0 bytes)
 fast-math: 
  751.763 μs (0 allocations: 0 bytes)
----------------
 Using StaticArrays:
 ---------------- 
 simple loop: 
  752.988 μs (0 allocations: 0 bytes)
 fast-math: 
  551.326 μs (0 allocations: 0 bytes)

It seems that I might get a ~50% speedup using these features appropriately. It is worthwhile the try.

If the subsets are views of the complete subset (which is the case in my problem, except if I decide to copy the data), the use of StaticArrays is less important:


using Test
using BenchmarkTools
using Distances
using StaticArrays


# Data and size of sets

n = 10_000
xall = rand(3,n)
yall = rand(3,n)

nset_x = 500
nset_y = 1000

setx = collect(rand(1:n,nset_x))
sety = collect(rand(1:n,nset_y))

println("----------------\n COLUMN BASED:\n ---------------- ")

x = @view(xall[:,setx])
y = @view(yall[:,sety])

function sumd(x,y)
  sumd = 0.
  for i in 1:size(x,2)
    for j in 1:size(y,2)
      sumd += sqrt((x[1,i]-y[1,j])^2+(x[2,i]-y[2,j])^2+(x[3,i]-y[3,j])^2)
    end
  end
  return sumd
end

function sumd2(x,y)
  sumd = 0.
  @inbounds @fastmath for i in 1:size(x,2)
    for j in 1:size(y,2)
      sumd += sqrt((x[1,i]-y[1,j])^2+(x[2,i]-y[2,j])^2+(x[3,i]-y[3,j])^2)
    end
  end
  return sumd
end

s1 = sumd(x,y)
s2 = sumd2(x,y)

println(" simple loop: ")
@btime sumd($x,$y)

println(" inbounds, fast-math: ")
@btime sumd2($x,$y)

println("----------------\n Using StaticArrays:\n ---------------- ")

sallx = [ SVector{3,Float64}(xall[1:3,i]...) for i in 1:size(xall,2) ]
sally = [ SVector{3,Float64}(yall[1:3,i]...) for i in 1:size(yall,2) ]

sx = @view(sallx[setx])
sy = @view(sally[sety])

function sumd_SA(x,y)
  sumd = 0.
  for i in 1:length(x)
    for j in 1:length(y)
      sumd += sqrt((x[i][1]-y[j][1])^2+(x[i][2]-y[j][2])^2+(x[i][3]-y[j][3])^2)
    end
  end
  return sumd
end

function sumd2_SA(x,y)
  sumd = 0.
  @inbounds @fastmath for i in 1:length(x)
    for j in 1:length(y)
      sumd += sqrt((x[i][1]-y[j][1])^2+(x[i][2]-y[j][2])^2+(x[i][3]-y[j][3])^2)
    end
  end
  return sumd
end

s3 = sumd_SA(sx,sy)
s4 = sumd2_SA(sx,sy)
@test s1 ≈ s2 ≈ s3 ≈ s4

println(" simple loop: ")
@btime sumd_SA($sx,$sy)

println(" inbounds, fast-math: ")
@btime sumd2_SA($sx,$sy)

     

Results:

----------------
 COLUMN BASED:
 ---------------- 
 simple loop: 
  1.488 ms (0 allocations: 0 bytes)
 inbounds, fast-math: 
  691.384 μs (0 allocations: 0 bytes)
----------------
 Using StaticArrays:
 ---------------- 
 simple loop: 
  1.157 ms (0 allocations: 0 bytes)
 inbounds, fast-math: 
  709.758 μs (0 allocations: 0 bytes)