Is there a GPU optimized `Distances` package?

I wonder if there is a package that implements Distances.jl's interface for the CUDA.AbstractGPUArray type to leverage GPU acceleration. I found these issues:

From them, it seems that it's impossible (or at least improper) to introduce a CUDA-specific type into Distances.jl, even with Requires. So is there perhaps a package, in the style of DistributionsAD.jl, that extends Distances.jl's interface with CUDA-specific methods?

For example, I wrote the following code to speed up my Haversine distance computation:

using Distances
import Distances: pairwise
using CUDA

function pairwise(dist::Haversine, x::CUDA.AbstractGPUArray{Float32, 2}, y::CUDA.AbstractGPUArray{Float32, 2})
    # x, y: (2, n) GPU Array
    
    # longitudes
    Δλ = deg2rad.(y[1,:] .- x[1,:]')
    
    # latitudes
    φ₁ = deg2rad.(x[2,:])
    φ₂ = deg2rad.(y[2,:])
    
    Δφ = φ₂ .- φ₁'
    
    # haversine formula
    a = sin.(Δφ./2).^2 .+  cos.(φ₂) .* cos.(φ₁') .* sin.(Δλ./2) .^ 2
    
    # distance on the sphere
    return 2 .* dist.radius .* asin.(min.(sqrt.(a), fill!(similar(a), 1))) # take care of floating point errors
end

While I haven't attempted any deep optimization, the performance boost looks huge:

using BenchmarkTools

hd = Haversine(100)

tx2 = randn(2, 10000)
cu_tx2 = cu(tx2)

# force compilation and check correctness
pairwise(hd, tx2, tx2) ≈ collect(pairwise(hd, cu_tx2, cu_tx2))
# true

@benchmark pairwise(hd, tx2, tx2)
#=
BenchmarkTools.Trial: 
  memory estimate:  762.94 MiB
  allocs estimate:  3
  --------------
  minimum time:     3.525 s (0.04% GC)
  median time:      3.532 s (0.03% GC)
  mean time:        3.532 s (0.03% GC)
  maximum time:     3.539 s (0.03% GC)
  --------------
  samples:          2
  evals/sample:     1
=#

@benchmark pairwise(hd, cu_tx2, cu_tx2)
#=
BenchmarkTools.Trial: 
  memory estimate:  17.45 KiB
  allocs estimate:  505
  --------------
  minimum time:     1.789 ms (0.00% GC)
  median time:      39.941 ms (0.00% GC)
  mean time:        39.150 ms (0.90% GC)
  maximum time:     87.350 ms (2.36% GC)
  --------------
  samples:          128
  evals/sample:     1
=#

Does your function definition really require CUDA? I would expect that replacing CUDA.AbstractGPUArray{Float32, 2} with AbstractMatrix gives exactly the same results and performance.

A method on AbstractMatrix would override Distances.jl's original implementation (and I guess the performance would be worse than the original implementation on the CPU). But it's possible to depend on just GPUArrays instead of CUDA, so that GPUArrays.JLArrays can be used to test the code in pure CPU mode.
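To illustrate the point that the broadcast implementation only assumes AbstractMatrix, here is a minimal self-contained sketch in base Julia (no Distances.jl, no GPU): `haversine_ref` and `pairwise_bc` are illustrative names, not from any package, and the broadcast version is checked against a scalar reference on plain CPU arrays.

```julia
# Scalar reference Haversine distance (radius r); points are (lon, lat) in degrees.
function haversine_ref(lon1, lat1, lon2, lat2, r)
    Δλ = deg2rad(lon2 - lon1)
    φ1, φ2 = deg2rad(lat1), deg2rad(lat2)
    a = sin((φ2 - φ1) / 2)^2 + cos(φ1) * cos(φ2) * sin(Δλ / 2)^2
    return 2r * asin(min(sqrt(a), one(a)))
end

# Broadcast version over any AbstractMatrix of shape (2, n):
# no scalar indexing in the hot path, so a GPU array would work too.
function pairwise_bc(x::AbstractMatrix, y::AbstractMatrix, r)
    Δλ = deg2rad.(y[1, :] .- x[1, :]')
    φ₁ = deg2rad.(x[2, :])
    φ₂ = deg2rad.(y[2, :])
    a = sin.((φ₂ .- φ₁') ./ 2) .^ 2 .+ cos.(φ₂) .* cos.(φ₁') .* sin.(Δλ ./ 2) .^ 2
    return 2 .* r .* asin.(min.(sqrt.(a), 1))
end

# Plain CPU matrices; columns are (lon, lat) points.
x = [10.0 20.0; 50.0 55.0]
y = [30.0 40.0; 60.0 45.0]
D = pairwise_bc(x, y, 6371.0)   # D[j, i] = distance(x[:, i], y[:, j])
```

The same `pairwise_bc` body would run unchanged on a CuArray or a JLArray, which is exactly what makes a CPU-only test of the GPU code path possible.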

Such a package would be useful, as I have noticed someone trying to add GPU support in a similar style. If a package like DistancesGPU.jl existed, one could simply load it and pass a GPUArray to the normal Distances.jl functions.

It’s possible that Distances should only call its methods on things fairly close to ::Array, and then methods like this for ::AbstractArray could be added there.
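That dispatch layering could look roughly like the following hypothetical sketch (toy squared-Euclidean distance, names made up for illustration): a broadcast-only fallback for any AbstractMatrix, which GPU arrays would hit, plus a specialized tight-loop method for plain ::Matrix.

```julia
# Generic fallback: only broadcasting, reductions, and matmul, so any
# AbstractMatrix (including GPU arrays) can use it.
function pairwise_sqeuclidean(x::AbstractMatrix, y::AbstractMatrix)
    xx = sum(abs2, x; dims = 1)        # 1 × n
    yy = sum(abs2, y; dims = 1)        # 1 × m
    return xx' .+ yy .- 2 .* (x' * y)  # n × m
end

# Specialized method for plain CPU matrices: explicit loops, no large temporaries.
function pairwise_sqeuclidean(x::Matrix, y::Matrix)
    n, m = size(x, 2), size(y, 2)
    D = zeros(promote_type(eltype(x), eltype(y)), n, m)
    for j in 1:m, i in 1:n
        s = zero(eltype(D))
        @inbounds for k in 1:size(x, 1)
            s += (x[k, i] - y[k, j])^2
        end
        D[i, j] = s
    end
    return D
end
```

With this layout, `pairwise_sqeuclidean(rand(3, 4), rand(3, 5))` dispatches to the loop version on CPU, while a GPU array falls through to the broadcast fallback; `invoke` can be used to test the fallback on CPU matrices.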

Yours seems to be quicker than what's in Distances on the CPU, and it can be made faster still. You could probably avoid some more trig by rewriting in terms of cos(φ/2).

julia> function pairwise_bc2(dist::Haversine, x, y)
           @views Δλ = deg2rad.(y[1,:] .- x[1,:]')
           φ₁ = deg2rad.(@view x[2,:])
           φ₂ = deg2rad.(@view y[2,:])
           a = Δλ .= sin.((φ₂ .- φ₁')./2).^2 .+  (cos.(φ₂)) .* (cos.(φ₁))' .* sin.(Δλ./2) .^ 2
           return Δλ .= 2 .* dist.radius .* asin.(min.(sqrt.(a), 1))
       end;

julia> tx3 = randn(2, 2_000);

julia> @btime pairwise_bc1($hd, $tx3, $tx3); # as in Q
  122.366 ms (16 allocations: 152.68 MiB)

julia> @btime pairwise_bc2($hd, $tx3, $tx3);
  88.243 ms (5 allocations: 30.56 MiB)

julia> @btime Distances.pairwise($hd, $tx3, $tx3);
  166.760 ms (3 allocations: 30.52 MiB)

julia> Distances.pairwise(hd, tx3, tx3) ≈ pairwise_bc2(hd, tx3, tx3)
true
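One concrete reading of the trig-saving hint above (not necessarily the exact rewrite meant there) is the half-angle identity sin(x/2)² = (1 − cos(x))/2, which expresses the haversine bracket purely in cosines of the full angle differences. A quick numerical sketch:

```julia
# Original bracket from the haversine formula.
hav_a(φ1, φ2, Δλ) = sin((φ2 - φ1) / 2)^2 + cos(φ1) * cos(φ2) * sin(Δλ / 2)^2

# Rewritten via sin(x/2)^2 == (1 - cos(x)) / 2: no halving, no squaring of sines.
hav_a_cos(φ1, φ2, Δλ) =
    (1 - cos(φ2 - φ1)) / 2 + cos(φ1) * cos(φ2) * (1 - cos(Δλ)) / 2
```

Caveat: the (1 − cos x)/2 form loses precision for very small angles, which is precisely the numerical issue the haversine form was invented to avoid, so it trades accuracy at short distances for fewer trig calls.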