# Is there a GPU optimized `Distances` package?

I wonder if there is a package that implements `Distances.jl`'s interface for the `CUDA.AbstractGPUArray` type to leverage GPU acceleration. I found these issues:

From them, it seems that it’s impossible (or at least improper) to introduce CUDA-specific types into `Distances.jl`, even with `Requires`. So is there perhaps a package, in the style of `DistributionsAD.jl`, that extends `Distances.jl`'s interface with CUDA-specific methods?

For example, I wrote the following code to speed up my Haversine distance computation:

```julia
using Distances
import Distances: pairwise
using CUDA

function pairwise(dist::Haversine, x::CUDA.AbstractGPUArray{Float32, 2}, y::CUDA.AbstractGPUArray{Float32, 2})
    # x, y: (2, n) GPU arrays of (longitude, latitude) pairs in degrees

    # longitudes
    λ₁ = deg2rad.(@view x[1, :])
    λ₂ = deg2rad.(@view y[1, :])
    Δλ = λ₂ .- λ₁'

    # latitudes
    φ₁ = deg2rad.(@view x[2, :])
    φ₂ = deg2rad.(@view y[2, :])
    Δφ = φ₂ .- φ₁'

    # haversine formula
    a = sin.(Δφ ./ 2) .^ 2 .+ cos.(φ₂) .* cos.(φ₁') .* sin.(Δλ ./ 2) .^ 2

    # distance on the sphere
    return 2 .* dist.radius .* asin.(min.(sqrt.(a), fill!(similar(a), 1)))  # clamp at 1 to avoid asin domain errors from floating-point noise
end
```

While I haven’t attempted any deep optimization, the performance boost already looks huge:

```julia
using BenchmarkTools

hd = Haversine(100)

tx2 = randn(2, 10000)
cu_tx2 = cu(tx2)

# compile the methods first and check correctness
pairwise(hd, tx2, tx2) ≈ collect(pairwise(hd, cu_tx2, cu_tx2))
# true

@benchmark pairwise(hd, tx2, tx2)
#=
BenchmarkTools.Trial:
memory estimate:  762.94 MiB
allocs estimate:  3
--------------
minimum time:     3.525 s (0.04% GC)
median time:      3.532 s (0.03% GC)
mean time:        3.532 s (0.03% GC)
maximum time:     3.539 s (0.03% GC)
--------------
samples:          2
evals/sample:     1
=#

@benchmark pairwise(hd, cu_tx2, cu_tx2)
#=
BenchmarkTools.Trial:
memory estimate:  17.45 KiB
allocs estimate:  505
--------------
minimum time:     1.789 ms (0.00% GC)
median time:      39.941 ms (0.00% GC)
mean time:        39.150 ms (0.90% GC)
maximum time:     87.350 ms (2.36% GC)
--------------
samples:          128
evals/sample:     1
=#
```

Does your function definition really require CUDA? I would expect that replacing `CUDA.AbstractGPUArray{Float32, 2}` with `AbstractMatrix` gives exactly the same results and performance.

Dispatching on `AbstractMatrix` would override the original implementation in `Distances.jl` (and I guess the performance would be worse than the original CPU implementation). But it’s possible to just `using GPUArrays` instead of `using CUDA`, so that `GPUArrays.JLArrays` can be used to test the code in pure CPU mode.
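For example (a sketch, assuming the `pairwise` method above is declared for `GPUArrays.AbstractGPUArray{Float32, 2}` instead of `CUDA.AbstractGPUArray{Float32, 2}`; depending on the GPUArrays version, `JLArray` lives in `GPUArrays.JLArrays` or in the separate `JLArrays.jl` package):

```julia
using Distances, GPUArrays
using GPUArrays.JLArrays  # on newer versions: using JLArrays

hd = Haversine(100)
tx = randn(Float32, 2, 1_000)
jl_tx = JLArray(tx)  # reference AbstractGPUArray backed by a plain CPU Array

# exercises the broadcast GPU code path without CUDA or a GPU present
pairwise(hd, tx, tx) ≈ collect(pairwise(hd, jl_tx, jl_tx))
```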

Such a package would be useful, as I have noticed someone trying to add GPU support in a similar style. If a package like `DistancesGPU.jl` existed, one would just `using` it and pass a `GPUArray` to the normal `Distances.jl` functions.
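To make the idea concrete, here is a rough sketch of what such a glue package could look like; `DistancesGPU` is hypothetical, and the method body is just the broadcast kernel from above, dispatched on the backend-agnostic `AbstractGPUArray`:

```julia
# Hypothetical sketch -- no such package exists yet.
module DistancesGPU

using GPUArrays
using Distances: Haversine
import Distances: pairwise  # extend the normal Distances.jl function

# Loaded via `using DistancesGPU`, this method is picked whenever
# Distances.pairwise is handed GPU arrays; plain Arrays keep the
# original CPU implementation.
function pairwise(dist::Haversine, x::AbstractGPUArray{<:Real, 2},
                  y::AbstractGPUArray{<:Real, 2})
    λ₁ = deg2rad.(@view x[1, :]);  φ₁ = deg2rad.(@view x[2, :])
    λ₂ = deg2rad.(@view y[1, :]);  φ₂ = deg2rad.(@view y[2, :])
    a = sin.((φ₂ .- φ₁') ./ 2) .^ 2 .+ cos.(φ₂) .* cos.(φ₁') .* sin.((λ₂ .- λ₁') ./ 2) .^ 2
    return 2 .* dist.radius .* asin.(min.(sqrt.(a), one(eltype(a))))
end

end # module
```

With something like that in place, `using DistancesGPU` followed by `Distances.pairwise(Haversine(100), cu_x, cu_y)` on `CuArray`s would hit this method, since `CuArray <: AbstractGPUArray`.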

It’s possible that Distances should only apply its current methods to things fairly close to `::Array`, and then broadcast methods like this one for `::AbstractArray` could be added there.

Yours seems to be quicker than what’s in Distances on the CPU, and can be made faster. You could probably avoid some more trig by re-writing in terms of `cos(φ/2)`.

```julia
julia> function pairwise_bc2(dist::Haversine, x, y)
       @views Δλ = deg2rad.(y[1,:] .- x[1,:]')
       @views φ₁ = deg2rad.(x[2,:])
       @views φ₂ = deg2rad.(y[2,:])
       a = Δλ .= sin.((φ₂ .- φ₁')./2).^2 .+ (cos.(φ₂)) .* (cos.(φ₁))' .* sin.(Δλ./2) .^ 2
       return Δλ .= 2 .* dist.radius .* asin.(min.(sqrt.(a), 1))
       end;

julia> tx3 = randn(2, 2_000);

julia> @btime pairwise_bc1($hd, $tx3, $tx3);  # the broadcast version from the question
122.366 ms (16 allocations: 152.68 MiB)

julia> @btime pairwise_bc2($hd, $tx3, $tx3);
88.243 ms (5 allocations: 30.56 MiB)

julia> @btime Distances.pairwise($hd, $tx3, $tx3);
166.760 ms (3 allocations: 30.52 MiB)

julia> Distances.pairwise(hd, tx3, tx3) ≈ pairwise_bc2(hd, tx3, tx3)
true
```
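For reference, the trig reduction suggested above can lean on the half-angle identity `sin(θ/2)^2 == (1 - cos(θ))/2`, which trades the two `sin.(… ./ 2).^2` broadcasts for plain `cos` calls; whether that is exactly the `cos(φ/2)` rewrite meant here, I’m not sure. A sketch (not benchmarked):

```julia
# Sketch: same formula with sin(θ/2)^2 replaced via the half-angle identity.
function pairwise_bc3(dist::Haversine, x, y)
    @views Δλ = deg2rad.(y[1, :] .- x[1, :]')
    @views φ₁ = deg2rad.(x[2, :])
    @views φ₂ = deg2rad.(y[2, :])
    a = Δλ .= (1 .- cos.(φ₂ .- φ₁')) ./ 2 .+ cos.(φ₂) .* cos.(φ₁') .* (1 .- cos.(Δλ)) ./ 2
    return Δλ .= 2 .* dist.radius .* asin.(min.(sqrt.(a), 1))
end
```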