I wonder if there is a package which implements `Distances.jl`

's interface with `CUDA.AbstractGPUArray`

type to leverage GPU acceleration. I found those issues:

From them, it seems that it’s impossible (improper) to introduce CUDA specific type to `Distances.jl`

even with `Requires`

. So maybe is there a package like `DistributionsAD.jl`

which extends CUDA specific methods with `Distances.jl`

's interface?

For example, I wrote following code to speed up my Haversine distance computing:

```
using Distances
import Distances: pairwise
using CUDA
function pairwise(dist::Haversine, x::CUDA.AbstractGPUArray{Float32, 2}, y::CUDA.AbstractGPUArray{Float32, 2})
# x, y: (2, n) GPU Array
# longtitudes
Δλ = deg2rad.(y[1,:] .- x[1,:]')
# latitudes
φ₁ = deg2rad.(x[2,:])
φ₂ = deg2rad.(y[2,:])
Δφ = φ₂ .- φ₁'
# haversine formula
a = sin.(Δφ./2).^2 .+ cos.(φ₂) .* cos.(φ₁') .* sin.(Δλ./2) .^ 2
# distance on the sphere
return 2 .* dist.radius .* asin.(min.(sqrt.(a), fill!(similar(a), 1))) # take care of floating point errors
end
```

While I don’t consider any deep optimization, the performance boosting looks huge:

```
using BenchmarkTools
hd = Haversine(100)
tx2 = randn(2, 10000)
cu_tx2 = cu(tx2)
# precompile and ensure the correctness
pairwise(hd, tx2, tx2) ≈ collect(pairwise(hd, cu_tx2, cu_tx2))
# true
@benchmark pairwise(hd, tx2, tx2)
#=
BenchmarkTools.Trial:
memory estimate: 762.94 MiB
allocs estimate: 3
--------------
minimum time: 3.525 s (0.04% GC)
median time: 3.532 s (0.03% GC)
mean time: 3.532 s (0.03% GC)
maximum time: 3.539 s (0.03% GC)
--------------
samples: 2
evals/sample: 1
=#
@benchmark pairwise(hd, cu_tx2, cu_tx2)
#=
BenchmarkTools.Trial:
memory estimate: 17.45 KiB
allocs estimate: 505
--------------
minimum time: 1.789 ms (0.00% GC)
median time: 39.941 ms (0.00% GC)
mean time: 39.150 ms (0.90% GC)
maximum time: 87.350 ms (2.36% GC)
--------------
samples: 128
evals/sample: 1
=#
```