Is there a GPU optimized `Distances` package?

I wonder if there is a package that implements Distances.jl's interface for the CUDA.AbstractGPUArray type to leverage GPU acceleration. I found these issues:

From them, it seems that it's impossible (or at least improper) to introduce a CUDA-specific type into Distances.jl, even with Requires. So is there perhaps a package, in the style of DistributionsAD.jl, that extends Distances.jl's interface with CUDA-specific methods?

For example, I wrote the following code to speed up my Haversine distance computation:

using Distances
import Distances: pairwise
using CUDA

function pairwise(dist::Haversine, x::CUDA.AbstractGPUArray{Float32, 2}, y::CUDA.AbstractGPUArray{Float32, 2})
    # x, y: (2, n) GPU Array
    
    # longitudes
    Δλ = deg2rad.(y[1,:] .- x[1,:]')
    
    # latitudes
    φ₁ = deg2rad.(x[2,:])
    φ₂ = deg2rad.(y[2,:])
    
    Δφ = φ₂ .- φ₁'
    
    # haversine formula
    a = sin.(Δφ./2).^2 .+  cos.(φ₂) .* cos.(φ₁') .* sin.(Δλ./2) .^ 2
    
    # distance on the sphere
    return 2 .* dist.radius .* asin.(min.(sqrt.(a), fill!(similar(a), 1))) # take care of floating point errors
end

While I haven't attempted any deep optimization, the performance boost looks huge:

using BenchmarkTools

hd = Haversine(100)

tx2 = randn(2, 10000)
cu_tx2 = cu(tx2)

# force compilation and check correctness
pairwise(hd, tx2, tx2) ≈ collect(pairwise(hd, cu_tx2, cu_tx2))
# true

@benchmark pairwise(hd, tx2, tx2)
#=
BenchmarkTools.Trial: 
  memory estimate:  762.94 MiB
  allocs estimate:  3
  --------------
  minimum time:     3.525 s (0.04% GC)
  median time:      3.532 s (0.03% GC)
  mean time:        3.532 s (0.03% GC)
  maximum time:     3.539 s (0.03% GC)
  --------------
  samples:          2
  evals/sample:     1
=#

@benchmark pairwise(hd, cu_tx2, cu_tx2)
#=
BenchmarkTools.Trial: 
  memory estimate:  17.45 KiB
  allocs estimate:  505
  --------------
  minimum time:     1.789 ms (0.00% GC)
  median time:      39.941 ms (0.00% GC)
  mean time:        39.150 ms (0.90% GC)
  maximum time:     87.350 ms (2.36% GC)
  --------------
  samples:          128
  evals/sample:     1
=#

Does your function definition really require CUDA? I would expect that replacing CUDA.AbstractGPUArray{Float32, 2} with AbstractMatrix gives exactly the same results and performance.

A method on AbstractMatrix would override Distances.jl's original implementation (and I guess the performance would be worse than the original implementation on the CPU). But it's possible to depend on just GPUArrays instead of CUDA, so that GPUArrays.JLArrays can be used to test the code in pure CPU mode.
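To illustrate the point that the broadcast implementation only assumes AbstractMatrix, here is a minimal self-contained sketch in base Julia (no Distances.jl, no GPU): `haversine_ref` and `pairwise_bc` are illustrative names, not from any package, and the broadcast version is checked against a scalar reference on plain CPU arrays.

```julia
# Scalar reference Haversine distance (radius r); points are (lon, lat) in degrees.
function haversine_ref(lon1, lat1, lon2, lat2, r)
    Δλ = deg2rad(lon2 - lon1)
    φ1, φ2 = deg2rad(lat1), deg2rad(lat2)
    a = sin((φ2 - φ1) / 2)^2 + cos(φ1) * cos(φ2) * sin(Δλ / 2)^2
    return 2r * asin(min(sqrt(a), one(a)))
end

# Broadcast version over any AbstractMatrix of shape (2, n):
# no scalar indexing in the hot path, so a GPU array would work too.
function pairwise_bc(x::AbstractMatrix, y::AbstractMatrix, r)
    Δλ = deg2rad.(y[1, :] .- x[1, :]')
    φ₁ = deg2rad.(x[2, :])
    φ₂ = deg2rad.(y[2, :])
    a = sin.((φ₂ .- φ₁') ./ 2) .^ 2 .+ cos.(φ₂) .* cos.(φ₁') .* sin.(Δλ ./ 2) .^ 2
    return 2 .* r .* asin.(min.(sqrt.(a), 1))
end

# Plain CPU matrices; columns are (lon, lat) points.
x = [10.0 20.0; 50.0 55.0]
y = [30.0 40.0; 60.0 45.0]
D = pairwise_bc(x, y, 6371.0)   # D[j, i] = distance(x[:, i], y[:, j])
```

The same `pairwise_bc` body would run unchanged on a CuArray or a JLArray, which is exactly what makes a CPU-only test of the GPU code path possible.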

Such a package would be useful, as I have noticed someone trying to add GPU support in a similar style. If a package like DistancesGPU.jl existed, one could simply load it and pass a GPUArray to the normal Distances.jl functions.

It’s possible that Distances should only call its methods on things fairly close to ::Array, and then methods like this for ::AbstractArray could be added there.
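That dispatch layering could look roughly like the following hypothetical sketch (toy squared-Euclidean distance, names made up for illustration): a broadcast-only fallback for any AbstractMatrix, which GPU arrays would hit, plus a specialized tight-loop method for plain ::Matrix.

```julia
# Generic fallback: only broadcasting, reductions, and matmul, so any
# AbstractMatrix (including GPU arrays) can use it.
function pairwise_sqeuclidean(x::AbstractMatrix, y::AbstractMatrix)
    xx = sum(abs2, x; dims = 1)        # 1 × n
    yy = sum(abs2, y; dims = 1)        # 1 × m
    return xx' .+ yy .- 2 .* (x' * y)  # n × m
end

# Specialized method for plain CPU matrices: explicit loops, no large temporaries.
function pairwise_sqeuclidean(x::Matrix, y::Matrix)
    n, m = size(x, 2), size(y, 2)
    D = zeros(promote_type(eltype(x), eltype(y)), n, m)
    for j in 1:m, i in 1:n
        s = zero(eltype(D))
        @inbounds for k in 1:size(x, 1)
            s += (x[k, i] - y[k, j])^2
        end
        D[i, j] = s
    end
    return D
end
```

With this layout, `pairwise_sqeuclidean(rand(3, 4), rand(3, 5))` dispatches to the loop version on CPU, while a GPU array falls through to the broadcast fallback; `invoke` can be used to test the fallback on CPU matrices.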

Yours seems to be quicker than what's in Distances on the CPU, and it can be made faster still. You could probably avoid some more trig by rewriting in terms of cos(φ/2).

julia> function pairwise_bc2(dist::Haversine, x, y)
           @views Δλ = deg2rad.(y[1,:] .- x[1,:]')
           φ₁ = deg2rad.(@view x[2,:])
           φ₂ = deg2rad.(@view y[2,:])
           a = Δλ .= sin.((φ₂ .- φ₁')./2).^2 .+  (cos.(φ₂)) .* (cos.(φ₁))' .* sin.(Δλ./2) .^ 2
           return Δλ .= 2 .* dist.radius .* asin.(min.(sqrt.(a), 1))
       end;

julia> tx3 = randn(2, 2_000);

julia> @btime pairwise_bc1($hd, $tx3, $tx3); # as in Q
  122.366 ms (16 allocations: 152.68 MiB)

julia> @btime pairwise_bc2($hd, $tx3, $tx3);
  88.243 ms (5 allocations: 30.56 MiB)

julia> @btime Distances.pairwise($hd, $tx3, $tx3);
  166.760 ms (3 allocations: 30.52 MiB)

julia> Distances.pairwise(hd, tx3, tx3) ≈ pairwise_bc2(hd, tx3, tx3)
true
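One concrete reading of the trig-saving hint above (not necessarily the exact rewrite meant there) is the half-angle identity sin(x/2)² = (1 − cos(x))/2, which expresses the haversine bracket purely in cosines of the full angle differences. A quick numerical sketch:

```julia
# Original bracket from the haversine formula.
hav_a(φ1, φ2, Δλ) = sin((φ2 - φ1) / 2)^2 + cos(φ1) * cos(φ2) * sin(Δλ / 2)^2

# Rewritten via sin(x/2)^2 == (1 - cos(x)) / 2: no halving, no squaring of sines.
hav_a_cos(φ1, φ2, Δλ) =
    (1 - cos(φ2 - φ1)) / 2 + cos(φ1) * cos(φ2) * (1 - cos(Δλ)) / 2
```

Caveat: the (1 − cos x)/2 form loses precision for very small angles, which is precisely the numerical issue the haversine form was invented to avoid, so it trades accuracy at short distances for fewer trig calls.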