# Thinking through distance matrix calculation

#1

I got a workstation with 2 P5000 cards in it and I want to check out CUDAnative.jl (or other options people can suggest). I understand why GPUs can produce faster results for brute force calculations, but this is my first try…any help appreciated (and I swear this isn’t homework )

I’m trying to calculate a distance matrix, where I use the haversine formula and generated values for lat/lon. Here is the fastest I can figure out in Julia (serially):

``````#Generate reasonable data
lat = rand(Float32, 10000) .* 45
lon = rand(Float32, 10000) .* -120

#https://github.com/quinnj/Rosetta-Julia/blob/master/src/Haversine.jl
haversine(lat1::Float32,lon1::Float32,lat2::Float32,lon2::Float32) = 2 * 6372.8 * asin(sqrt(sind((lat2-lat1)/2)^2 + cosd(lat1) * cosd(lat2) * sind((lon2 - lon1)/2)^2))

function pairwise_dist(lat::Array{Float32}, lon::Array{Float32})

#Pre-allocate, since size is known
n = length(lat)
rowresult = Array{Float32}(n, n)

#Brute force fill in each cell
for i in 1:n
for j in 1:n
@inbounds rowresult[i, j] = haversine(lat[i], lon[i], lat[j], lon[j])
end
end

return rowresult

end

@time native_julia_cellwise = pairwise_dist(lat, lon)
6.173783 seconds (6 allocations: 381.470 MiB, 1.64% gc time)
``````

I fill in cells one at a time into a pre-allocated matrix, which seems pretty memory efficient and looks reasonable using @code_warntype. My idea for writing the kernel was that I’d take a single point, calculate the distances to every other point as an array (this is where the 1000s of threads on GPU would give me the speed-up):

``````function pairwise_dist_row(lat::Array{Float32}, lon::Array{Float32})

#Pre-allocate, since size is known
n = length(lat)
rowresult = Array{Float32}(n, n)

#Do whole column at a time, then write
#zip kills performance? what cause the allocations?
for i in 1:n
temp = map(x -> haversine(lat[i], lon[i], x[1], x[2]), zip(lat, lon))
@inbounds rowresult[:, i] = temp
end

return rowresult

end

@time native_julia_rowwise = pairwise_dist_row(lat, lon)
13.978353 seconds (500.25 M allocations: 8.575 GiB, 4.58% gc time)
``````

As you can see, the performance of the “distance between one-point and entire row” is twice as bad, with tons of allocations and memory usage.

### Questions:

1. Am I thinking about this problem correctly from a GPU standpoint? I replaced the inner ‘j’ loop with a map; is there a way I can do the distance calculations without the remaining loop that chooses the single point?

2. What is the correct way to think about loops in the context of the GPU? If I’m thinking about this example correctly (https://github.com/JuliaGPU/CUDAnative.jl/blob/master/examples/peakflops.jl), the loop happens in the kernel, and thus the 1:33 portion is serial, but the `fma` step is the part that has the massive parallelism from the GPU? Does it ever make sense to have the loop outside the kernel (i.e. controlled by Julia) in order to manage the amount of GPU memory being used?

#2

Since the above question might be a bit hard to follow what I’m saying, here’s a second try with a CUDA try:

``````using CUDAnative, CUDAdrv, Base.Test

#This is incomplete, but demonstrates the approach I took passing in a constant lat/lon and
#an entire array for GPU processing
function kernel_haversine(a, b, lat, lon, out)

i = (blockIdx().x-1) * blockDim().x + threadIdx().x

out[i] = a + b + lat[i] + lon[i]

return nothing
end

function distmat(lat::Array, lon::Array; dev::CuDevice=CuDevice(0))

#Create a context
ctx = CuContext(dev)

#Change to objects with CUDA context
n = length(lat)
d_lat = CuArray(lat)
d_lon = CuArray(lon)
d_out = CuArray(Array{Float32}(n))

#Calculate number of calculations, threads, blocks
len = n
threads = min(len, 1024)
blocks = Int(ceil(len/threads))

#Julia side accumulation of results to relieve GPU memory pressure
accum = Array{Float32}(n, n)

# run and time the test
secs = CUDAdrv.@elapsed begin
for i in 1:n
@cuda (blocks, threads) kernel_haversine(lat[i], lon[i], d_lat, d_lon, d_out)
accum[:, i] = Array(d_out)
end
end

#Clean up context
destroy(ctx)

#Return timing and bring results back to Julia
return (secs, Array(accum))

end

#Run kernel, get results
timing, result = distmat(lat, lon)

``````

My ultimate question is whether I need this for loop, or is there another trick I’m missing?

``````    for i in 1:n
@cuda (blocks, threads) kernel_haversine(lat[i], lon[i], d_lat, d_lon, d_out)
accum[:, i] = Array(d_out)
end
``````

#3

Possibly, but we haven’t got such primitives that map onto GPU calculations. Even that inner map isn’t well-supported yet, only GPUArrays.jl has taken a stab at such abstractions. But conceptually it makes sense, at least as a starting point.

There’s multiple reasons you might want to put the loop outside of the kernel: because individual kernels are independent (and thus the GPU can schedule them independently), because you’re running against launch limits, because you need device-wide synchronization, etc. But in this case, I would put the loop inside the kernel, especially because it is a regular loop, ie. it does not diverge across threads. Alternatively, you can also map the outer loop into the GPU’s execution model, because all of the grid/block/thread getters are multidimensional. But I’ll keep it simple for now.

For example, using a slightly different `haverside` function as we don’t have `sind`/`cosd` on GPU yet:

``````using CUDAnative, CUDAdrv

function haversine_gpu(lat1::Float32, lon1::Float32, lat2::Float32, lon2::Float32, radius::Float32)
c1 = CUDAnative.cospi(lat1 / 180.0f0)
c2 = CUDAnative.cospi(lat2 / 180.0f0)
dlat = lat2 - lat1
dlon = lon2 - lon1
d1 = CUDAnative.sinpi(dlat / 360.0f0)
d2 = CUDAnative.sinpi(dlon / 360.0f0)
t = d2 * d2 * c1 * c2
a = d1 * d1 + t
c = 2.0f0 * CUDAnative.asin(CUDAnative.min(1.0f0, CUDAnative.sqrt(a)))
return radius * c
end

function pairwise_dist_kernel(lat, lon, rowresult, n)
i = (blockIdx().x-1) * blockDim().x + threadIdx().x

for j in 1:n
rowresult[i, j] = haversine_gpu(lat[i], lon[i], lat[j], lon[j] , 6372.8f0)
end
end

function pairwise_dist_gpu(lat::Vector{Float32}, lon::Vector{Float32})
# upload
lat_gpu = CuArray(lat)
lon_gpu = CuArray(lon)

# allocate
n = length(lat)
rowresult_gpu = CuArray{Float32}(n, n)

@cuda (n, 1) pairwise_dist_kernel(lat_gpu, lon_gpu, rowresult_gpu, n)

return Array(rowresult_gpu)
end
``````

There’s different things to notice here:

1. this is a quick, completely non-generic implementation that could be hugely improved
2. the `haversine_gpu` function is tightly typed in that it really sticks to using `Float32`: `Float64` is many times slower on GPU
3. the kernel is very inefficient

Although this kernel matches how you initially approached this problem (ie. keep the outer loop, map the inner one onto the GPUs parallelism), it is inefficient as it doesn’t use the GPU very well. Most importantly, it wastes a lot of memory bandwidth with redundant loads from memory. I would improve it by performing loads to shared memory across threads in a block, but that would require us to group threads in blocks which can contain far fewer threads (and thus requires a “smart” grouping).

There’s also some deficiencies in CUDAnative which make this code less nice / perform worse as well:

1. the need to prefix math functions with CUDAnative (eg. `CUDAnative.sin`)
2. passing rich objects to kernels, eg. the arrays in this example, result in memory being allocated and pointers being passed (because of Julia’s calling conventions); this synchronizes the GPU and is bad for performance (pointer loads from global memory instead of parameter memory)

For reference, the CPU implementation I used:

``````function haversine_cpu(lat1::Float32, lon1::Float32, lat2::Float32, lon2::Float32, radius::Float32)
c1 = cospi(lat1 / 180.0f0)
c2 = cospi(lat2 / 180.0f0)
dlat = lat2 - lat1
dlon = lon2 - lon1
d1 = sinpi(dlat / 360.0f0)
d2 = sinpi(dlon / 360.0f0)
t = d2 * d2 * c1 * c2
a = d1 * d1 + t
c = 2.0f0 * asin(min(1.0f0, sqrt(a)))
return radius * c
end

function pairwise_dist_cpu(lat::Vector{Float32}, lon::Vector{Float32})
# allocate
n = length(lat)
rowresult = Array{Float32}(n, n)

# brute force fill in each cell
for i in 1:n
for j in 1:n
@inbounds rowresult[i, j] = haversine_cpu(lat[i], lon[i], lat[j], lon[j] , 6372.8f0)
end
end

return rowresult
end
``````

I compared both using `BenchmarkTools`:

``````const dev = CuDevice(0)
const ctx = CuContext(dev)

# generate reasonable data
const n = 10000
const lat = rand(Float32, n) .* 45
const lon = rand(Float32, n) .* -120

using Base.Test
@test pairwise_dist_cpu(lat, lon) ≈ pairwise_dist_gpu(lat, lon)

using BenchmarkTools
@show @benchmark pairwise_dist_gpu(lat, lon)
@show @benchmark pairwise_dist_cpu(lat, lon)

destroy(ctx)
``````
``````@benchmark(pairwise_dist_gpu(lat, lon)) = Trial(1.061 s)
@benchmark(pairwise_dist_cpu(lat, lon)) = Trial(5.032 s)
``````

So still faster, despite the very naive implementation and loads of deficiencies For example, reducing the number of points to 1000 (which fits in a single block) and scheduling 1 block with `n` threads instead of the inverse, yields a much higher speed-up:

``````@benchmark(pairwise_dist_gpu(lat, lon)) = Trial(3.873 ms)
@benchmark(pairwise_dist_cpu(lat, lon)) = Trial(49.979 ms)
``````

This is on a i7-3770K vs a GeForce GTX TITAN. Of course, we’re comparing against single-threaded CPU performance, and executing a very small kernel while including all memory transfers (realistic applications would keep data much longer on the GPU, and perform more calculations), so you should never expect huge speed-ups here.

#4

Implementing that “smart” grouping:

``````function pairwise_dist_kernel(lat_ptr::Ptr{Float32}, lon_ptr::Ptr{Float32},
rowresult_ptr::Ptr{Float32}, n)
lat = CuDeviceArray(n, lat_ptr)
lon = CuDeviceArray(n, lon_ptr)
rowresult = CuDeviceArray((n, n), rowresult_ptr)

i = (blockIdx().x-1) * blockDim().x + threadIdx().x

if i <= n
for j in 1:n
@inbounds rowresult[i, j] = haversine_gpu(lat[i], lon[i], lat[j], lon[j] , 6372.8f0)
end
end
end

function pairwise_dist_gpu(lat::Vector{Float32}, lon::Vector{Float32})
# upload
lat_gpu = CuArray(lat)
lon_gpu = CuArray(lon)

# allocate
n = length(lat)
rowresult_gpu = CuArray{Float32}(n, n)

# calculate launch configuration
ctx = CuCurrentContext()
dev = device(ctx)
max_threads = attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)
threads = min(max_threads, n)
blocks = ceil(Int, n/threads)

# XXX: pass arrays as pointer to avoid synchronization / global ptr loads
@cuda (blocks, threads) pairwise_dist_kernel(pointer(lat_gpu), pointer(lon_gpu),
pointer(rowresult_gpu), n)

return Array(rowresult_gpu)
end
``````
``````@benchmark(pairwise_dist_gpu(lat, lon)) = Trial(115.260 ms)
@benchmark(pairwise_dist_cpu(lat, lon)) = Trial(5.076 s)
``````

#5

Thanks for the confirmation that I’m on the right track @maleadt and for the suggestions. Last night, I went with this, which covers some of your points:

``````function pairwise_dist(lat::Array{Float32}, lon::Array{Float32})

#Pre-allocate, since size is known
n = length(lat)
rowresult = Array{Float32}(n, n)

#Brute force fill in each cell
for i in 1:n
for j in 1:n
@inbounds rowresult[i, j] = haversine(lat[i], lon[i], lat[j], lon[j])
end
end

return rowresult

end

using CUDAnative, CUDAdrv, Base.Test

function kernel_haversine(a, b, lat, lon, out)

i = (blockIdx().x-1) * blockDim().x + threadIdx().x

#out[i] = a + b + lat[i] + lon[i]
out[i] =  2 * 6372.8 * CUDAnative.asin(CUDAnative.sqrt(CUDAnative.sind((a-lat[i])/2)^2 + CUDAnative.cosd(lat[i]) * CUDAnative.cosd(a) * CUDAnative.sind((b - lon[i])/2)^2))

return nothing
end

function distmat(lat::Array, lon::Array; dev::CuDevice=CuDevice(0))

#Create a context
ctx = CuContext(dev)

#Change to objects with CUDA context
n = length(lat)
d_lat = CuArray(lat)
d_lon = CuArray(lon)
d_out = CuArray(Array{Float32}(n))

#Calculate number of calculations, threads, blocks
len = n
threads = min(len, 1024)
blocks = Int(ceil(len/threads))

#Julia side accumulation of results to relieve GPU memory pressure
accum = Array{Float32}(n, n)

# run and time the test
secs = CUDAdrv.@elapsed begin
for i in 1:n
@cuda (blocks, threads) kernel_haversine(lat[i], lon[i], d_lat, d_lon, d_out)
accum[:, i] = Array(d_out)
end
end

#Clean up context
destroy(ctx)

#Return timing and bring results back to Julia
return (secs, Array(accum))

end

``````

I saw a 17x speedup (~593s/31s) on a 100,000 point distance calculation, using a GTX1060 6GB, which I was pretty happy with. It’s amazing that there are several more improvements I can make to squeeze out more efficiency. I had noticed that my memory was steadily increasing.

Hopefully I can get the hang of this more quickly and start to contribute to the libraries, especially with the new GPU dataframe FOSS project just announced at the NVIDIA conference.

#6

Note that GPU arrays are tied to the Julia GC, and are freed after they go out of scope on the host, so that might be your problem. That means you need to keep data that is alive on the GPU, alive on the CPU as well. Lastly, because there’s no way for us to indicate GPU memory pressure to the Julia GC, you might need to manually `finalize(arr)` at some points.

I played some more with the example, now also mapping the outer loop into the GPU’s execution model (makes the code cleaner) and introducing a shared memory buffer. Strangely, that last optimization did not improve performance much. I’d need to look at the profiling results, but I’ve got other things to do right now

For reference, here is my final kernel and calling code:

``````# _very_ naive/inefficient/ad-hoc pairwise kernel
# can be made much more efficient and generic
function pairwise_dist_kernel(lat_ptr::Ptr{Float32}, lon_ptr::Ptr{Float32},
rowresult_ptr::Ptr{Float32}, n)
lat = CuDeviceArray(n, lat_ptr)
lon = CuDeviceArray(n, lon_ptr)
rowresult = CuDeviceArray((n, n), rowresult_ptr)

i = (blockIdx().x-1) * blockDim().x + threadIdx().x
j = (blockIdx().y-1) * blockDim().y + threadIdx().y

if i <= n && j <= n
# store to shared memory
shmem = @cuDynamicSharedMem(Float32, 2*blockDim().x + 2*2*blockDim().x)
if threadIdx().y == 1
shmem[threadIdx().x] = lat[i]
shmem[blockDim().x + threadIdx().x] = lon[i]
end
if threadIdx().x == 1
shmem[2*blockDim().x + threadIdx().y] = lat[j]
shmem[2*blockDim().x + blockDim().y + threadIdx().y] = lon[j]
end
sync_threads()

# load from shared memory
lat_i = shmem[threadIdx().x]
lon_i = shmem[blockDim().x + threadIdx().x]
lat_j = shmem[2*blockDim().x + threadIdx().y]
lon_j = shmem[2*blockDim().x + blockDim().y + threadIdx().y]

@inbounds rowresult[i, j] = haversine_gpu(lat_i, lon_i, lat_j, lon_j, 6372.8f0)
end
end

function pairwise_dist_gpu(lat::Vector{Float32}, lon::Vector{Float32})
# upload
lat_gpu = CuArray(lat)
lon_gpu = CuArray(lon)

# allocate
n = length(lat)
rowresult_gpu = CuArray{Float32}(n, n)

# calculate launch configuration
threads = (32, 32)
blocks = ceil.(Int, n ./ threads)

# calculate size of dynamic shared memory
shmem = 2 * sum(threads) * sizeof(Float32)

# XXX: pass arrays as pointer to avoid synchronization / global ptr loads
@cuda (blocks, threads, shmem) pairwise_dist_kernel(pointer(lat_gpu), pointer(lon_gpu),
pointer(rowresult_gpu), n)

return Array(rowresult_gpu)
end
``````

I’ve reverted the “smart” launch configuration calculation to a hard-coded `(32,32)` (calculating the block size accordingly), which you might want to tune for your GPU. For maximally clean code, you may also remove the shared memory, using regular `lon[i]` array accesses instead, as it didn’t seem to speed up much on my system. Of course, at this point the execution time is dominated by the final transfer of memory, so there’s much less to speed-up at this point:

``````@benchmark(pairwise_dist_gpu(lat, lon)) = Trial(103.985 ms)
@benchmark(pairwise_dist_cpu(lat, lon)) = Trial(5.031 s)
``````

Out of that 100ms, the kernel takes about 20ms…

I guess the conclusion of all this is that you still need quite some experience with GPU programming to get good performance; at least it’s possible to do so from Julia now (albeit with some warts and deficiencies still). Let’s hope the higher-level libraries, a la GPUArrays.jl, can use this functionality to implement some more user-friendly functionality.