I got a workstation with 2 P5000 cards in it and I want to check out CUDAnative.jl (or other options people can suggest). I understand why GPUs can produce faster results for brute-force calculations, but this is my first try…any help appreciated (and I swear this isn’t homework)
I’m trying to calculate a distance matrix, where I use the haversine formula and generated values for lat/lon. Here is the fastest I can figure out in Julia (serially):
```julia
#Generate reasonable data
lat = rand(Float32, 10000) .* 45
lon = rand(Float32, 10000) .* 120

#https://github.com/quinnj/RosettaJulia/blob/master/src/Haversine.jl
haversine(lat1::Float32,lon1::Float32,lat2::Float32,lon2::Float32) = 2 * 6372.8 * asin(sqrt(sind((lat2 - lat1)/2)^2 + cosd(lat1) * cosd(lat2) * sind((lon2 - lon1)/2)^2))

function pairwise_dist(lat::Array{Float32}, lon::Array{Float32})
    #Preallocate, since size is known
    n = length(lat)
    rowresult = Array{Float32}(n, n)
    #Brute force fill in each cell
    for i in 1:n
        for j in 1:n
            @inbounds rowresult[i, j] = haversine(lat[i], lon[i], lat[j], lon[j])
        end
    end
    return rowresult
end

@time native_julia_cellwise = pairwise_dist(lat, lon)
6.173783 seconds (6 allocations: 381.470 MiB, 1.64% gc time)
```
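(An aside, not from my original timings: since haversine distance is symmetric and the diagonal is zero, a serial variant that only computes the upper triangle halves the work before a GPU is even involved. This is a sketch of mine; the function name `pairwise_dist_sym` and the untyped `haversine` signature are my own, and `zeros` is used so the constructor works across Julia versions.)

```julia
# Sketch (mine, not the post's code): exploit symmetry of the distance
# matrix -- d(i, j) == d(j, i) and d(i, i) == 0 -- to halve the work.
haversine(lat1, lon1, lat2, lon2) = 2 * 6372.8 *
    asin(sqrt(sind((lat2 - lat1) / 2)^2 +
              cosd(lat1) * cosd(lat2) * sind((lon2 - lon1) / 2)^2))

function pairwise_dist_sym(lat::Vector{Float32}, lon::Vector{Float32})
    n = length(lat)
    result = zeros(Float32, n, n)   # diagonal stays zero
    for i in 1:n, j in (i + 1):n
        d = haversine(lat[i], lon[i], lat[j], lon[j])
        @inbounds result[i, j] = d  # fill upper triangle...
        @inbounds result[j, i] = d  # ...and mirror into the lower one
    end
    return result
end
```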
I fill in cells one at a time into a preallocated matrix, which seems pretty memory efficient and looks reasonable using @code_warntype. My idea for writing the kernel was that I’d take a single point, calculate the distances to every other point as an array (this is where the 1000s of threads on GPU would give me the speedup):
```julia
function pairwise_dist_row(lat::Array{Float32}, lon::Array{Float32})
    #Preallocate, since size is known
    n = length(lat)
    rowresult = Array{Float32}(n, n)
    #Do whole column at a time, then write
    #zip kills performance? what causes the allocations?
    for i in 1:n
        temp = map(x -> haversine(lat[i], lon[i], x[1], x[2]), zip(lat, lon))
        @inbounds rowresult[:, i] = temp
    end
    return rowresult
end

@time native_julia_rowwise = pairwise_dist_row(lat, lon)
13.978353 seconds (500.25 M allocations: 8.575 GiB, 4.58% gc time)
```
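(Partly answering my own comment about the allocations: the `map` over `zip` builds a fresh temporary array on every outer iteration, and the closure captures `i`. A sketch of mine, not the post's code, that keeps the same column-at-a-time shape but broadcasts straight into a view of the preallocated column, so the loop body allocates nothing beyond the result matrix; `pairwise_dist_bcast` is my name.)

```julia
# Sketch (mine): same column-at-a-time idea, but fused broadcasting
# writes in place, avoiding the per-iteration temporary from map/zip.
haversine(lat1, lon1, lat2, lon2) = 2 * 6372.8 *   # same formula as above
    asin(sqrt(sind((lat2 - lat1) / 2)^2 +
              cosd(lat1) * cosd(lat2) * sind((lon2 - lon1) / 2)^2))

function pairwise_dist_bcast(lat::Vector{Float32}, lon::Vector{Float32})
    n = length(lat)
    result = zeros(Float32, n, n)
    for i in 1:n
        # broadcast the scalar point (lat[i], lon[i]) against the full
        # vectors, writing directly into column i of the result
        @views result[:, i] .= haversine.(lat[i], lon[i], lat, lon)
    end
    return result
end
```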
As you can see, the performance of the “distance between one point and entire row” version is twice as bad, with tons of allocations and memory usage.
Questions:

Am I thinking about this problem correctly from a GPU standpoint? I replaced the inner ‘j’ loop with a map; is there a way I can do the distance calculations without the remaining loop that chooses the single point?

What is the correct way to think about loops in the context of the GPU? If I’m thinking about this example correctly (https://github.com/JuliaGPU/CUDAnative.jl/blob/master/examples/peakflops.jl), the loop happens in the kernel, and thus the 1:33 portion is serial, but the `fma` step is the part that has the massive parallelism from the GPU? Does it ever make sense to have the loop outside the kernel (i.e. controlled by Julia) in order to manage the amount of GPU memory being used?