Note that GPU arrays are tied to the Julia GC, and are freed after they go out of scope on the host, so that might be your problem. That means you need to keep data that is alive on the GPU, alive on the CPU as well. Lastly, because there’s no way for us to indicate GPU memory pressure to the Julia GC, you might need to manually finalize(arr)
at some points.
I played some more with the example, now also mapping the outer loop into the GPU’s execution model (makes the code cleaner) and introducing a shared memory buffer. Strangely, that last optimization did not improve performance much. I’d need to look at the profiling results, but I’ve got other things to do right now
For reference, here is my final kernel and calling code:
# _very_ naive/inefficient/ad-hoc pairwise kernel
# can be made much more efficient and generic
function pairwise_dist_kernel(lat_ptr::Ptr{Float32}, lon_ptr::Ptr{Float32},
rowresult_ptr::Ptr{Float32}, n)
lat = CuDeviceArray(n, lat_ptr)
lon = CuDeviceArray(n, lon_ptr)
rowresult = CuDeviceArray((n, n), rowresult_ptr)
i = (blockIdx().x-1) * blockDim().x + threadIdx().x
j = (blockIdx().y-1) * blockDim().y + threadIdx().y
if i <= n && j <= n
# store to shared memory
shmem = @cuDynamicSharedMem(Float32, 2*blockDim().x + 2*2*blockDim().x)
if threadIdx().y == 1
shmem[threadIdx().x] = lat[i]
shmem[blockDim().x + threadIdx().x] = lon[i]
end
if threadIdx().x == 1
shmem[2*blockDim().x + threadIdx().y] = lat[j]
shmem[2*blockDim().x + blockDim().y + threadIdx().y] = lon[j]
end
sync_threads()
# load from shared memory
lat_i = shmem[threadIdx().x]
lon_i = shmem[blockDim().x + threadIdx().x]
lat_j = shmem[2*blockDim().x + threadIdx().y]
lon_j = shmem[2*blockDim().x + blockDim().y + threadIdx().y]
@inbounds rowresult[i, j] = haversine_gpu(lat_i, lon_i, lat_j, lon_j, 6372.8f0)
end
end
function pairwise_dist_gpu(lat::Vector{Float32}, lon::Vector{Float32})
# upload
lat_gpu = CuArray(lat)
lon_gpu = CuArray(lon)
# allocate
n = length(lat)
rowresult_gpu = CuArray{Float32}(n, n)
# calculate launch configuration
threads = (32, 32)
blocks = ceil.(Int, n ./ threads)
# calculate size of dynamic shared memory
shmem = 2 * sum(threads) * sizeof(Float32)
# XXX: pass arrays as pointer to avoid synchronization / global ptr loads
@cuda (blocks, threads, shmem) pairwise_dist_kernel(pointer(lat_gpu), pointer(lon_gpu),
pointer(rowresult_gpu), n)
return Array(rowresult_gpu)
end
I’ve reverted the “smart” launch configuration calculation to a hard-coded (32,32)
(calculating the block size accordingly), which you might want to tune for your GPU. For maximally clean code, you may also remove the shared memory, using regular lon[i]
array accesses instead, as it didn’t seem to speed up much on my system. Of course, at this point the execution time is dominated by the final transfer of memory, so there’s much less to speed-up at this point:
@benchmark(pairwise_dist_gpu(lat, lon)) = Trial(103.985 ms)
@benchmark(pairwise_dist_cpu(lat, lon)) = Trial(5.031 s)
Out of that 100ms, the kernel takes about 20ms…
I guess the conclusion of all this is that you still need quite some experience with GPU programming to get good performance; at least it’s possible to do so from Julia now (albeit with some warts and deficiencies still). Let’s hope the higher-level libraries, a la GPUArrays.jl, can use this functionality to implement some more user-friendly functionality.