Sounds mostly right; some notes:
> I specify a low-level implementation function, using appropriate data types, and using `CUDAnative.cospi` (for example) in place of std-lib functions:
Types aren’t necessary, so you can code generically. Also note that with Julia you can, e.g., pass `CUDAnative.cospi` as an argument to that function.
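For example, a sketch of what that generic style could look like (the function name and argument set here are mine, not your exact code):

```julia
using CUDAnative

# A sketch: the trig functions are passed as arguments, so the same generic
# method runs on the CPU (with the Base functions) and inside a GPU kernel
# (with the CUDAnative intrinsics). No type annotations: Julia specializes
# the method for whatever types it is called with.
function haversine(lat1, lon1, lat2, lon2, radius, cospi, sinpi, asin)
    c1 = cospi(lat1 / 180)
    c2 = cospi(lat2 / 180)
    d1 = sinpi((lat2 - lat1) / 360)
    d2 = sinpi((lon2 - lon1) / 360)
    a = d1 * d1 + d2 * d2 * c1 * c2
    return 2 * radius * asin(min(one(a), sqrt(a)))
end

# CPU call with std-lib functions:
haversine(10f0, 20f0, 30f0, 40f0, 6372.8f0, Base.cospi, Base.sinpi, Base.asin)
# in a kernel, pass CUDAnative.cospi, CUDAnative.sinpi and CUDAnative.asin instead
```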
> I can specify another function, again using CuArrays, to provide an iteration over that function
These are `CuDeviceArray`s, a type from CUDAnative, not to be confused with `CuArray`, which implements the host-side GPU array type (as provided by either CuArrays.jl or CUDAdrv.jl).
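A minimal, made-up illustration of that distinction:

```julia
using CUDAnative, CuArrays

# The kernel argument can be annotated as CuDeviceArray, the device-side
# type; @cuda converts the host-side CuArray to it when launching.
function scale_kernel(out::CuDeviceArray, factor)
    i = threadIdx().x
    @inbounds out[i] *= factor
    return
end

out = CuArray(ones(Float32, 32))   # host-side handle from CuArrays.jl
@cuda threads=32 scale_kernel(out, 2f0)
```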
Depending on which type of iteration you need, you could also just broadcast or reduce a scalar function like `haversine_gpu` over some `CuArray`s, avoiding the need for any `@cuda` or GPU-specific code like `pairwise_dist_kernel`. At this point there aren’t many of those iteration functions implemented in CuArrays though, so you’ll often end up implementing your own kernels (if you keep them generic, please contribute them to CuArrays.jl).
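A sketch of that broadcast style, assuming your scalar `haversine_gpu` is in scope (input sizes are arbitrary):

```julia
using CuArrays

# Broadcasting fuses the whole element-wise computation into a single GPU
# kernel; no @cuda or hand-written kernel needed.
lat1, lon1 = cu(rand(Float32, 1000)), cu(rand(Float32, 1000))
lat2, lon2 = cu(rand(Float32, 1000)), cu(rand(Float32, 1000))

d = haversine_gpu.(lat1, lon1, lat2, lon2, 6372.8f0)
total = sum(d)   # reductions like sum execute on the GPU as well
```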
> Then there is an interface function, setting up block and thread sizes, casting a standard `Array{Float32}` to `CuArray`, and eventually emitting a call to the above `pairwise_dist_kernel` via `@cuda`.
It would probably make sense for your outer interface to take a `CuArray` and work with that directly, making it possible to compose operations without paying the cost of transferring to and from device memory.
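Roughly like this; the launch configuration and the kernel signature below are placeholders for whatever your `pairwise_dist_kernel` expects, and `lat_cpu`/`lon_cpu` are hypothetical host arrays:

```julia
using CUDAnative, CuArrays

# Placeholder interface: takes CuArrays, launches the kernel, and returns a
# CuArray, so several GPU operations can be chained before anything is
# copied back to the host.
function pairwise_dist(lat::CuArray{Float32}, lon::CuArray{Float32})
    n = length(lat)
    out = similar(lat, n, n)              # allocated on the device
    threads = 256                         # launch configuration is a guess
    blocks = cld(n * n, threads)
    @cuda threads=threads blocks=blocks pairwise_dist_kernel(out, lat, lon, n)
    return out                            # still on the GPU; Array(out) copies back
end

# convert once at the boundary instead of inside every call:
d = pairwise_dist(CuArray(lat_cpu), CuArray(lon_cpu))
```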
> Float32: do you recommend that for most (all?) GPUs? Which characteristic of my GPU would tell me more about that?
Generally yes: Float64 is way slower (think 50-100x). Some high-end GPUs feature slightly better FP64 performance, but it’s still slow and better avoided. The characteristic to check is the FP64:FP32 throughput ratio in your GPU’s spec sheet; on most consumer cards it is 1:32 or worse.
> We always provide an output array to a `@cuda` function (i.e. we always `return nothing`)?
Kernels can’t return values. Nowadays (Julia 0.7 + latest CUDAnative) you don’t need to `return nothing` explicitly, as kernels are wrapped automatically.
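For example, with a made-up kernel: results are communicated through the `out` array, and the trailing `return nothing` can now be left out:

```julia
using CUDAnative, CuArrays

# With Julia 0.7 and a recent CUDAnative, @cuda wraps the kernel so the
# value of the last expression is discarded automatically.
function add_kernel(out, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = a[i] + b[i]
    end
end

a = CuArray(rand(Float32, 1000))
b = CuArray(rand(Float32, 1000))
out = similar(a)
@cuda threads=256 blocks=cld(length(out), 256) add_kernel(out, a, b)
```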