How to write device code?


#1

hi there

newcomer question here. suppose i am looking at some cuda code that has a function like

__device__ int binaryVal(const REAL x, const int n, const REAL* X)
{
}

where I read the __device__ part as saying that this function will only be called on the gpu device itself (not called directly from the cpu).

I understand that the @cuda macro has the effect of exposing a function on the gpu to the main julia context, as in

__global__ void vfStep(const parameters param, const REAL* K, const REAL* Z,
		       const REAL* P, const REAL* V0, REAL* V, REAL* G) 
{
}

so, question: is there a julian difference between __global__ and __device__? if so, how do i specify it? if not, I just say @cuda in front of all code needed on the GPU and that’s it?

thanks!


#2

You don’t need to specify which functions are device, global or host functions, just add @cuda in front of a call to what you would have declared __global__ in CUDA C.
And every other function you call from there is implicitly a __device__ function.
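For example, a minimal sketch of that pattern (the function names and launch configuration here are made up, purely for illustration):

using CUDAnative

# implicitly a "__device__" function: it is only ever called from kernel code
double(x) = x + x

# the "__global__" equivalent: an ordinary Julia function that writes its
# result into an output array and returns nothing
function double_kernel(out, xs)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(xs)
        @inbounds out[i] = double(xs[i])
    end
    return nothing
end

# only the call site is annotated:
# @cuda threads=length(xs) double_kernel(d_out, d_xs)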

Note that the actual implementation might be device-specific, e.g. by calling GPU-incompatible code or using device intrinsics, and for that reason we will probably add such an annotation in the future.


#3

I see - very helpful! So, looking at https://github.com/JuliaGPU/CUDAnative.jl/blob/2271487e4fd493c6a66a9b9bb48110fa6f6b6a1e/examples/pairwise.jl#L85-L99:

  • I specify a low-level implementation function, using appropriate data types, and using CUDAnative.cospi (for example) in place of standard library functions:
function haversine_gpu(lat1::Float32, lon1::Float32, lat2::Float32, lon2::Float32, radius::Float32)
  • I can specify another function, again using CuArrays, to provide an iteration over that function
function pairwise_dist_kernel(lat::CuDeviceVector{Float32}, lon::CuDeviceVector{Float32},
                              rowresult::CuDeviceMatrix{Float32}, n)
  • (notice to myself: not a single @cuda used so far!)
  • then there is an interface function, setting up blocks and thread sizes, casting standard Array{Float32} as CuArray, and eventually emitting a call to above pairwise_dist_kernel as a @cuda
  • right?

couple more things

  • Float32. you recommend that for most (all?) gpus? which characteristic of my GPU would tell me more about that?
  • we always provide an output array to a @cuda function (i.e. we always return nothing)?
  • thanks for this great example and the excellent work! :slight_smile:

#4

Sounds mostly right; some notes:

I specify a low-level implementation function, using appropriate data types, and using CUDAnative.cospi (for example) in place of standard library functions:

Types aren’t necessary, so you can code generically. Also note that, with Julia, you can e.g. pass CUDAnative.cospi as an argument to that function.
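A quick sketch of what that looks like (this scalar function is made up, not from the linked example):

# no type annotations: the same definition compiles for Float32, Float64, ...
# and the function to apply is just another argument
scaled_cos(cospifn, x, s) = s * cospifn(x)

# on the host you could pass Base.cospi:         scaled_cos(cospi, 0.25, 2.0)
# from device code, CUDAnative's version:        scaled_cos(CUDAnative.cospi, 0.25f0, 2.0f0)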

I can specify another function, again using CuArrays, to provide an iteration over that function

These are CuDeviceArrays, a type provided by CUDAnative, not to be confused with CuArray, the host-side GPU array type (as implemented by either CuArrays.jl or CUDAdrv.jl).

Depending on which type of iteration you need, you could also just broadcast or reduce a scalar function like haversine_gpu over some CuArrays, avoiding the need for any @cuda or GPU-specific code like pairwise_dist_kernel. At this point, there aren’t many of those iteration functions implemented in CuArrays though, so you’ll often end up implementing your own kernels (if you keep them generic, please contribute them to CuArrays.jl :slightly_smiling_face:).
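For instance, something along these lines (assuming haversine_gpu from the linked example is in scope; the data here is random and just for illustration):

using CuArrays

lat1 = cu(rand(Float32, 1000)); lon1 = cu(rand(Float32, 1000))
lat2 = cu(rand(Float32, 1000)); lon2 = cu(rand(Float32, 1000))

# the fused broadcast compiles to a single GPU kernel; no @cuda in sight
dists = haversine_gpu.(lat1, lon1, lat2, lon2, 6372.8f0)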

then there is an interface function, setting up blocks and thread sizes, casting standard Array{Float32} as CuArray, and eventually emitting a call to above pairwise_dist_kernel as a @cuda

It would probably make sense for your outer interface to take a CuArray and work with that directly, making it possible to compose operations without paying the cost of transferring to and from device memory.
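A sketch of such an interface (the wrapper name and the 1D launch configuration are simplifications; the actual pairwise example uses a 2D configuration):

# hypothetical wrapper: takes and returns CuArrays, so repeated calls
# compose without round-tripping through host memory
function pairwise_dist_gpu(lat::CuArray{Float32}, lon::CuArray{Float32})
    n = length(lat)
    rowresult = CuArray{Float32}(undef, n, n)
    threads = 256
    blocks = cld(n, threads)
    @cuda blocks=blocks threads=threads pairwise_dist_kernel(lat, lon, rowresult, n)
    return rowresult
end

# the caller decides when to pay for transfers:
# d_lat = CuArray(lat); d_lon = CuArray(lon)
# dists = Array(pairwise_dist_gpu(d_lat, d_lon))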

Float32. you recommend that for most (all?) gpus? which characteristic of my GPU would tell me more about that?

Generally yes, Float64 is way slower (think 50-100x). Some high-end GPUs feature slightly better FP64 performance, but it’s still slow and better avoided.

we always provide an output array to a @cuda function (i.e. we always return nothing)?

Kernels can’t return values. Nowadays (Julia 0.7 + latest CUDAnative) you don’t need to return nothing explicitly, as kernels are wrapped automatically.