The relu function works elementwise, returning the input where it is non-negative and zero otherwise, i.e. relu(x) = ifelse.(x .> 0, x, 0).
Does the fact that it works elementwise mean I need to write a GPU kernel for it, or can I simply apply it to a CuArray?
Or perhaps I need to convert it to an in-place version, relu!(y, x) = begin y .= ifelse.(x .> 0, x, 0); return nothing end?
Also, if I am writing a package and have no idea whether the user has a CPU or a specific GPU, how can I write code that works independently of the hardware? The user may have data in a standard Array, a CuArray, a ROCArray, a oneArray, or an MtlArray… and she just calls the function (my function) and the computation is done on the appropriate hardware.
CUDA.jl has some useful docs on this. Writing it in terms of broadcasting is array programming, which runs on the GPU (when the input is a CuArray); if you can express the operation in terms of operations like that, then you don't need to write a kernel. So broadcasting is a good way to go (and it supports accelerators other than CUDA; GPUArrays.jl is the generic package, I believe). If you do end up needing to write kernels and want to do so in a generic way, then KernelAbstractions.jl is the package for that.
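For example, a minimal sketch of the broadcast-only approach (untested; the relu/relu! names and the zero(eltype(x)) default, which keeps the element type stable, are just illustrative):

```julia
# Works on Array, CuArray, ROCArray, oneArray, MtlArray, ... because
# each GPU back-end overloads broadcasting for its own array type.
relu(x::AbstractArray) = ifelse.(x .> 0, x, zero(eltype(x)))

# In-place variant: writes into y's existing storage.
function relu!(y::AbstractArray, x::AbstractArray)
    y .= ifelse.(x .> 0, x, zero(eltype(x)))
    return y
end
```

The same function then runs on whatever hardware owns the data, e.g.

```julia
using CUDA
x = CUDA.rand(Float32, 1024)  # a CuArray on the GPU
y = relu(x)                   # the broadcast executes on the GPU
```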
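And if you do end up needing a hand-written kernel, here is a sketch of the KernelAbstractions.jl route (assuming the v0.9-style API; relu_kernel! and relu_ka! are made-up names):

```julia
using KernelAbstractions

# One work-item per element; @Const marks x as read-only.
@kernel function relu_kernel!(y, @Const(x))
    i = @index(Global)
    @inbounds y[i] = ifelse(x[i] > 0, x[i], zero(x[i]))
end

# get_backend picks CPU(), CUDABackend(), ROCBackend(), ... from x itself,
# so the same code launches on whatever device holds the data.
function relu_ka!(y, x)
    backend = get_backend(x)
    relu_kernel!(backend)(y, x; ndrange = length(x))
    KernelAbstractions.synchronize(backend)
    return y
end
```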