CUDAnative is awesome!

Great, happy to hear CUDAnative has been useful and performs well :slight_smile:

I have also enjoyed using CuArrays and broadcasting, but I did not obtain the same level of performance (probably due to my limited skills).

Probably not; I’ve spent significant time optimizing CUDAnative and making sure the generated code quality is competitive, while CuArrays hasn’t seen much optimization…
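For readers comparing the two approaches, here is a minimal sketch of the same SAXPY operation written both ways (assuming a recent CUDAnative/CuArrays; the launch configuration and kernel name are illustrative, not prescribed by either package):

```julia
using CUDAnative, CuArrays

# Hand-written CUDAnative kernel: each thread handles one element.
function saxpy_kernel(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

x = CuArray(rand(Float32, 1024))
y = CuArray(rand(Float32, 1024))

# CUDAnative: explicit launch configuration.
@cuda threads=256 blocks=4 saxpy_kernel(y, 2f0, x)

# CuArrays: the same operation via fused broadcast.
y .+= 2f0 .* x
```

The kernel version gives full control over the launch configuration and memory access pattern, which is where the hand-tuning opportunities mentioned above come from; the broadcast version reaches the same hardware through generic array abstractions.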

I wonder if it would be a good idea to allow running the CUDAnative kernels on the CPU (possibly with automatic multi-threading, and maybe SIMD) when Base.Arrays are used instead of CuArrays. Of course, optimal CPU implementations would probably imply data layout transformations; I wrote a small paper on this topic here.

I considered that in the past, but I’m not sure it’s a smart investment of our (very limited) development manpower, especially with most users nowadays relying on array abstractions, where this is already the case. If you’re really interested in this, it might be better to revive a project like GPUOcelot, which implements the CUDA APIs and provides a PTX->LLVM IR compiler. But it hasn’t seen any development recently.
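To make the idea concrete, here is a hypothetical sketch of what such a CPU fallback could look like. This is not an existing CUDAnative feature; `cpu_launch!` and `saxpy` are invented names, and a real implementation would need to map the full thread/block indexing model, not just a flat 1-D index:

```julia
# Hypothetical sketch: emulating a 1-D kernel launch on the CPU
# with Base.Threads. Each "thread index" i is handled by one loop
# iteration, distributed across Julia threads.
function cpu_launch!(kernel, n, args...)
    Threads.@threads for i in 1:n
        kernel(i, args...)
    end
end

# Kernel-style body, taking the (emulated) thread index explicitly.
saxpy(i, y, a, x) = @inbounds y[i] += a * x[i]

x = rand(Float32, 1024)
y = zeros(Float32, 1024)
cpu_launch!(saxpy, length(y), y, 2f0, x)
```

As the reply above notes, the hard part is not this dispatch loop but generating good CPU code (vectorization, data layout) from kernels written against the GPU programming model.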

Of course, all the CUDA extensions (atomics, GPU intrinsics, …) would be very much appreciated :wink:

Similar trade-off: I prefer to work on CUDAnative, but improving CuArrays reaches more users.
