Yeah, it’s doable. You should inspect the LLVM IR to check whether any exceptions or bad conversions are still lurking. For example, it’s easy to accidentally promote to Int64 or Float64, which inflates register usage. Have a look at https://github.com/JuliaComputing/Training/blob/master/AdvancedGPU/2-2-kernel_analysis_optimization.ipynb; you can check the number of registers by compiling the kernel with launch=false and calling CUDA.registers on the result.
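Roughly, that workflow could look like the sketch below. The two kernels (`scale_promoting!`, `scale_f32!`) and the array sizes are made up for illustration and are not from your code; the only difference between them is a Float64 literal versus a Float32 literal. I'm not claiming any particular register counts here, since that depends on the kernel, the GPU, and the CUDA.jl version.

```julia
using CUDA

# Hypothetical kernel: the literal 2.0 is a Float64, so with Float32 inputs
# the multiplication is promoted to Float64 and truncated again on the store.
function scale_promoting!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = 2.0 * x[i]
    end
    return nothing
end

# Same hypothetical kernel with a Float32 literal, so the math stays in Float32.
function scale_f32!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = 2f0 * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
y = similar(x)

# Print the LLVM IR for the kernel. Things to look for: `double` values,
# `fpext`/`fptrunc` conversions, and calls into exception-reporting code.
@device_code_llvm @cuda launch=false scale_promoting!(y, x)

# Compile without launching, then query the register usage of each kernel.
k_promoting = @cuda launch=false scale_promoting!(y, x)
k_f32 = @cuda launch=false scale_f32!(y, x)

@show CUDA.registers(k_promoting)
@show CUDA.registers(k_f32)
```

Note that `i` is an Int64 in both kernels, because `threadIdx().x` returns an Int32 and the literal `1` is an Int64. That is the integer flavour of the same accidental promotion, and it shows up as `i64` in the IR.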