I’m currently exploring calculations at different precisions (fp32, fp64, double-float) in the financial domain using RTX 4090 GPUs. These GPUs have great fp32 performance, but fp64 performance isn’t so hot because there are only 1/64 as many fp64 units as fp32 units. One natural optimization I wanted to look at was the impact of using i32 indexing. I’m a noob with respect to the internals of CUDA, GPUCompiler and LLVM, but I took this as an opportunity to explore some of the “magic” that happens here. I’m hoping that I’ve understood things correctly.
I saw that @maleadt had implemented 32-bit device arrays in a CUDA.jl branch (https://github.com/JuliaGPU/CUDA.jl/pull/1895), but certain issues arose; see for example https://github.com/JuliaLLVM/LLVM.jl/pull/342.
As @maleadt mentions, the InstCombine optimization pass canonicalizes the index type of getelementptr instructions to the pointer index width given in the datalayout string. You can see this happening in visitGetElementPtrInst in https://llvm.org/doxygen/InstructionCombining_8cpp_source.html.
Furthermore, the pass will emit a cast instruction (a sext or trunc of the index) if the index type doesn’t match the datalayout specification.
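Both effects are easy to see from the Julia side. Below is a toy kernel I threw together purely for illustration; dumping its optimized IR with @device_code_llvm shows the index operands of the getelementptr instructions and, if an Int32 value feeds them under the default 64-bit datalayout, the sext to i64 inserted before them.

```julia
using CUDA

# Toy kernel, just to have some indexing to look at.
function scale!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds y[i] = a * x[i]
    end
    return
end

x = CUDA.rand(Float32, 1024)
y = similar(x)

# Dump the optimized LLVM IR; look at the index operands of the
# `getelementptr` instructions and for any `sext ... to i64` feeding them.
CUDA.@device_code_llvm debuginfo=:none @cuda threads=256 blocks=4 scale!(y, x, 2.0f0)
```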
Correct me if I’m wrong, but doesn’t this suggest that, strictly speaking, it would only be necessary to change the llvm_datalayout string in order to force 32-bit array indexing, and also that mixing 32- and 64-bit indexing isn’t possible without major changes? The latter is probably not advisable in any case.
It also suggests that some care is needed with the types of indexing variables in order to avoid extra cast instructions being emitted. However, the impact may be small; it doesn’t seem to affect the run times of my applications by very much.
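For example (a minimal sketch; it assumes that threadIdx() and friends return Int32, which I believe is the case in recent CUDA.jl versions), keeping every term of the index computation at Int32 avoids promotion to Int64 in the arithmetic itself:

```julia
using CUDA

# Index arithmetic promoted to Int64: the literal 1 is an Int64, so the
# whole computation (and the comparison) is widened.
function axpy64!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds y[i] += a * x[i]
    end
    return
end

# Index arithmetic kept at Int32 (assumes the array length fits in Int32).
function axpy32!(y, x, a)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= Int32(length(x))
        @inbounds y[i] += a * x[i]
    end
    return
end
```

Whether the second version actually indexes memory with a 32-bit offset still depends on the datalayout’s index width, as described above; keeping the arithmetic at Int32 only avoids the extra widening work before the getelementptr.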
I also looked at the impact on run times of adding an index type parameter to CuDeviceArray. The difference between i32 and i64 device arrays seemed minimal in my application.
My conclusion from this was that only changing the pointer index type made a real difference in performance: for me it was 10–15%, although I wasn’t very strict about how the timings were taken. I confess that greater gains came from reducing thread divergence within warps (in my case simply by sorting the input data).
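For completeness, this is the kind of divergence I mean. The kernel, branch and threshold below are made up purely for illustration, and in a real application you would also need to keep track of the permutation (e.g. via sortperm); the point is just that sorting on the host groups similar values into the same warps, so neighbouring threads take the same branch.

```julia
using CUDA

# Stand-ins for a cheap and an expensive branch.
cheap_path(v) = v
expensive_path(v) = sqrt(v) + log1p(v)

# Which branch a thread takes depends on its input value, so unsorted
# inputs cause threads within a warp to diverge.
function branchy!(out, vals, threshold)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(vals)
        @inbounds v = vals[i]
        @inbounds out[i] = v > threshold ? expensive_path(v) : cheap_path(v)
    end
    return
end

vals = sort(rand(Float32, 1_000_000))        # sort on the host first
d_vals = CuArray(vals)
d_out = CUDA.zeros(Float32, length(vals))
@cuda threads=256 blocks=cld(length(vals), 256) branchy!(d_out, d_vals, 0.5f0)
```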
In any case, I was wondering whether it would make sense to include a switch in the @cuda macro that forces 32-bit indexing? The implementation would simply amend the llvm_datalayout string in the ptx.jl file of GPUCompiler.jl.
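For concreteness, here is a rough sketch of what I have in mind. The base datalayout string, the fourth “index width” field of the pointer specification, and how a flag from @cuda would reach this function are all assumptions on my part; treat this as illustrative only, not as the actual GPUCompiler.jl code.

```julia
# Hypothetical sketch of ptx.jl's llvm_datalayout with an opt-in 32-bit
# index width: p:64:64:64 keeps 64-bit pointers, and appending :32 sets
# the pointer *index* width to 32 bits, which is what InstCombine uses
# when canonicalizing getelementptr indices. The base string is approximate.
function llvm_datalayout(target::PTXCompilerTarget; index32::Bool=false)
    ptr = index32 ? "p:64:64:64:32" : "p:64:64:64"
    return "e-$(ptr)-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-" *
           "f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-" *
           "v128:128:128-n16:32:64"
end
```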