I32 indexing

I’m currently exploring calculations at different precisions (fp32, fp64, double-float) in the financial domain using RTX 4090 GPUs. These GPUs have great fp32 performance, but fp64 performance isn’t so hot because there are 1/64 as many fp64 units as fp32 units. One natural optimization I wanted to look at was the impact of using i32 indexing. I’m a noob with respect to the internals of CUDA.jl, GPUCompiler.jl and LLVM, but I took this as an opportunity to explore some of the “magic” that happens there. I’m hoping that I’ve understood things correctly.

I saw that @maleadt had implemented 32-bit device arrays in a CUDA.jl branch (https://github.com/JuliaGPU/CUDA.jl/pull/1895), but certain issues arose; for example, https://github.com/JuliaLLVM/LLVM.jl/pull/342.
As @maleadt mentions, the InstCombine optimization pass rewrites getelementptr instructions so that their indices match the pointer index width specified in the datalayout string. You can see this happening in visitGetElementPtrInst in https://llvm.org/doxygen/InstructionCombining_8cpp_source.html.
Furthermore, the pass will insert a cast instruction (a sext or trunc) whenever an index type doesn’t match that specification.
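As a sanity check, this is easy to see in the IR of a trivial kernel (a sketch I put together, not anything from the packages; it needs a CUDA-capable machine):

```julia
using CUDA

function kernel(a)
    i = threadIdx().x
    @inbounds a[i] = 0f0
    return
end

# The printed IR computes the getelementptr offset in i64, widening any
# narrower index first, per the pointer spec in the default datalayout:
@device_code_llvm debuginfo=:none @cuda kernel(CUDA.zeros(Float32, 32))
```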

Correct me if I’m wrong, but doesn’t this suggest that, strictly speaking, it would only be necessary to change the llvm_datalayout string in order to force 32-bit array indexing, and also that mixing 32- and 64-bit indexing isn’t possible without major changes? The latter is probably not advisable in any case.

It also suggests that care may be needed with the types of indexing variables, to avoid extra cast instructions being emitted. However, this impact may be small; it doesn’t seem to affect the run times of my applications by much.
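A sketch of what I mean (`scale!` is just an illustration; it assumes arrays with fewer than 2^31 elements, and that threadIdx/blockIdx/blockDim return Int32, as they do in recent CUDA.jl versions):

```julia
using CUDA

# Keep the whole index computation in Int32 so the optimizer never has to
# widen or narrow the index around the address calculation.
function scale!(y, a)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(y) % Int32    # `% Int32` truncates the Int64 length
        @inbounds y[i] *= a
    end
    return
end

y = CUDA.rand(Float32, 1024)
@cuda threads=256 blocks=4 scale!(y, 2f0)
```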

I also looked at the impact on run times of adding an index type parameter to CuDeviceArray. The difference between i32 and i64 device arrays seemed minimal in my application.
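Roughly, the branch gives the device array an index type parameter, in the spirit of this sketch (not the actual PR code; the names are made up):

```julia
using CUDA

# Carry an index type I as a type parameter so device-side sizes and indices
# are stored and computed in I (e.g. Int32) rather than always in Int.
struct MyDeviceArray{T,N,I<:Integer}
    ptr::Core.LLVMPtr{T,CUDA.AS.Global}
    dims::NTuple{N,I}
end

# `prod` of an NTuple{N,Int32} stays Int32, so the length is an I, not an Int.
Base.length(a::MyDeviceArray{T,N,I}) where {T,N,I} = prod(a.dims)
```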

My conclusion from this was that only changing the pointer index type made a real difference to performance. For me it was a difference of 10-15%, although I wasn’t very rigorous about how the timings were made. I confess that greater gains came from reducing thread divergence within warps (in my case, simply by sorting the input data, as sketched below).
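Schematically, the divergence fix was something like this (the kernel and predicate are made up): threads in a warp execute in lockstep, so a data-dependent branch where the 32 lanes disagree runs both sides serially, and sorting the inputs first makes the branch condition nearly uniform within each warp.

```julia
using CUDA

xs = rand(Float32, 1_000_000)
d_xs = CuArray(sort(xs))    # sort on the host, then upload
# ... @cuda launch on d_xs: warps now take mostly the same path through
#     any `if x < strike ... else ... end` style branch
```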

In any case, I was wondering if it would make sense to include a switch in the @cuda macro that forces 32-bit indexing? The implementation would simply amend the llvm_datalayout string in GPUCompiler.jl’s ptx.jl, along the lines of the sketch below.
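To sketch the idea (the real strings live in GPUCompiler.jl’s src/ptx.jl and will differ from what I show here; `index32` is a hypothetical flag that the @cuda switch would set): LLVM’s pointer layout spec is `p[n]:<size>:<abi>[:<pref>[:<idx>]]`, so adding a fourth field of 32 keeps 64-bit pointers but declares a 32-bit index width.

```julia
# Hypothetical sketch, not GPUCompiler.jl's actual code. The only change is
# the fourth field of the pointer spec (the index width) in the datalayout.
llvm_datalayout(target::PTXCompilerTarget) =
    target.index32 ?
        "e-p:64:64:64:32-i64:64-i128:128-v16:16-v32:32-n16:32:64" :
        "e-p:64:64:64-i64:64-i128:128-v16:16-v32:32-n16:32:64"
```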

Integer performance is mostly unrelated to floating-point performance. I can’t find hard numbers, but I wouldn’t expect i64 throughput to be 1/64th of i32 throughput.

There are lots of hard-coded assumptions in Julia that result in the indexing machinery exclusively using Int, which is 64 bits on any system CUDA supports. Making sure CuDeviceArray uses 32-bit indices consistently would probably require fixing or redefining lots of additional code paths; it’s not a simple toggle. So the change, as it stands, is probably not indicative of what using 32-bit indices throughout would accomplish.
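For example, Base’s indexing machinery converts integer indices to Int before anything else happens (on a 64-bit system):

```julia
julia> typeof(Base.to_index(Int32(3)))
Int64

julia> typeof(eachindex(rand(Float32, 4)))
Base.OneTo{Int64}
```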

It may be possible to re-land the data layout change we attempted in https://github.com/JuliaGPU/GPUCompiler.jl/pull/444 (“PTX: Default to 32-bit indexing of pointers”). IIRC the issues related to that should be fixed in recent LLVM versions, but care should be taken, as this deviates from the recommended (default) data layout and could expose additional issues in LLVM.