# What is the recommended type <: Integer to use when doing index arithmetic?

Hi,

I have to compute a linear index in a CUDAnative kernel. I looked at the source of `pointer.jl` in `CUDAnative.jl` but couldn't find a clear answer. It seems to me that `unsafe_cached_load` deals with integers carefully, but it is a delicate issue.

My question is about the recommended integer type to pass to the CUDAnative kernel for computation. For example, in the kernel I am currently optimizing, I pass the IJV representation of a sparse matrix directly into the kernel, using the index lists `I` and `J` (obtained from `findnz`) to drive the global memory accesses. At first I naively used `Int64`, but after profiling I found this was a poor choice. According to `indexing.jl` lines 41–61

```julia
for dim in (:x, :y, :z)
    # Thread index
    fn = Symbol("threadIdx_$dim")
    intr = Symbol("tid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_block_size[dim]-1)))) + 1

    # Block size (#threads per block)
    fn = Symbol("blockDim_$dim")
    intr = Symbol("ntid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_block_size[dim]))))

    # Block index
    fn = Symbol("blockIdx_$dim")
    intr = Symbol("ctaid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_grid_size[dim]-1)))) + 1

    # Grid size (#blocks per grid)
    fn = Symbol("gridDim_$dim")
    intr = Symbol("nctaid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_grid_size[dim]))))
end
```

I think `Int` would be the better choice, since these builtins all return `Int`. On the other hand, looking at `unsafe_cached_load` in `pointer.jl` (reached from `ldg()`), I think I could use almost anything. I am fairly sure that when `@cuda` generates the kernel code, it won't convert types on its own; it translates the code literally. So if I pass the `I`, `J`, `V` lists to the kernel, the actual types involved in the index arithmetic depend on `eltype(I)` and `eltype(J)`. What is the recommended type to choose for fewer type conversions and (probably) better performance?
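If the conclusion is that 32-bit indices are preferable, one option is to narrow the `findnz` output on the host before uploading. A minimal sketch (the `CuArray`/`@cuda` lines are commented out and only indicate the assumed intent; they are not from the original post):

```julia
using SparseArrays  # for sprand/findnz

A = sprand(100, 100, 0.05)
I64, J64, V = findnz(A)  # index lists come back as Vector{Int}

# Narrow the index eltype on the host; Int32 is safe as long as the
# matrix dimensions fit in typemax(Int32).
I = Int32.(I64)
J = Int32.(J64)

# Hypothetical device upload and launch:
# d_I, d_J, d_V = CuArray(I), CuArray(J), CuArray(V)
# @cuda threads=... kernel(d_I, d_J, d_V, ...)
```

This way `eltype(I)` and `eltype(J)` are `Int32` inside the kernel, so all index arithmetic derived from them stays 32-bit.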

Update 1:

By index arithmetic I mean something like

```julia
P_LIN = (iωP-1+(blockIdx().x-1)*dimΩP)*dimH*dimH
Q_LIN = (iωQ-1+(blockIdx().x-1)*dimΩQ)*dimH*dimH
X_LIN = (blockIdx().y-1)*dimX1
Y_LIN = (blockIdx().z-1)*dimY1
```

Update 2:

According to this post,
https://devtalk.nvidia.com/default/topic/994172/cuda-programming-and-performance/how-to-tell-if-gpu-cores-are-actually-32-64-bit-processors/post/5086094/#5086094
if I am sure about the actual range of `P_LIN`, `Q_LIN`, `X_LIN`, `Y_LIN`, I'd better use `UInt32`, right? Will this actually emit the correct "integer instructions" (I guess `IADD32I`)?

For portability across different sorts of GPUs, and in the service of performance, you want to use a 32-bit integer type. Whether that is better given as a `UInt32` or an `Int32` depends on how these things are handled on the GPU side, rather than being something to optimize for on the Julia side. Julia is very good about specializing on, e.g., whether something is an `Int32` or a `UInt32` for purposes of dispatch. If the GPU is munching on unsigned indices, give it `UInt32`s, and vice versa.

Do not use an `Int` or a `UInt` for this purpose: their width follows the host platform, 32-bit on 32-bit systems and 64-bit on 64-bit systems. You want more specificity when interacting with GPUs (irrespective of the host processor's bit width, and also irrespective of the GPU's nominal bit width for nonvectorized, non-SIMD instructions).
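One subtlety worth noting alongside this: even with `UInt32` operands, a bare literal like `1` is an `Int` (64-bit on most hosts), so mixing it in promotes the whole expression back to 64 bits. A host-side sketch, using the (assumed) names from Update 1:

```julia
# Hypothetical sketch: keep every operand UInt32 so the whole offset
# computation stays 32-bit. The literal 1 is annotated explicitly,
# because `UInt32(3) - 1` would promote to a 64-bit result.
function lin_offset(bx::UInt32, iωP::UInt32, dimΩP::UInt32, dimH::UInt32)
    one32 = UInt32(1)
    (iωP - one32 + (bx - one32) * dimΩP) * dimH * dimH
end

lin_offset(UInt32(2), UInt32(3), UInt32(4), UInt32(5))  # result stays UInt32
```

Whether this maps onto the 32-bit integer instructions you're after is ultimately up to the PTX/SASS the compiler emits, but at least no 64-bit arithmetic is introduced at the Julia level.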

Thanks Jeffrey.
I realized that `UInt` is tricky; `UInt32` keeps the code portable, so that's what I'll use.

btw. the diplomatic language is quite Nvidian …

Generally, you use `Int`.

`unsafe_cached_load` is a low-level function, but it accepts any integer because of the conversion on its last line, `Int(i-one(i))`. In most cases, however, these functions are called via indexing, e.g. `getindex` calling `unsafe_load`, but in the case of `ldg` that isn't integrated yet. We probably need a separate device array type, a type var, or a `getindex` kwarg.

Note that using plain `Int`s might result in slightly lower performance, but I've decided to go down that road in anticipation of Cassette being able to rewrite `Int` to, e.g., `Int32`. See Int64 literals vs Int32 constants: avoid conversions & checks · Issue #74 · JuliaGPU/CUDA.jl · GitHub. At the same time, I've seen LLVM do a better job of optimizing away redundant -1/+1 arithmetic when everything is a plain `Int`, so YMMV.
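For later readers, the literal-promotion issue behind that GitHub issue is easy to demonstrate on the host (assuming a 64-bit machine):

```julia
# A bare literal 1 is an Int (Int64 on 64-bit hosts), so mixing it with
# an Int32 promotes the result back to 64 bits; an annotated literal doesn't.
x = Int32(7)
typeof(x + 1)         # Int64 on a 64-bit host
typeof(x + Int32(1))  # Int32
```

This is exactly the kind of rewrite (`1` → `Int32(1)`) that a Cassette pass could apply automatically inside kernels.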