What is the recommended type <: Integer to use when doing index arithmetic?




I have to compute the linear index in a CUDAnative kernel. I looked at pointer.jl in the CUDAnative.jl source but couldn’t find a clear answer. It seems to me that unsafe_cached_load handles integers carefully, but it is a delicate issue.

My question is about the recommended integer type to pass TO THE CUDANATIVE KERNEL for computation. For example, in the kernel I am currently optimizing, I pass the IJV representation of a sparse matrix directly into the kernel and use the index lists I and J, which are obtained from findnz, to direct the global memory accesses. At first, I naively used Int64, but after profiling I found this was a bit stupid. According to indexing.jl, lines 41-61:

for dim in (:x, :y, :z)
    # Thread index
    fn = Symbol("threadIdx_$dim")
    intr = Symbol("tid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_block_size[dim]-1)))) + 1

    # Block size (#threads per block)
    fn = Symbol("blockDim_$dim")
    intr = Symbol("ntid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_block_size[dim]))))

    # Block index
    fn = Symbol("blockIdx_$dim")
    intr = Symbol("ctaid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_grid_size[dim]-1)))) + 1

    # Grid size (#blocks per grid)
    fn = Symbol("gridDim_$dim")
    intr = Symbol("nctaid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_grid_size[dim]))))
end

I think it would be better to use Int. On the other hand, judging from unsafe_cached_load in pointer.jl (which I traced back from ldg()), I think I could use anything :frowning: I am sure that when @cuda generates the kernel code, it won’t convert types on its own; it translates the code literally. So if I pass the I, J, V lists to the kernel, the actual type involved in the index arithmetic depends on eltype(I) and eltype(J). Which type is then recommended for fewer type conversions and (probably) better performance?
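To make the eltype point concrete, here is a minimal host-side sketch of converting the findnz index lists once before they are handed to a kernel. The matrix and the names rows, cols, vals (standing in for I, J, V) are hypothetical, and whether 32-bit indices actually help depends on the kernel.

```julia
using SparseArrays

# Small example matrix (hypothetical data, for illustration only)
A = sparse([1, 2, 3], [1, 3, 2], [10.0, 20.0, 30.0], 3, 3)

# findnz returns index vectors whose eltype is the matrix's index type
# (Int for a matrix built from Int index vectors, as here)
rows, cols, vals = findnz(A)

# Check that every index fits in 32 bits, then convert once on the host
@assert max(size(A)...) <= typemax(Int32)
rows32 = Int32.(rows)
cols32 = Int32.(cols)

println(eltype(rows), " -> ", eltype(rows32))
```

The conversion happens a single time on the host, so the kernel sees whatever eltype is chosen here.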

Update 1:

By index arithmetic I mean something like

P_LIN = (iωP-1+(blockIdx().x-1)*dimΩP)*dimH*dimH
Q_LIN = (iωQ-1+(blockIdx().x-1)*dimΩQ)*dimH*dimH
X_LIN = (blockIdx().y-1)*dimX1
Y_LIN = (blockIdx().z-1)*dimY1
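One thing worth keeping in mind with expressions like these is Julia's promotion rule: mixing a 32-bit value with an Int literal or an Int-returning call widens the result to Int. The sketch below runs on the host only; bx and dimX are hypothetical stand-ins for blockIdx().x and dimX1, not the real intrinsics.

```julia
# Hypothetical stand-ins for the kernel's values, held as Int32
bx   = Int32(3)     # plays the role of blockIdx().x
dimX = Int32(128)   # plays the role of dimX1

# The literal 1 is an Int, so the result is promoted to Int
# (Int64 on a 64-bit host)
wide = (bx - 1) * dimX

# one(bx) has the same type as bx, so the arithmetic stays in Int32
narrow = (bx - one(bx)) * dimX

println(typeof(wide), " ", typeof(narrow))
```

Since the intrinsics quoted earlier wrap their result in Int(...), any expression that involves them would be promoted to Int in the same way, regardless of the eltype of the arrays passed in.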

Update 2:

According to this post, if I am sure about the actual range of P_LIN, Q_LIN, X_LIN, Y_LIN, I’d better use UInt32, right? Can this actually invoke the correct “Integer Instructions” (I guess it is IADD32I)?


For portability across different kinds of GPUs, and for performance, you want to use a 32-bit integer type. Whether that is better given as a UInt32 or an Int32 depends on how these are handled on the GPU side rather than being something to optimize for Julia. We are very good about specializing on, e.g., whether something is an Int32 or a UInt32 for purposes of dispatch. If the GPU is munching on unsigned indices, give it UInt32s, and vice versa.

Do not use an Int or a UInt for this purpose. Each is an alias for either the 32-bit or the 64-bit type, depending on the host platform. You want more specificity when interacting with GPUs (irrespective of the host processor’s bit width and also irrespective of the GPU’s nominal bit width for non-vectorized, non-SIMD instructions).
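A quick way to see the alias behaviour on a given host (a small sketch using only Base):

```julia
# Int and UInt are aliases selected by the host's word size
println("word size: ", Sys.WORD_SIZE)
println("Int  is ", Int)
println("UInt is ", UInt)

# Fixed-width types are the same size on every host
println(sizeof(Int32), " ", sizeof(UInt32))
```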


Thanks Jeffrey.
I realized that UInt is tricky. UInt32 makes the code portable, that’s what I want to use.

btw. the diplomatic language is quite Nvidian … :smiley:


Generally, you use Int.

unsafe_cached_load is a low-level function, but it accepts all integers because of the conversion on its last line, Int(i-one(i)). In most cases, however, these functions are called via indexing, e.g. getindex -> unsafe_load, but ldg isn’t integrated that way yet. We probably need a separate device array type, a typevar, or a getindex kwarg.
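As a rough host-side illustration of why that conversion accepts any integer type, the sketch below applies the same expression to a few index types. The helper name zero_based_offset is hypothetical and is not CUDAnative's actual function.

```julia
# Hypothetical helper mirroring the Int(i - one(i)) conversion quoted above
zero_based_offset(i::Integer) = Int(i - one(i))

for i in (Int32(5), UInt32(5), Int64(5), UInt8(5))
    off = zero_based_offset(i)
    println(typeof(i), " -> ", off, " :: ", typeof(off))
end
```

Whatever integer type goes in, the offset comes out as an Int, so the index type chosen by the caller only affects the arithmetic done before this point.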

Note that using plain Ints might result in slightly lower performance, but I’ve decided to go down that road in anticipation of Cassette being able to rewrite Int to e.g. Int32. See https://github.com/JuliaGPU/CUDAnative.jl/issues/25#issuecomment-359155011. At the same time, I’ve seen LLVM being better able to optimize redundant -1/+1 arithmetic away when everything is just an integer, so YMMV.