What is the recommended type <: Integer to use for index arithmetic?

Hi,

I have to compute linear indices in a CUDAnative kernel. I looked at pointer.jl in the CUDAnative.jl source but couldn't find an answer. It seems that unsafe_cached_load handles integers carefully, but it is a delicate issue.

My question is about the recommended integer type to pass to the CUDAnative kernel for computation. For example, in the kernel I am currently optimizing, I pass the IJV representation of a sparse matrix directly into the kernel and drive the global memory accesses through the index lists I and J obtained from findnz. At first I naively used Int64, but profiling showed that was a poor choice. According to indexing.jl, lines 41-61:

for dim in (:x, :y, :z)
    # Thread index
    fn = Symbol("threadIdx_$dim")
    intr = Symbol("tid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_block_size[dim]-1)))) + 1

    # Block size (#threads per block)
    fn = Symbol("blockDim_$dim")
    intr = Symbol("ntid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_block_size[dim]))))

    # Block index
    fn = Symbol("blockIdx_$dim")
    intr = Symbol("ctaid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_grid_size[dim]-1)))) + 1

    # Grid size (#blocks per grid)
    fn = Symbol("gridDim_$dim")
    intr = Symbol("nctaid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_grid_size[dim]))))
end

I think it had better be Int. On the other hand, judging from unsafe_cached_load in pointer.jl (traced back from ldg()), it seems I could use anything :frowning: I am fairly sure that when @cuda generates the kernel code it does not insert conversions on its own; it translates the code literally. So if I pass the I, J, V lists to the kernel, the actual types involved in the index arithmetic depend on eltype(I) and eltype(J). What, then, is the recommended type to choose for fewer type conversions and (probably) better performance?
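To be concrete about what gets passed, here is a minimal host-side sketch (hypothetical matrix) showing how the element type I choose is the one the kernel inherits:

using SparseArrays

A = sprand(1000, 1000, 0.01)
I, J, V = findnz(A)              # I, J are Vector{Int}, i.e. Int64 on a 64-bit host
I32, J32 = Int32.(I), Int32.(J)  # narrowed 32-bit copies to hand to the kernel
eltype(I32)                      # Int32: the width the kernel's index arithmetic inherits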

Update 1:

By index arithmetic I mean something like

# Linear base offsets computed from the block indices and problem-size constants:
P_LIN = (iωP-1+(blockIdx().x-1)*dimΩP)*dimH*dimH
Q_LIN = (iωQ-1+(blockIdx().x-1)*dimΩQ)*dimH*dimH
X_LIN = (blockIdx().y-1)*dimX1
Y_LIN = (blockIdx().z-1)*dimY1
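One detail I noticed while experimenting: blockIdx() returns Int (see the quoted indexing.jl above), so mixing it with Int32 arguments promotes the whole expression back to Int64 anyway. A tiny illustration of Julia's promotion rules:

typeof(Int32(7) * (3 - 1))       # Int64 on a 64-bit host: one Int widens everything
typeof(Int32(7) * Int32(3 - 1))  # Int32: only if every operand is narrowed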

Update 2:

According to this post,

if I am sure about the actual ranges of P_LIN, Q_LIN, X_LIN, and Y_LIN, I'd better use UInt32, right? Will this actually emit the correct integer instructions (IADD32I, I guess)?

For portability across different sorts of graphics processors, and in service of performance, you want to use a 32-bit integer type. Whether that is better given as a UInt32 or an Int32 depends on how these things are handled on the GPU side, rather than being something to optimize for in Julia. We are very good about specializing on, e.g., whether something is an Int32 or a UInt32 for purposes of dispatch. If the GPU is munching on unsigned indices, give it UInt32s, and vice versa.

Do not use Int or UInt for this purpose: they are 32-bit types on some platforms and 64-bit types on others. You want more specificity when interacting with GPUs (irrespective of the host processor's bit width, and also irrespective of the GPU's nominal bit width for non-vectorized, non-SIMD instructions).
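To make the platform dependence concrete:

Sys.WORD_SIZE   # 64 on a 64-bit host, 32 on a 32-bit host
Int === Int64   # true on a 64-bit host; Int === Int32 on a 32-bit one
sizeof(Int)     # 8 or 4 bytes, following the host word size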

Thanks Jeffrey.
I realize now that UInt is tricky. UInt32 keeps the code portable, so that's what I will use.

BTW, the diplomatic language is quite Nvidian… :smiley:

Generally, you use Int.

unsafe_cached_load is a low-level function, but it accepts any integer because of the conversion on its last line, Int(i-one(i)). In most cases, however, these functions are called via indexing, e.g. getindex → unsafe_load, but in the case of ldg that isn't integrated yet. We probably need a separate device array type, a typevar, or a getindex kwarg.
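A simplified sketch of that pattern (not the actual source): whatever integer width the caller uses collapses to Int up front, so it doesn't matter past that point.

# Any Integer index is normalized to a zero-based Int offset once:
offset(i::Integer) = Int(i - one(i))
offset(Int32(5)) === offset(UInt64(5)) === offset(5) === 4   # true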

Note that using plain Ints might result in slightly lower performance, but I've decided to go down that road in anticipation of Cassette being able to rewrite Int to, e.g., Int32. See Int64 literals vs Int32 constants: avoid conversions & checks · Issue #74 · JuliaGPU/CUDA.jl · GitHub. At the same time, I've seen LLVM do a better job of optimizing away redundant -1/+1 arithmetic when everything is a plain Int, so YMMV.
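For instance, in the canonical global-index computation, the +1 baked into threadIdx (see the quoted indexing.jl) cancels against the -1 that 1-based getindex applies internally, and LLVM folds that pair more reliably when no conversions sit in between:

# threadIdx() returns a 1-based Int; indexing subtracts 1 again for the
# zero-based pointer offset, so the +1/-1 pair folds away after inlining:
i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
# @inbounds out[i] = ...   (i - 1 == (blockIdx().x-1)*blockDim().x + hardware tid)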