Hi,
I need to compute a linear index in a CUDAnative kernel. I went through the source file pointer.jl in CUDAnative.jl but wasn’t able to find a clear answer. It seems to me that unsafe_cached_load handles integer types carefully, but this is a delicate issue.
My question is about the recommended integer type to pass to the CUDAnative kernel for index computation. For example, in the kernel I am currently optimizing, I pass the IJV representation of a sparse matrix directly into the kernel, and the index lists I and J (obtained from findnz) drive the global memory accesses. At first I naively used Int64, but profiling showed that this was a poor choice. According to indexing.jl, lines 41-61:
for dim in (:x, :y, :z)
    # Thread index
    fn = Symbol("threadIdx_$dim")
    intr = Symbol("tid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_block_size[dim]-1)))) + 1

    # Block size (#threads per block)
    fn = Symbol("blockDim_$dim")
    intr = Symbol("ntid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_block_size[dim]))))

    # Block index
    fn = Symbol("blockIdx_$dim")
    intr = Symbol("ctaid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_grid_size[dim]-1)))) + 1

    # Grid size (#blocks per grid)
    fn = Symbol("gridDim_$dim")
    intr = Symbol("nctaid.$dim")
    @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_grid_size[dim]))))
end
Based on this, I think Int would be the better choice. On the other hand, judging from unsafe_cached_load in pointer.jl (which is what ldg() ends up calling), I think I could use any integer type.
I am sure that when @cuda generates the kernel code, it does not convert types on its own; it translates the code literally. So if I pass the I, J, V lists to the kernel, the actual type involved in the index arithmetic depends on eltype(I) and eltype(J). Which type is recommended, then, for fewer type conversions and (probably) better performance?
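To make the eltype question concrete, here is a minimal CPU-only sketch of how the element type of the index lists decides the type of the arithmetic. It uses only Julia's standard promotion rules and the SparseArrays stdlib; the names rows, cols and vals are stand-ins for my I, J and V, and nothing here runs on the GPU or says anything about the generated device code.

```julia
using SparseArrays

# Small stand-in matrix; rows, cols, vals play the role of I, J, V.
A = sparse([1, 2, 3], [1, 2, 3], [1.0, 2.0, 3.0])
rows, cols, vals = findnz(A)     # index vectors have eltype Int here

# Narrow the index lists on the host before passing them to a kernel.
rows32 = Int32.(rows)
cols32 = Int32.(cols)

# The type of each result follows from the operand types alone.
t_same  = typeof(rows32[1] + cols32[1])    # Int32 + Int32
t_mixed = typeof(rows32[1] + 1)            # Int32 + Int literal widens to Int
t_kept  = typeof(rows32[1] + Int32(1))     # stays Int32
```

So even with Int32 index lists, a bare literal such as 1 silently widens the expression back to Int, which is the kind of hidden conversion I would like to avoid.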
Update 1:
By index arithmetic I mean something like:
P_LIN = (iωP-1+(blockIdx().x-1)*dimΩP)*dimH*dimH
Q_LIN = (iωQ-1+(blockIdx().x-1)*dimΩQ)*dimH*dimH
X_LIN = (blockIdx().y-1)*dimX1
Y_LIN = (blockIdx().z-1)*dimY1
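As far as I can tell, the type of such an expression is decided by promotion. The following CPU-only sketch mimics X_LIN with a hypothetical stand-in, fake_blockIdx_y, which returns Int just as the quoted definitions wrap their result in Int(...). The value of dimX1 is made up, and the sketch does not show what the compiler emits for the device.

```julia
# Hypothetical stand-in for blockIdx().y; it returns Int like the quoted definitions.
fake_blockIdx_y() = Int(3)

dimX1 = Int32(128)   # assumed to be passed to the kernel as Int32

# Written as in the post: the Int block index widens the whole expression.
x_wide = (fake_blockIdx_y() - 1) * dimX1

# Narrowing first keeps everything in Int32. `% Int32` truncates without a
# range check, so it is only valid when the value is known to fit.
x_narrow = (fake_blockIdx_y() % Int32 - Int32(1)) * dimX1
```

Both variants give 256 here, but x_wide is an Int while x_narrow is an Int32.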
Update 2:
According to this post, if I am sure about the actual range of P_LIN, Q_LIN, X_LIN and Y_LIN, I’d better use UInt32, right? Would this actually produce the correct “Integer Instructions” (I guess it is IADD32I)?
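One thing I noticed while trying UInt32 on the CPU is that the "-1" terms in the index arithmetic interact with unsigned types. The sketch below relies only on Julia's promotion and wrap-around rules; whether the GPU ends up using any particular instruction is exactly what I do not know.

```julia
# Promotion between same-width signed and unsigned integers yields the unsigned type.
t_u32_i32 = promote_type(UInt32, Int32)

# Mixing UInt32 with a bare literal (an Int) promotes according to Int's width.
t_u32_lit = typeof(UInt32(5) - 1)

# Unsigned subtraction wraps around instead of going negative.
wrapped = UInt32(0) - UInt32(1)

# A 1-based index minus one is still fine as long as the index is at least 1.
ok = UInt32(1) - UInt32(1)
```

So with UInt32, every constant in expressions such as iωP-1 would have to be written as UInt32(1) to stay in 32 bits, and an index of zero would wrap to typemax(UInt32) rather than raise an error.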