Hi,
I need to compute linear indices in a CUDAnative kernel. I looked through `pointer.jl` in CUDAnative.jl but could not find a clear answer. It seems to me that `unsafe_cached_load` handles integers carefully, but this is a delicate issue.
My question is about the recommended integer type to pass to the CUDAnative kernel for index computation. For example, in the kernel I am currently optimizing, I pass the IJV representation of a sparse matrix directly into the kernel, and the index lists `I` and `J` (obtained from `findnz`) drive the global memory accesses. At first I naively used `Int64`, but after profiling I found this was a poor choice. According to `indexing.jl`, lines 41-61,
    for dim in (:x, :y, :z)
        # Thread index
        fn = Symbol("threadIdx_$dim")
        intr = Symbol("tid.$dim")
        @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_block_size[dim]-1)))) + 1
        # Block size (#threads per block)
        fn = Symbol("blockDim_$dim")
        intr = Symbol("ntid.$dim")
        @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_block_size[dim]))))
        # Block index
        fn = Symbol("blockIdx_$dim")
        intr = Symbol("ctaid.$dim")
        @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(0:max_grid_size[dim]-1)))) + 1
        # Grid size (#blocks per grid)
        fn = Symbol("gridDim_$dim")
        intr = Symbol("nctaid.$dim")
        @eval @inline $fn() = Int(_index($(Val(intr)), $(Val(1:max_grid_size[dim]))))
    end
I think it would be better to use `Int`. On the other hand, judging from `unsafe_cached_load` in `pointer.jl` (which I traced back from `ldg()`), I think I could use any integer type. I am sure that when `@cuda` generates the kernel code it does not convert types on its own; it translates the code literally. So if I pass the `I`, `J`, `V` lists to the kernel, the actual type involved in the index arithmetic depends on `eltype(I)` and `eltype(J)`. Which type is recommended, then, for fewer type conversions and (probably) better performance?
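To make the concern concrete, here is a minimal CPU-only sketch of how Julia's promotion rules decide the type of mixed-width index arithmetic. The names `I32` and `blk` are hypothetical stand-ins (an `Int32` index list and a block index wrapped in `Int(...)` as in the quoted source). It says nothing about what the GPU compiler ultimately emits.

```julia
# CPU-only sketch of integer promotion; no GPU or CUDAnative required.
I32 = Int32[1, 5, 9]        # hypothetical index list stored as Int32
blk = Int(3)                # stand-in for blockIdx().x, which the quoted source wraps in Int(...)

a = I32[1] - 1              # Int32 mixed with the literal 1 (an Int): promoted to Int
b = I32[1] - one(Int32)     # both operands are Int32: result stays Int32
c = I32[1] + blk            # Int32 mixed with an Int: promoted to Int

result_types = (typeof(a), typeof(b), typeof(c))
```

So even with a narrow `eltype(I)`, a bare literal such as `1` or an `Int`-typed block index widens the whole expression unless every operand is converted explicitly.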
Update 1:
By index arithmetic I mean something like:

    P_LIN = (iωP-1+(blockIdx().x-1)*dimΩP)*dimH*dimH
    Q_LIN = (iωQ-1+(blockIdx().x-1)*dimΩQ)*dimH*dimH
    X_LIN = (blockIdx().y-1)*dimX1
    Y_LIN = (blockIdx().z-1)*dimY1
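One way to keep such an expression in a single integer type is to convert every operand up front. The sketch below is CPU-only and assumes a hypothetical helper `linear_offset` (not part of CUDAnative); the argument `blk` stands in for `blockIdx().x`.

```julia
# Hypothetical helper: evaluate the P_LIN-style offset entirely in integer type T.
function linear_offset(::Type{T}, iω, blk, dimΩ, dimH) where {T<:Integer}
    o = one(T)  # a typed 1, so that subtracting it does not promote to Int
    return (T(iω) - o + (T(blk) - o) * T(dimΩ)) * T(dimH) * T(dimH)
end

off32 = linear_offset(Int32, 2, 3, 4, 5)   # all arithmetic done in Int32
off64 = linear_offset(Int64, 2, 3, 4, 5)   # same formula in Int64
```

Both calls give the same value here; only the type of the result differs. Whether the narrower type is actually faster on the device would have to be checked by profiling or by inspecting the generated code, for example with CUDAnative's `@device_code_ptx` or `@device_code_sass`.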
Update 2:
According to this post, if I am sure about the actual range of `P_LIN`, `Q_LIN`, `X_LIN`, and `Y_LIN`, I had better use `UInt32`, right? Would this actually invoke the correct “Integer Instructions” (I guess `IADD32I`)?
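Before narrowing to `UInt32`, the range assumption can be checked on the host. The sketch below uses hypothetical sizes (`dimΩP`, `dimH`, and `nblocks` are made-up values for illustration) and also shows why the check matters: unsigned arithmetic in Julia wraps silently instead of raising an error.

```julia
# Hypothetical problem sizes, chosen only for illustration.
dimΩP, dimH, nblocks = 64, 32, 1024

# Largest value P_LIN can take, computed in Int on the host.
max_P_LIN = (dimΩP - 1 + (nblocks - 1) * dimΩP) * dimH * dimH
fits = max_P_LIN <= typemax(UInt32)

# Unsigned arithmetic wraps around silently rather than throwing:
wrapped = UInt32(0) - UInt32(1)
```

With these sizes the offset fits comfortably. The wraparound line is a reminder that an expression like `blockIdx().x - 1` evaluated in `UInt32` has no negative values to fall back on if an operand is ever zero.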