CUDA.jl tex2DLayered support?

Is there a plan to integrate layered texture support into CUDA.jl? I am currently using a 3D texture in my application, but I suspect I am getting suboptimal performance, since my data structure really is a set of 2D textures.

I believe this would require two modifications:

- The underlying CuTextureArray should be allocated with the CUDA_ARRAY3D_LAYERED flag set. I tested this myself by hacking the alloc() call, and my texture fetch performance improved very slightly (see the driver-API sketch after this list).

- The tex2DLayered() device NVVM calls need to be implemented. This seems less straightforward since, as far as I can tell, the layered texture fetches do not appear in the main NVVM documentation (NVVM IR :: CUDA Toolkit Documentation).
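
For reference, in raw driver-API terms the change I hacked in amounts to roughly the following. This is a minimal sketch rather than what CUDA.jl actually does internally; the 512x512x16 single-channel float layout, the context setup, and the missing error checking are just for illustration:

#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Describe a stack of 512x512 single-channel float layers. For a
    // 2D layered array, Depth holds the layer count, and the
    // CUDA_ARRAY3D_LAYERED flag marks the array as layered rather
    // than as a true 3D array.
    CUDA_ARRAY3D_DESCRIPTOR desc = {};
    desc.Width       = 512;
    desc.Height      = 512;
    desc.Depth       = 16;  // number of layers, not a spatial extent
    desc.Format      = CU_AD_FORMAT_FLOAT;
    desc.NumChannels = 1;
    desc.Flags       = CUDA_ARRAY3D_LAYERED;

    CUarray arr;
    cuArray3DCreate(&arr, &desc);

    cuArrayDestroy(arr);
    cuCtxDestroy(ctx);
    return 0;
}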

Are there any thoughts on whether this is feasible, and/or leads on where to find documentation for the layered calls?

A good way to tackle this is to create a minimal CUDA C implementation and compile it with Clang to reveal which exact intrinsics are being called. Often, there aren’t any intrinsics, and the CUDA headers contain inline PTX assembly. For example:

$ cat test.cu
#include <cstdio>  // declares the device-side printf

__global__ void cuda_hello(){
    printf("Hello World from thread %d!\n", threadIdx.x);
}

int main() {
    cuda_hello<<<1,1>>>();
    return 0;
}


$ clang++-11 --cuda-path=/opt/cuda --cuda-gpu-arch=sm_70 --cuda-device-only -emit-llvm -O0 -S test.cu -o test.ll                                                                             

$ cat test.ll
; (excerpt: string constant, type declarations, and metadata omitted)
; Function Attrs: convergent noinline norecurse nounwind optnone
define dso_local void @_Z10cuda_hellov() #1 {
  %1 = alloca %printf_args, align 8
  %2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #3, !range !6
  %3 = getelementptr inbounds %printf_args, %printf_args* %1, i32 0, i32 0
  store i32 %2, i32* %3, align 4
  %4 = bitcast %printf_args* %1 to i8*
  %5 = call i32 @vprintf(i8* getelementptr inbounds ([29 x i8], [29 x i8]* @.str, i64 0, i64 0), i8* %4)
  ret void
}
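
Applied to the question here, a minimal layered-fetch kernel to feed through the same pipeline could look like the sketch below. The kernel name and launch shape are just illustrative, and the cudaTextureObject_t is assumed to have been created over a layered array on the host; note also that texture-object support in clang's CUDA mode is comparatively recent, so a newer clang may be required:

__global__ void fetch_layered(cudaTextureObject_t tex, float *out, int layer) {
    // One fetch per thread from the requested layer; tex2DLayered is the
    // runtime-API call whose lowering (an intrinsic, or inline PTX such
    // as a tex.a2d instruction) we want to see in the emitted IR.
    float x = threadIdx.x + 0.5f;
    out[threadIdx.x] = tex2DLayered<float>(tex, x, 0.5f, layer);
}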

Thank you for this great answer! I was considering writing a C++ version as a first step anyway, to see if there is any performance benefit from layered textures. I finally got around to doing so and, as it happens, it actually made my code slower for some reason.

In any case, I'm also curious about surface memory, so this method of compiling down to LLVM IR will be super helpful if I want to play around with anything else.
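
For example, a minimal surface-write kernel to run through the same clang invocation might look like this (a sketch, assuming a cudaSurfaceObject_t created over a suitable 2D cudaArray on the host):

__global__ void fill_surface(cudaSurfaceObject_t surf) {
    // Write one float per thread. Note that surf2Dwrite takes its x
    // coordinate in bytes, not in elements.
    int x = threadIdx.x;
    int y = blockIdx.x;
    surf2Dwrite(1.0f, surf, x * (int)sizeof(float), y);
}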