I32 indexing

I’m currently exploring calculations at different precisions (fp32, fp64, double-float) in the financial domain on RTX 4090 GPUs. These GPUs have great fp32 performance, but fp64 performance isn’t so hot because there are 1/64 as many fp64 cores as fp32 cores. One natural optimization I wanted to look at was the impact of using i32 indexing. I’m a noob with respect to the internals of CUDA, GPUCompiler and LLVM, but I took this as an opportunity to explore some of the “magic” that happens here. I’m hoping that I’ve understood things correctly.

I saw that @maleadt had implemented 32-bit device arrays in a CUDA.jl branch (https://github.com/JuliaGPU/CUDA.jl/pull/1895), but certain issues arose, for example https://github.com/JuliaLLVM/LLVM.jl/pull/342.
As @maleadt mentions, the InstCombine optimization pass rewrites getelementptr instructions to match the pointer specification in the datalayout string; you can see this happening in visitGetElementPtrInst in https://llvm.org/doxygen/InstructionCombining_8cpp_source.html.
Furthermore, the pass will emit a cast instruction if the index type doesn’t match the datalayout specification.

Correct me if I’m wrong, but doesn’t this suggest that, strictly speaking, only the llvm_datalayout string needs to change in order to force 32-bit array indexing, and also that mixing 32- and 64-bit indexing isn’t possible without major changes? The latter probably isn’t recommended in any case.

It also suggests that care may be needed with the types of indexing variables in order to avoid extra cast instructions being emitted. However, the impact of this may be small: it doesn’t seem to affect the run times of my applications very much.
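
For example, this is the kind of promotion I mean; a minimal sketch (the kernel is only for illustration, and the emitted IR can be inspected with CUDA.jl’s @device_code_llvm):

using CUDA

function index_types!(A)
    # threadIdx()/blockIdx()/blockDim() return Int32, but mixing in a plain `1`
    # (an Int64 literal on the systems CUDA.jl supports) promotes the whole
    # index expression to Int64, which later shows up as sext/cvt instructions.
    i64 = (blockIdx().x - 1) * blockDim().x + threadIdx().x     # promoted to Int64
    i32 = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x  # stays Int32
    @inbounds A[i32] = Float32(i64)
    return
end

A = CUDA.zeros(Float32, 32)
@device_code_llvm @cuda threads=32 index_types!(A)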

I also looked at the impact on run times of adding an index type parameter to CuDeviceArray. Using i32 rather than i64 device arrays seemed to make minimal difference in my application.

My conclusion from this was that only changing the pointer index type made a real difference in performance. For me it was a difference of 10-15%, although I wasn’t very strict with how timings were made. I confess that greater performance gains were made by reducing thread divergence in warps (in my case simply by sorting input data).

In any case, I was wondering whether it would make sense to include a switch in the @cuda macro that forces 32-bit indexing. The implementation would simply amend the llvm_datalayout string in the ptx.jl file of GPUCompiler.jl.
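
Concretely, something like the following is what I have in mind (a rough sketch, not the actual GPUCompiler.jl source; the relevant mechanism is LLVM’s optional fourth field in the pointer layout entry, p:<size>:<abi>:<pref>:<idx>, which sets the index width that getelementptr indices are canonicalized to):

# Sketch only; the helper below is hypothetical, not existing GPUCompiler.jl code.
const PTR_SPEC_I64 = "p:64:64:64"     # 64-bit pointers, 64-bit indexing (the default)
const PTR_SPEC_I32 = "p:64:64:64:32"  # 64-bit pointers, 32-bit indexing

# A flag on the compiler target could then pick the pointer entry when the
# llvm_datalayout string is assembled in ptx.jl:
ptx_pointer_spec(index32::Bool) = index32 ? PTR_SPEC_I32 : PTR_SPEC_I64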

Integer performance is mostly unrelated to floating-point performance. I can’t find hard numbers, but I wouldn’t expect i64 throughput to be 1/64th of i32 throughput.

There are lots of hard-coded assumptions in Julia that result in the indexing machinery exclusively using Int, which is 64 bits on any system CUDA supports. Making sure CuDeviceArray uses 32-bit indices consistently would probably require fixing or redefining lots of additional code paths; it’s not a simple toggle. So the change, as it stands, is probably not indicative of what using 32-bit indices throughout would accomplish.

It may be possible to re-land the data layout change we attempted in “PTX: Default to 32-bit indexing of pointers” (JuliaGPU/GPUCompiler.jl#444). IIRC the issues related to that should be fixed in recent LLVM versions, but care should be taken, as this deviates from the recommended (default) data layout and could expose additional issues in LLVM.

Thanks for your reply Tim.

I’ve investigated a little deeper, and even though I can generate SASS code that has fewer instructions and registers, I don’t see any performance gains from forcing 32-bit index calculations. I’m wondering if it’s due to the GPU I’m using, or to differences in instruction timing and execution paths. Or perhaps we just shouldn’t expect much of a gain: after running through the LLVM, PTX and SASS code optimizers, how much can really be saved on an address offset calculation in 32-bit vs 64-bit? It all ends up as something like ix+iy+iz+constant (for a 3D array, where ix, iy, iz are Int32 GPU indices).
I also tried generating kernels that resulted in register spilling, and the reduction in registers when forcing 32-bit indexing wasn’t large enough to prevent the spilling, in my examples at least.

Just to be clear, I didn’t only change the data layout in my testing. I used your experimental CuDeviceArray{T,N,A,I} type and the other 32-bit indexing changes. However, I changed the choice of integer type to be based on the number of elements rather than the byte size of the array, and I changed the maxsize and len fields to be Int rather than I, to avoid possible overflow errors. Hopefully I’ve interpreted things correctly. With these changes, I observe PTX code without the early 64-bit conversion and subsequent 64-bit arithmetic. For example, 32-bit indexing:

        mad.lo.s32      %r36, %r13, %r7, %r5;
        mad.lo.s32      %r37, %r36, %r12, %r3;
        mul.wide.s32    %rd16, %r37, 4;
        add.s64         %rd17, %rd4, %rd16;
        ld.global.f32   %f1, [%rd17];
        add.s32         %r38, %r7, -2;
        mad.lo.s32      %r39, %r16, %r38, %r5;
        add.s32         %r40, %r39, -1;
        mad.lo.s32      %r41, %r40, %r15, %r3;
        add.s32         %r42, %r41, -1;
        mul.wide.s32    %rd18, %r42, 4;
        add.s64         %rd19, %rd7, %rd18;
        ld.global.f32   %f2, [%rd19];

vs 64-bit indexing:

        cvt.s64.s32     %rd16, %r12;
        cvt.u64.u32     %rd17, %r3;
        cvt.s64.s32     %rd18, %r13;
        cvt.u64.u32     %rd19, %r5;
        cvt.u64.u32     %rd20, %r7;
        mul.lo.s64      %rd21, %rd18, %rd20;
        add.s64         %rd22, %rd21, %rd19;
        mul.lo.s64      %rd23, %rd22, %rd16;
        add.s64         %rd24, %rd23, %rd17;
        shl.b64         %rd25, %rd24, 2;
        add.s64         %rd26, %rd4, %rd25;
        ld.global.f32   %f1, [%rd26];
        cvt.s64.s32     %rd27, %r15;
        add.s32         %r36, %r3, -1;
        cvt.u64.u32     %rd28, %r36;
        cvt.s64.s32     %rd29, %r16;
        add.s32         %r37, %r5, -1;
        cvt.u64.u32     %rd30, %r37;
        add.s32         %r38, %r7, -2;
        cvt.u64.u32     %rd31, %r38;
        mul.lo.s64      %rd32, %rd29, %rd31;
        add.s64         %rd33, %rd32, %rd30;
        mul.lo.s64      %rd34, %rd33, %rd27;
        add.s64         %rd35, %rd34, %rd28;
        shl.b64         %rd36, %rd35, 2;
        add.s64         %rd37, %rd7, %rd36;
        ld.global.f32   %f2, [%rd37];

Of course, the PTX code gets converted into optimized SASS code, and that’s really all that matters.
For the toy kernel:

function testLargeIndexSize!(A)
    ix = (blockIdx().x-1i32) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1i32) * blockDim().y + threadIdx().y
    iz = (blockIdx().z-1i32) * blockDim().z + threadIdx().z
    
    nx, ny, nz = size(A)

    if ix <= nx && iy <= ny && iz <= nz
        @inbounds A[ix,iy,iz] = @inbounds A[ix,iy,iz] + 1.0f0
    end
    return
end

I show the following comparison of SASS code for 64 bit vs 32 bit indexing:

Despite fewer instructions and registers, run times under 32 bit and 64 bit indexing were the same.
Have I missed something important here?

Edit: It looks like there are three instruction paths: the uniform datapath, the integer path and the FMA path. Different instructions have different throughputs and latencies. The ptxas compiler knows these details and optimizes for speed, so it’s likely that this is all we’re seeing here.

How are you benchmarking kernels? For toy kernels like that, it’s expected that you wouldn’t see a difference. The one you show here is fully memory bound, so improving the code quality is unlikely to change performance.

Reducing register pressure typically only helps when it allows occupancy to improve, which depends on the occupancy and register usage you started out with. And the benefit of the simpler code greatly depends on the characteristics of your kernel and your hardware (memory-, compute- or latency-bound; hardware that does or doesn’t allow scheduling i32 and f32 ops together; etc.).
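
Both of those are easy to check from Julia before reaching for a profiler; a quick sketch, reusing the toy kernel from above:

using CUDA

A = CUDA.zeros(Float32, 256, 256, 256)
kernel = @cuda launch=false testLargeIndexSize!(A)

CUDA.registers(kernel)                 # registers per thread for this compilation
CUDA.launch_configuration(kernel.fun)  # occupancy-derived (blocks, threads) suggestion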

I don’t think you have to bother with micro-optimizations like this unless you have a very performance-sensitive kernel. And even then it’s recommended to start out with an analysis by NSight Compute, and follow its guidelines.

Thanks Tim, yeah, that’s a pretty important thing to have missed. Just benchmarking memory throughput, I guess.
I actually used that toy kernel to test a customization I had made to the @cuda macro to support selecting 32-bit indexing. I wanted to check that the custom switch gets overridden, so that 64-bit indexing is used whenever the number of elements exceeds 2^32.
I used it as an example because the generated code is tiny.
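
For timing, I’ve mostly been doing something like the following (a sketch; the sizes and launch configuration here are just illustrative):

using CUDA

A = CUDA.zeros(Float32, 512, 512, 512)
threads = (32, 8, 1)
blocks  = cld.(size(A), threads)

kernel = @cuda launch=false testLargeIndexSize!(A)
kernel(A; threads, blocks)                    # compile and warm up
t = CUDA.@elapsed kernel(A; threads, blocks)  # event-based timing, in seconds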

A better example may be the diffusion3D example from @omlins (slightly amended):

function diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)
    ix = (blockIdx().x-1i32) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1i32) * blockDim().y + threadIdx().y
    T_ix_iy_izm1 = 0.0f0
    T_ix_iy_iz   = 0.0f0
    T_ix_iy_izp1 = T[ix,iy,1i32]

    nx, ny, nz = size(T2)

    for iz = 1i32:nz
        T_ix_iy_izm1   = T_ix_iy_iz
        T_ix_iy_iz     = T_ix_iy_izp1
        T_ix_iy_izp1   = iz<nz ? T[ix,iy,iz+1i32] : 0.0f0

        if (ix>1i32 && ix<nx && iy>1i32 && iy<ny && iz>1i32 && iz<nz)
            T2[ix,iy,iz] = T_ix_iy_iz + dt*(Ci[ix,iy,iz]*(
                            - ((-lam*(T[ix+1i32,iy,iz] - T_ix_iy_iz)*_dx) - (-lam*(T_ix_iy_iz - T[ix-1i32,iy,iz])*_dx))*_dx
                            - ((-lam*(T[ix,iy+1i32,iz] - T_ix_iy_iz)*_dy) - (-lam*(T_ix_iy_iz - T[ix,iy-1i32,iz])*_dy))*_dy
                            - ((-lam*(T_ix_iy_izp1 - T_ix_iy_iz)*_dz) - (-lam*(T_ix_iy_iz - T_ix_iy_izm1)*_dz))*_dz
                            ))
        end
    end
    return
end

So at least we now have array indexing varying over the z axis.
This is also going to be memory bound. NSight shows something like 12-14% compute and a possible 14% improvement if, e.g., stalls can be reduced. OK, we can try to improve the code, but isn’t the question whether or not 32-bit indexing can improve performance by itself?
Any 32-bit indexing example is, by definition, going to access arrays. In the above example, the T matrix is accessed by the 32-bit indices ix, iy and iz.
We have a flat 64-bit address space, so all accesses will be at address &T + n*(ix+(iy+iz*ny)*nx+c), where n is the element size in bytes and c is a constant.
In the above example we access flattened arrays with offsets:

  • n*(ix+(iy+iz*ny)*nx)
  • n*(ix+(iy+iz*ny)*nx-1)
  • n*(ix+(iy+iz*ny)*nx+1)

The compute requirement for this seems minimal and is likely swamped by the latency of global array accesses.
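
Spelled out in Julia, the flattened 1-based index behind those offsets is just the usual column-major formula (the constant c above absorbs the -1 terms):

# Linear index of A[ix, iy, iz] for an nx-by-ny-by-nz array;
# LinearIndices((nx, ny, nz))[ix, iy, iz] computes the same thing.
lin(ix, iy, iz, nx, ny) = ix + (iy - 1 + (iz - 1) * ny) * nx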

If I use 32-bit indexing for this example, the LLVM code consistently shows 32-bit index operations.

define ptx_kernel void @_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0_({ i64, i32 } %state, { i8 addrspace(1)*, i64, [3 x i32], i64 } %0, { i8 addrspace(1)*, i64, [3 x i32], i64 } %1, { i8 addrspace(1)*, i64, [3 x i32], i64 } %2, float %3, float %4, float %5, float %6, float %7) local_unnamed_addr {
conversion:
  %.fca.2.0.extract37 = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %0, 2, 0
  %.fca.2.1.extract38 = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %0, 2, 1
  %.fca.2.2.extract39 = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %0, 2, 2
  %.fca.0.extract11 = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %1, 0
  %.fca.2.0.extract13 = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %1, 2, 0
  %.fca.2.1.extract14 = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %1, 2, 1
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %2, 2, 0
  %.fca.2.1.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %2, 2, 1
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %10 = mul i32 %8, %9
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %12 = add i32 %10, %11
  %13 = add i32 %12, 1
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  %15 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  %16 = mul nuw nsw i32 %14, %15
  %17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  %18 = add nuw nsw i32 %16, %17
  %19 = add nuw nsw i32 %18, 1
  %20 = bitcast i8 addrspace(1)* %.fca.0.extract11 to float addrspace(1)*
  %value_phi = call i32 @llvm.smax.i32(i32 %.fca.2.2.extract39, i32 0)
  %21 = icmp slt i32 %.fca.2.2.extract39, 1
  br i1 %21, label %L477, label %L109.preheader

L109.preheader:                                   ; preds = %conversion
  %22 = mul i32 %.fca.2.0.extract13, %18
  %23 = add i32 %12, %22
  %24 = getelementptr inbounds float, float addrspace(1)* %20, i32 %23
  %25 = load float, float addrspace(1)* %24, align 4
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %2, 0
  %.fca.0.extract35 = extractvalue { i8 addrspace(1)*, i64, [3 x i32], i64 } %0, 0
  %26 = icmp slt i32 %13, 2
  %27 = icmp sge i32 %13, %.fca.2.0.extract37
  %28 = icmp eq i32 %18, 0
  %29 = icmp sge i32 %19, %.fca.2.1.extract38
  %30 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
  %31 = fneg float %3
  %32 = add i32 %12, -1
  %33 = add nsw i32 %18, -1
  %34 = bitcast i8 addrspace(1)* %.fca.0.extract35 to float addrspace(1)*
  %35 = select i1 %26, i1 true, i1 %27
  %brmerge = select i1 %35, i1 true, i1 %28
  %36 = select i1 %brmerge, i1 true, i1 %29
  br label %L109

L109:                                             ; preds = %L466, %L109.preheader
  %value_phi4 = phi i32 [ %93, %L466 ], [ 1, %L109.preheader ]
  %value_phi6 = phi float [ %value_phi8, %L466 ], [ %25, %L109.preheader ]
  %value_phi7 = phi float [ %value_phi6, %L466 ], [ 0.000000e+00, %L109.preheader ]
  %.not = icmp sge i32 %value_phi4, %.fca.2.2.extract39
  br i1 %.not, label %L160, label %L115

L115:                                             ; preds = %L109
  %37 = mul i32 %value_phi4, %.fca.2.1.extract14
  %reass.add = add i32 %18, %37
  %reass.mul = mul i32 %reass.add, %.fca.2.0.extract13
  %38 = add i32 %12, %reass.mul
  %39 = getelementptr inbounds float, float addrspace(1)* %20, i32 %38
  %40 = load float, float addrspace(1)* %39, align 4
  br label %L160

L160:                                             ; preds = %L115, %L109
  %value_phi8 = phi float [ %40, %L115 ], [ 0.000000e+00, %L109 ]
  %41 = icmp ult i32 %value_phi4, 2
  %or.cond57 = select i1 %36, i1 true, i1 %41
  %brmerge58 = or i1 %or.cond57, %.not
  br i1 %brmerge58, label %L466, label %L173

L173:                                             ; preds = %L160
  %42 = add nsw i32 %value_phi4, -1
  %43 = mul i32 %42, %.fca.2.1.extract
  %reass.add43 = add i32 %18, %43
  %reass.mul44 = mul i32 %reass.add43, %.fca.2.0.extract
  %44 = add i32 %12, %reass.mul44
  %45 = getelementptr inbounds float, float addrspace(1)* %30, i32 %44
  %46 = load float, float addrspace(1)* %45, align 4
  %47 = mul i32 %42, %.fca.2.1.extract14
  %reass.add45 = add i32 %18, %47
  %reass.mul46 = mul i32 %reass.add45, %.fca.2.0.extract13
  %48 = add i32 %reass.mul46, %13
  %49 = getelementptr inbounds float, float addrspace(1)* %20, i32 %48
  %50 = load float, float addrspace(1)* %49, align 4
  %51 = fsub float %50, %value_phi6
  %52 = fmul float %51, %31
  %53 = fmul float %52, %5
  %54 = add i32 %32, %reass.mul46
  %55 = getelementptr inbounds float, float addrspace(1)* %20, i32 %54
  %56 = load float, float addrspace(1)* %55, align 4
  %57 = fsub float %value_phi6, %56
  %58 = fmul float %57, %31
  %59 = fmul float %58, %5
  %60 = fsub float %53, %59
  %61 = fneg float %60
  %62 = fmul float %61, %5
  %reass.add49 = add i32 %47, %19
  %reass.mul50 = mul i32 %reass.add49, %.fca.2.0.extract13
  %63 = add i32 %12, %reass.mul50
  %64 = getelementptr inbounds float, float addrspace(1)* %20, i32 %63
  %65 = load float, float addrspace(1)* %64, align 4
  %66 = fsub float %65, %value_phi6
  %67 = fmul float %66, %31
  %68 = fmul float %67, %6
  %reass.add51 = add i32 %33, %47
  %reass.mul52 = mul i32 %reass.add51, %.fca.2.0.extract13
  %69 = add i32 %12, %reass.mul52
  %70 = getelementptr inbounds float, float addrspace(1)* %20, i32 %69
  %71 = load float, float addrspace(1)* %70, align 4
  %72 = fsub float %value_phi6, %71
  %73 = fmul float %72, %31
  %74 = fmul float %73, %6
  %75 = fsub float %68, %74
  %76 = fmul float %75, %6
  %77 = fsub float %62, %76
  %78 = fsub float %value_phi8, %value_phi6
  %79 = fmul float %78, %31
  %80 = fmul float %79, %7
  %81 = fsub float %value_phi6, %value_phi7
  %82 = fmul float %81, %31
  %83 = fmul float %82, %7
  %84 = fsub float %80, %83
  %85 = fmul float %84, %7
  %86 = fsub float %77, %85
  %87 = fmul float %46, %86
  %88 = fmul float %87, %4
  %89 = fadd float %value_phi6, %88
  %90 = mul i32 %.fca.2.1.extract38, %42
  %reass.add53 = add i32 %18, %90
  %reass.mul54 = mul i32 %reass.add53, %.fca.2.0.extract37
  %91 = add i32 %12, %reass.mul54
  %92 = getelementptr inbounds float, float addrspace(1)* %34, i32 %91
  store float %89, float addrspace(1)* %92, align 4
  br label %L466

L466:                                             ; preds = %L173, %L160
  %.not42.not = icmp eq i32 %value_phi4, %value_phi
  %93 = add nuw i32 %value_phi4, 1
  br i1 %.not42.not, label %L477, label %L109

L477:                                             ; preds = %L466, %conversion
  ret void
}

So does the PTX code, except at the points where array addresses are computed, which need to be 64-bit.

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_89
.address_size 64

        // .globl       _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0_ // -- Begin function _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0_
                                        // @_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0_
.visible .entry _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0_(
        .param .align 8 .b8 _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_0[16],
        .param .align 8 .b8 _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_1[40],
        .param .align 8 .b8 _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_2[40],
        .param .align 8 .b8 _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_3[40],
        .param .f32 _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_4,
        .param .f32 _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_5,
        .param .f32 _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_6,
        .param .f32 _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_7,
        .param .f32 _Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_8
)
{
        .reg .pred      %p<15>;
        .reg .b32       %r<71>;
        .reg .f32       %f<51>;
        .reg .b64       %rd<26>;

// %bb.0:                               // %conversion
        ld.param.u32    %r42, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_1+24];
        setp.lt.s32     %p2, %r42, 1;
        @%p2 bra        $L__BB0_7;
// %bb.1:                               // %L109.preheader
        ld.param.v2.u32         {%r43, %r44}, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_3+16];
        ld.param.u64    %rd8, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_3];
        ld.param.v2.u32         {%r40, %r41}, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_1+16];
        ld.param.u64    %rd5, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_1];
        ld.param.f32    %f11, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_8];
        ld.param.f32    %f10, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_7];
        ld.param.f32    %f9, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_6];
        ld.param.f32    %f8, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_5];
        ld.param.f32    %f7, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_4];
        ld.param.u64    %rd1, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_2];
        ld.param.v2.u32         {%r4, %r5}, [_Z17diffusion3D_step_13CuDeviceArrayI7Float32Li3ELi1E5Int32ES2_S2_S0_S0_S0_S0_S0__param_2+16];
        mov.u32         %r46, %ctaid.x;
        mov.u32         %r47, %ntid.x;
        mul.lo.s32      %r8, %r46, %r47;
        mov.u32         %r9, %tid.x;
        add.s32         %r10, %r8, %r9;
        add.s32         %r11, %r10, 1;
        mov.u32         %r48, %ctaid.y;
        mov.u32         %r49, %ntid.y;
        mul.lo.s32      %r12, %r48, %r49;
        mov.u32         %r13, %tid.y;
        add.s32         %r14, %r12, %r13;
        add.s32         %r15, %r14, 1;
        max.s32         %r16, %r42, 0;
        mul.lo.s32      %r51, %r4, %r14;
        add.s32         %r52, %r10, %r51;
        mul.wide.s32    %rd11, %r52, 4;
        add.s64         %rd12, %rd1, %rd11;
        ld.global.f32   %f48, [%rd12];
        setp.lt.s32     %p3, %r11, 2;
        setp.ge.s32     %p4, %r11, %r40;
        setp.eq.s32     %p5, %r14, 0;
        setp.ge.s32     %p6, %r15, %r41;
        neg.f32         %f2, %f7;
        or.pred         %p7, %p3, %p4;
        or.pred         %p8, %p7, %p5;
        or.pred         %p1, %p8, %p6;
        mad.lo.s32      %r69, %r40, %r14, %r9;
        mul.lo.s32      %r18, %r41, %r40;
        mad.lo.s32      %r68, %r43, %r14, %r9;
        mul.lo.s32      %r20, %r44, %r43;
        add.s32         %r53, %r14, -1;
        mad.lo.s32      %r67, %r4, %r53, %r9;
        mul.lo.s32      %r22, %r5, %r4;
        add.s32         %r66, %r9, %r51;
        mad.lo.s32      %r65, %r4, %r15, %r9;
        add.s32         %r54, %r5, %r13;
        add.s32         %r55, %r54, %r12;
        mad.lo.s32      %r64, %r4, %r55, %r9;
        mov.f32         %f12, 0f00000000;
        mov.u32         %r70, 0;
        mov.f32         %f49, %f12;
        bra.uni         $L__BB0_2;
$L__BB0_6:                              // %L466
                                        //   in Loop: Header=BB0_2 Depth=1
        add.s32         %r69, %r69, %r18;
        add.s32         %r68, %r68, %r20;
        add.s32         %r67, %r67, %r22;
        add.s32         %r66, %r66, %r22;
        add.s32         %r65, %r65, %r22;
        add.s32         %r64, %r64, %r22;
        setp.ne.s32     %p14, %r16, %r70;
        mov.f32         %f49, %f3;
        @%p14 bra       $L__BB0_2;
        bra.uni         $L__BB0_7;
$L__BB0_2:                              // %L109
                                        // =>This Inner Loop Header: Depth=1
        mov.f32         %f3, %f48;
        add.s32         %r70, %r70, 1;
        setp.ge.s32     %p9, %r70, %r42;
        mov.f32         %f48, %f12;
        @%p9 bra        $L__BB0_4;
// %bb.3:                               // %L115
                                        //   in Loop: Header=BB0_2 Depth=1
        add.s32         %r56, %r8, %r64;
        mul.wide.s32    %rd13, %r56, 4;
        add.s64         %rd4, %rd1, %rd13;
        ld.global.f32   %f48, [%rd4];
$L__BB0_4:                              // %L160
                                        //   in Loop: Header=BB0_2 Depth=1
        setp.lt.u32     %p11, %r70, 2;
        or.pred         %p12, %p1, %p11;
        or.pred         %p13, %p12, %p9;
        @%p13 bra       $L__BB0_6;
// %bb.5:                               // %L173
                                        //   in Loop: Header=BB0_2 Depth=1
        add.s32         %r57, %r8, %r68;
        mul.wide.s32    %rd14, %r57, 4;
        add.s64         %rd15, %rd8, %rd14;
        ld.global.f32   %f14, [%rd15];
        add.s32         %r58, %r8, %r66;
        add.s32         %r59, %r58, 1;
        mul.wide.s32    %rd16, %r59, 4;
        add.s64         %rd17, %rd1, %rd16;
        ld.global.f32   %f15, [%rd17];
        sub.f32         %f16, %f15, %f3;
        mul.f32         %f17, %f16, %f2;
        mul.f32         %f18, %f17, %f9;
        add.s32         %r60, %r58, -1;
        mul.wide.s32    %rd18, %r60, 4;
        add.s64         %rd19, %rd1, %rd18;
        ld.global.f32   %f19, [%rd19];
        sub.f32         %f20, %f3, %f19;
        mul.f32         %f21, %f20, %f2;
        mul.f32         %f22, %f21, %f9;
        sub.f32         %f23, %f18, %f22;
        neg.f32         %f24, %f23;
        mul.f32         %f25, %f24, %f9;
        add.s32         %r61, %r8, %r65;
        mul.wide.s32    %rd20, %r61, 4;
        add.s64         %rd21, %rd1, %rd20;
        ld.global.f32   %f26, [%rd21];
        sub.f32         %f27, %f26, %f3;
        mul.f32         %f28, %f27, %f2;
        mul.f32         %f29, %f28, %f10;
        add.s32         %r62, %r8, %r67;
        mul.wide.s32    %rd22, %r62, 4;
        add.s64         %rd23, %rd1, %rd22;
        ld.global.f32   %f30, [%rd23];
        sub.f32         %f31, %f3, %f30;
        mul.f32         %f32, %f31, %f2;
        mul.f32         %f33, %f32, %f10;
        sub.f32         %f34, %f29, %f33;
        mul.f32         %f35, %f34, %f10;
        sub.f32         %f36, %f25, %f35;
        sub.f32         %f37, %f48, %f3;
        mul.f32         %f38, %f37, %f2;
        mul.f32         %f39, %f38, %f11;
        sub.f32         %f40, %f3, %f49;
        mul.f32         %f41, %f40, %f2;
        mul.f32         %f42, %f41, %f11;
        sub.f32         %f43, %f39, %f42;
        mul.f32         %f44, %f43, %f11;
        sub.f32         %f45, %f36, %f44;
        mul.f32         %f46, %f14, %f45;
        fma.rn.f32      %f47, %f46, %f8, %f3;
        add.s32         %r63, %r8, %r69;
        mul.wide.s32    %rd24, %r63, 4;
        add.s64         %rd25, %rd5, %rd24;
        st.global.f32   [%rd25], %f47;
        bra.uni         $L__BB0_6;
$L__BB0_7:                              // %L477
        ret;
                                        // -- End function
}

The SASS code shows 29 registers being used in the 32-bit indexing case versus 38 registers for 64-bit indexing, so at least register usage is lower. The SASS for the 64-bit version uses the uniform register file and LEA calculations, versus IMAD instructions for the 32-bit version. Run times using the setup from @omlins don’t appear to differ by much at all between 32-bit and 64-bit indexing.

Here is what NSight Compute shows at the summary level; IDs 4-8 are for 64-bit indexing, IDs 13-17 for 32-bit indexing:

I’m struggling to see any obvious benefit from using 32-bit indexing, but I’m happy to try it out on other kernels if people can supply them.

That’s not insignificant. In this case it probably doesn’t matter, since you have a simple kernel with low register pressure, but if you look at the “Impact of Varying Register Count” graph for some kernels it can matter a lot. An example I found on the internet:

The other advantage is the ability to co-execute Float32 and Int32 (array indexing) operations, but I guess that would only manifest clearly with very specific kernels.


The problem I have with this is that the reduction in register use doesn’t always materialize, and I think that’s because you end up fighting an optimizing compiler.
This may be confusing to users who enable the feature and don’t see any benefit.

As an example of what can happen, the following change to the 3D diffusion example uses explicit array index calculations. With 64-bit indexing only 24 registers are used, but 29 registers with 32-bit indexing (forced by changing llvm_datalayout):

function diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)
    ix = (blockIdx().x-1i32) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1i32) * blockDim().y + threadIdx().y
    
    T_ix_iy_izm1 = 0.0f0
    T_ix_iy_iz   = 0.0f0 

    nx, ny, nz = size(T2)

    # wide multiplication not strictly needed here
    initelem = ix + widemul((iy - 1i32), nx) 
    T_ix_iy_izp1 = T[initelem]


    for iz = 1i32:nz

        # wide multiplication not strictly needed here
        izp1elem = ix + widemul((iy - 1i32 + (iz)*ny), nx)

        T_ix_iy_izm1   = T_ix_iy_iz
        T_ix_iy_iz     = T_ix_iy_izp1
        T_ix_iy_izp1   = iz<nz ? T[izp1elem] : 0.0f0

        if (ix>1i32 && ix<nx && iy>1i32 && iy<ny && iz>1i32 && iz<nz)

            elem0    = ix     + widemul((iy - 1i32 + (iz-1i32)*ny), nx)
            ixp1elem = ix + 1 + widemul((iy - 1i32 + (iz-1i32)*ny), nx)
            ixm1elem = ix - 1 + widemul((iy - 1i32 + (iz-1i32)*ny), nx)
            iyp1elem = ix     + widemul((iy        + (iz-1i32)*ny), nx)
            iym1elem = ix     + widemul((iy - 2i32 + (iz-1i32)*ny), nx)

            T2[elem0] = T_ix_iy_iz + dt*(Ci[elem0]*(
                                - ((-lam*(T[ixp1elem] - T_ix_iy_iz)*_dx) - (-lam*(T_ix_iy_iz - T[ixm1elem])*_dx))*_dx
                                - ((-lam*(T[iyp1elem] - T_ix_iy_iz)*_dy) - (-lam*(T_ix_iy_iz - T[iym1elem])*_dy))*_dy
                                - ((-lam*(T_ix_iy_izp1 - T_ix_iy_iz)*_dz) - (-lam*(T_ix_iy_iz - T_ix_iy_izm1)*_dz))*_dz
                            ))
        end
    end
    return
end

Now, if you do want to add this feature to CUDA.jl, then I think it can be done safely. That is, make it opt-in via a new kwarg to the @cuda macro so that it doesn’t impact other code. The macro checks the kernel arguments for any device array that requires 64-bit indexing and forces 64-bit indexing in that case. The value of the kwarg is passed as a new field of PTXCompilerTarget so that it can be used to adjust the llvm_datalayout string.
However, given the lack of interest in this post, it doesn’t seem like many people would use it, so perhaps it’s not worth the effort?
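
For what it’s worth, the guard itself would be simple; a sketch (none of these names exist in CUDA.jl today, and I’ve used typemax(Int32) to stay on the safe side for signed indices):

using CUDA

# Hypothetical check the macro could run over the kernel arguments:
needs_i64_indexing(A::CuArray) = length(A) > typemax(Int32)
needs_i64_indexing(::Any) = false

# If 32-bit indexing was requested but any array is too large, fall back to 64 bits.
effective_index_width(index32::Bool, args...) =
    (index32 && !any(needs_i64_indexing, args)) ? 32 : 64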

And in any case, it is possible to force the same behaviour without this feature if one is willing to desugar one’s code. For example, the following code (where I’ve tried to be as explicit as possible) is practically equivalent to forcing 32-bit indexing:

function diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)
    ix = (blockIdx().x-1i32) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1i32) * blockDim().y + threadIdx().y
    
    T_ix_iy_izm1 = 0.0f0
    T_ix_iy_iz   = 0.0f0
    

    nx, ny, nz = size(T2)

    initoffseti32 = (ix - 1i32 + (iy - 1i32)*nx) 
    initelementi64 = initoffseti32 +  1
    T_ix_iy_izp1 = T[initelementi64]

    for iz = 1i32:nz

        izp1off = ix - 1i32 + (iy - 1i32 + (iz)*ny)*nx
        izp1elem = izp1off + 1
        T_ix_iy_izm1   = T_ix_iy_iz
        T_ix_iy_iz     = T_ix_iy_izp1
        T_ix_iy_izp1   = iz<nz ? T[izp1elem] : 0.0f0

        if (ix>1i32 && ix<nx && iy>1i32 && iy<ny && iz>1i32 && iz<nz)

            #32 bit arithmetic works
            # but need to ensure that nx*ny*nz <= 2^32
            # otherwise you'll get memory errors
            offset0 = ix - 1i32 + (iy - 1i32 + (iz-1i32)*ny) * nx
            ixp1off = ix        + (iy - 1i32 + (iz-1i32)*ny) * nx
            ixm1off = ix - 2i32 + (iy - 1i32 + (iz-1i32)*ny) * nx
            iyp1off = ix - 1i32 + (iy        + (iz-1i32)*ny) * nx
            iym1off = ix - 1i32 + (iy - 2i32 + (iz-1i32)*ny) * nx

            #widen to 64 bits (Julia has 1-based indexing)
            elem0 = offset0 + 1
            ixp1elem = ixp1off + 1
            ixm1elem = ixm1off + 1
            iyp1elem = iyp1off + 1
            iym1elem = iym1off + 1

            # 64-bit addition of array base addresses
            T2[elem0] = T_ix_iy_iz + dt*(Ci[elem0]*(
                                - ((-lam*(T[ixp1elem] - T_ix_iy_iz)*_dx) - (-lam*(T_ix_iy_iz - T[ixm1elem])*_dx))*_dx
                                - ((-lam*(T[iyp1elem] - T_ix_iy_iz)*_dy) - (-lam*(T_ix_iy_iz - T[iym1elem])*_dy))*_dy
                                - ((-lam*(T_ix_iy_izp1 - T_ix_iy_iz)*_dz) - (-lam*(T_ix_iy_iz - T_ix_iy_izm1)*_dz))*_dz
                            ))
        end
    end
    return
end