Is there a `CartesianIndex` using `Int32`?

SteffenPL · May 20, 2024, 6:18am

The definition in Julia's base/multidimensional.jl is restricted to Int64:

  struct CartesianIndex{N} <: AbstractCartesianIndex{N}
      I::NTuple{N,Int}
      CartesianIndex{N}(index::NTuple{N,Integer}) where {N} = new(index)
  end

Of course, one can quickly implement (a subset of) the functionality manually for Int32, but I was wondering why it is restricted to Int64 in the first place. Or does the multidimensional indexing machinery already exist with other integer types in some other package?
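For illustration, a minimal sketch of what such a manual subset could look like. The name `CartesianIndex32` is hypothetical (it is not part of Base or of any package referenced in this thread), and only construction, addition and scalar indexing are covered:

```julia
# Hypothetical type, not from Base: an N-dimensional index stored as Int32.
struct CartesianIndex32{N}
    I::NTuple{N,Int32}
end

# Convenience constructor from plain integers, e.g. CartesianIndex32(2, 3).
CartesianIndex32(I::Vararg{Integer,N}) where {N} =
    CartesianIndex32{N}(map(Int32, I))

# Elementwise addition stays in Int32 (no promotion to Int).
Base.:+(a::CartesianIndex32{N}, b::CartesianIndex32{N}) where {N} =
    CartesianIndex32{N}(map(+, a.I, b.I))

# Scalar indexing: splat the Int32 components into ordinary indexing.
Base.getindex(A::AbstractArray{T,N}, i::CartesianIndex32{N}) where {T,N} =
    A[i.I...]

A = collect(reshape(1:12, 3, 4))
i = CartesianIndex32(1, 2) + CartesianIndex32(1, 1)
value = A[i]   # same element as A[2, 3]
```

Note that the final `A[i.I...]` still goes through Base's usual index conversion, so this only keeps the index arithmetic in `Int32`, not the array access itself.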

oxinabox · May 20, 2024, 6:36am

It is using Int, which defaults to Int32 on 32-bit systems and Int64 on 64-bit systems.
That is generally the most reasonable thing to do, since at the end of the day you need to end up with a pointer that matches the pointer size of the system’s hardware.
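A quick way to check this on the host (a small sketch; the values in the comments assume a 64-bit machine):

```julia
# Int is an alias picked from the host word size.
wordsize = Sys.WORD_SIZE                      # 64 on a 64-bit host
expected = wordsize == 64 ? Int64 : Int32
int_matches_host = Int === expected

# Integer literals and the CartesianIndex field both use that alias.
literal_type = typeof(1)
field_type = fieldtype(CartesianIndex{2}, :I)  # NTuple{2, Int}
```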

Why do you not want this?

SteffenPL · May 20, 2024, 6:40am

Mh, I am working on GPU programming. The host system is 64-bit, which is why Int in this context is Int64, but the kernels are executed on the GPU and should avoid 64-bit integers.

To be concrete, it is this code: SpatialHashTables.jl/src/core.jl at 25bdc1c97c85dfad9255a9b281829fcfe1d2d48e · SteffenPL/SpatialHashTables.jl · GitHub
which is called in a setting like the following:
SpatialHashTables.jl/benchmarks/report/forcebenchmark.jl at 25bdc1c97c85dfad9255a9b281829fcfe1d2d48e · SteffenPL/SpatialHashTables.jl · GitHub

I found with benchmarks that the use of Int64 is indeed the bottleneck here. That’s why I’m looking into this.

Sukera · May 20, 2024, 10:39am

Is the GPU architecture 32 bit? In theory, if Int were truly architecture informed, it should be an Int32 there. That’s currently not possible though, since Int is hardcoded on the host side…

SteffenPL · May 20, 2024, 10:57am

I’m using KernelAbstractions.jl, and all of that goes through GPUCompiler.jl, but it requires explicit annotations to enforce Int32. I found this discussion, which seems related:

github.com/JuliaGPU/CUDA.jl

Int64 literals vs Int32 constants: avoid conversions & checks

opened 08:23AM - 11 Jan 17 UTC

maleadt

cuda kernels performance

Many constants in CUDA world are 32-bit, eg. the warp-size, thread or block IDs and dimensions, etc. We don't promote these to Int64 in order to avoid conversions when doing math on them, however it might be equally expensive not to do so because of conversions when doing math with literals. For example, take the following idiomatic code:

```julia
function reduce_warp{F<:Function,T}(op::F, val::T)::T
    offset = CUDAnative.warpsize() ÷ 2
    while offset > 0
        val = op(val, shfl_down(val, offset))
        offset ÷= 2
    end
    return val
end
```

`warpsize` yields an Int32, but gets converted and promoted to Int64 because of the `÷ 2`. This in turn causes `shfl_down`, which takes an Int32, to convert it back, including an exactness check + exception (trap):

```llvm
julia> CUDAnative.code_llvm(reduce_warp, (typeof(+), Int32))

define i32 @julia_reduce_warp_62748(i32) local_unnamed_addr #0 {
top:
  %1 = tail call i32 @llvm.nvvm.read.ptx.sreg.warpsize()
  %2 = icmp slt i32 %1, 2
  br i1 %2, label %L23, label %if.preheader

if.preheader:                                     ; preds = %top
  %3 = lshr i32 %1, 1
  %4 = zext i32 %3 to i64
  br label %if

if:                                               ; preds = %if.preheader, %pass2
  %val.03 = phi i32 [ %9, %pass2 ], [ %0, %if.preheader ]
  %offset.02 = phi i64 [ %10, %pass2 ], [ %4, %if.preheader ]
  %sext = shl i64 %offset.02, 32
  %5 = ashr exact i64 %sext, 32
  %6 = icmp eq i64 %5, %offset.02
  br i1 %6, label %pass2, label %fail1

L23.loopexit:                                     ; preds = %pass2
  br label %L23

L23:                                              ; preds = %L23.loopexit, %top
  %val.0.lcssa = phi i32 [ %0, %top ], [ %9, %L23.loopexit ]
  ret i32 %val.0.lcssa

fail1:                                            ; preds = %if
  tail call void @llvm.trap()
  unreachable

pass2:                                            ; preds = %if
  %7 = trunc i64 %offset.02 to i32
  %8 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %val.03, i32 %7, i32 31)
  %9 = add i32 %8, %val.03
  %10 = lshr i64 %offset.02, 1
  %11 = icmp eq i64 %10, 0
  br i1 %11, label %L23.loopexit, label %if
}
```

An improved, but less readable version of the same code goes like:

```julia
function reduce_warp{F<:Function,T}(op::F, val::T)::T
    offset = CUDAnative.warpsize() ÷ Int32(2)
    while offset > Int32(0)
        val = op(val, shfl_down(val, offset))
        offset ÷= Int32(2)
    end
    return val
end
```

This yields the following, much cleaner IR:

```llvm
define i32 @julia_reduce_warp_62749(i32) local_unnamed_addr #0 {
top:
  %1 = tail call i32 @llvm.nvvm.read.ptx.sreg.warpsize()
  %2 = icmp slt i32 %1, 2
  br i1 %2, label %L25, label %if.preheader

if.preheader:                                     ; preds = %top
  br label %if

if:                                               ; preds = %if.preheader, %if
  %offset.03.in = phi i32 [ %offset.03, %if ], [ %1, %if.preheader ]
  %val.02 = phi i32 [ %4, %if ], [ %0, %if.preheader ]
  %offset.03 = sdiv i32 %offset.03.in, 2
  %3 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %val.02, i32 %offset.03, i32 31)
  %4 = add i32 %3, %val.02
  %5 = icmp slt i32 %offset.03.in, 4
  br i1 %5, label %L25.loopexit, label %if

L25.loopexit:                                     ; preds = %if
  br label %L25

L25:                                              ; preds = %L25.loopexit, %top
  %val.0.lcssa = phi i32 [ %0, %top ], [ %4, %L25.loopexit ]
  ret i32 %val.0.lcssa
}
```
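The promotion described in that issue can be reproduced on the CPU without any GPU package. A small sketch, where `w` is just a stand-in for a 32-bit constant such as the warp size:

```julia
w = Int32(32)                       # stand-in for an Int32 GPU constant

# A bare literal is an Int, so the result is promoted to Int.
promoted = typeof(w ÷ 2)            # Int64 on a 64-bit host

# Typed literals keep the arithmetic in Int32.
kept = typeof(w ÷ Int32(2))         # Int32

# one(w) has the type of w, which avoids spelling out Int32(1).
kept_one = typeof(w + one(w))       # Int32
```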

Benny · May 20, 2024, 11:14am

Doesn’t to_indices always take care of the conversion during indexing? Seems possible to wrangle any Integer before then.
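For reference, a small sketch of that conversion on a plain `Array` (how this lowers inside a GPU kernel is not shown here):

```julia
A = zeros(Float32, 3, 4)
I32 = (Int32(2), Int32(3))

# to_indices converts each integer index to Int before the array is accessed.
converted = Base.to_indices(A, I32)

# Indexing with Int32 therefore works and hits the same element.
A[2, 3] = 1.5f0
same = A[I32...] === A[2, 3]
```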

Sukera · May 20, 2024, 12:05pm

Yes, that’s the GPU specific discussion about much the same problem. The core issue is that the size of Int is determined by the parser on the host system, not the target architecture.

In principle yes, but the problem is that the conversion might throw if the integer is larger than typemax(Int32). Because CartesianIndex mandates Int64 due to the host system being 64-bit, the error check can’t be removed, since (from the POV of the type system) the index might have a runtime value that hits the error path.
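To make the error path concrete, a short sketch comparing the checked conversion with an unchecked truncation (the variable names are only illustrative):

```julia
toolarge = Int64(typemax(Int32)) + 1   # one past the Int32 range

# Checked conversion: has to branch to an error path at runtime.
threw = try
    Int32(toolarge)
    false
catch err
    err isa InexactError
end

# Unchecked truncation: keeps the low 32 bits and never throws.
wrapped = toolarge % Int32             # wraps around to typemin(Int32)
small = Int64(7) % Int32               # in-range values are unchanged
```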

Benny · May 20, 2024, 12:16pm

I’m assuming the integer wrangling occurs in smaller or equal types (types whose values are a subset of the native integer type’s values). It seems like OP wants to leverage this for speed.

There was an old package for changing the types that literals get, which seems at least half-relevant here.

SteffenPL · May 20, 2024, 12:18pm

Yes, let me know if I should make a minimal example, but essentially I need to iterate through a small range of Cartesian indices (starts:ends) on the GPU and apply the mod1 function to each of them. I found that an innocent index + 1 here and there already killed performance, and now I am replacing the Cartesian indices altogether. (Which seems to work, but makes the code look uglier.)

mod1 is the core function, and it seems to hurt performance when used with Int64 instead of Int32.
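A rough CPU-only sketch of that pattern with plain `Int32` tuples instead of `CartesianIndex`. The helper name `wrap` and the grid values are made up for illustration and are not taken from SpatialHashTables.jl:

```julia
# Hypothetical helper: wrap an N-dimensional cell index into a periodic grid.
# Both arguments are Int32 tuples, so mod1 never sees an Int64.
wrap(cell::NTuple{N,Int32}, gridsize::NTuple{N,Int32}) where {N} =
    map(mod1, cell, gridsize)

gridsize = Int32.((4, 4))
cell = Int32.((4, 1))            # a cell in the corner of the grid
offset = Int32.((1, -1))         # step to a diagonal neighbor

# Tuple broadcasting keeps the element type, so cell .+ offset is Int32 too.
neighbor = wrap(cell .+ offset, gridsize)
```

Here `(5, 0)` wraps around to `(1, 4)`, and every intermediate value stays an `Int32`.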
