CartesianIndices performance: 3D faster than 2D?

I have encountered a strange occurence when testing the performance of CartesianIndices:

using BenchmarkTools

function test(t::Tuple)
  for _ ∈ CartesianIndices(t)
  end
end

@btime test((100,)) # 3.094 ns
@btime test((100, 100)) # 9.531 μs
@btime test((100, 100, 100)) # 7.267 μs
@btime test((100, 100, 100, 100)) # 152.958 ms

There are 2 behaviors here I’m confused about:

  1. The 3-dimensional loop takes less time than the 2-dimensional loop.
  2. The 4-dimensional loop takes over 20,000 times longer than the 3-dimensional loop despite (theoretically) only having to iterate over 100x the indices. Why does it not take 100x longer?

Any help would be appreciated!

1 Like

I don’t know what the compiler is doing but remember that your code is not doing anything and it might be optimized away.

For the first two cases, the for body is optimized away:

julia> @code_llvm test((100,))
;  @ REPL[2]:1 within `test`
define void @julia_test_568([1 x i64]* nocapture nonnull readonly align 8 dereferenceable(8) %0) #0 {
top:
;  @ REPL[2]:3 within `test`
  ret void
}
julia> @code_llvm test((100,100))
;  @ REPL[2]:1 within `test`
define void @julia_test_574([2 x i64]* nocapture nonnull readonly align 8 dereferenceable(16) %0) #0 {
top:
;  @ REPL[2]:3 within `test`
  ret void
}

Now, for test(100,100,100), the code is much larger.

By the way, my timings are:

julia> @btime test((100,))
  2.194 ns (0 allocations: 0 bytes)

julia> @btime test((100,100))
  2.194 ns (0 allocations: 0 bytes)

julia> @btime test((100,100,100))
  3.444 μs (0 allocations: 0 bytes)

julia> @btime test((100,100,100,100))
  60.121 ms (0 allocations: 0 bytes)

Paulo

1 Like

Thanks for the response! Indeed the compiler optimizing away the loop body explains all of these results. After testing the same function with @noinline and a simple addition in the loop body, I got the following:

@btime test((100,)) # 112.413 ns
@btime test((100, 100)) # 12.523 μs
@btime test((100, 100, 100)) # 1.094 ms
@btime test((100, 100, 100, 100)) # 139.926 ms

Each roughly 100 times more than the previous one as expected.

1 Like