Allocations when indexing (Abstract)Arrays with large dimension

I’ve been working with tensors based on AbstractArrays (see https://gist.github.com/AlexanderNenninger/461f37315e45071a8c91b18d73901431 ) that map Base.getindex(A, i...) to a function call A.f(i).
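
Roughly the kind of type I mean; this is a stripped-down sketch (the LazyTensor name and the hard-coded Float64 element type are just for illustration here, the gist has the real version):

struct LazyTensor{N,F} <: AbstractArray{Float64,N}
    f::F                  # called with the index tuple on every access
    dims::NTuple{N,Int}
end
Base.size(A::LazyTensor) = A.dims
Base.getindex(A::LazyTensor{N}, i::Vararg{Int,N}) where {N} = A.f(i)

A = LazyTensor(i -> Float64(sum(i)), ntuple(_ -> 2, 34))
A[ntuple(_ -> 1, 34)...]   # == 34.0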

While the individual indices have a small size (usually 2-8), the dimension of the array/tensor (number of indices) is quite large - up to 200. For dimensions >32 I run into some inference limits that lead to massive allocations, maybe related to this: https://github.com/JuliaLang/julia/pull/27398 ?
This seems to affect every part of the AbstractArray interface, e.g. inference of CartesianIndices:

i::Any = CartesianIndices(ntuple(_ -> 2, Val(n::Core.Const(34))::Val{34})::NTuple{34, Int64})::Any

even if I try to help with a function barrier:

i::CartesianIndices{34, NTuple{34, Base.OneTo{Int64}}} = create_cart(Val(n::Core.Const(34))::Val{34})::CartesianIndices{34, NTuple{34, Base.OneTo{Int64}}}
return i::CartesianIndices{34, NTuple{34, Base.OneTo{Int64}}}[8]::Any
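
For context, create_cart is a small function barrier along these lines (paraphrased; the real code is in the gist linked above):

create_cart(::Val{N}) where {N} = CartesianIndices(ntuple(_ -> 2, Val(N)))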

The same goes for reshape.

If I switch to linear indexing, reshaping to a matrix works and accesses are allocation-free,
but converting the linear index back to a cartesian one (which I do need in the end)
uses splats, so I run into the compiler limits for Core._apply_iterate again ( https://github.com/JuliaLang/julia/blob/197180d8589ad14fc4bc4c23782b76739c4ec5a4/base/compiler/types.jl#L115C28-L115C28 ?) and everything slows to a crawl.
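
For reference, this is the Base-internal routine I mean; you can call it directly (internal API, so this is just an illustration):

dims = ntuple(_ -> 2, 40)
Base._ind2sub(dims, 5)   # (1, 1, 2, 1, 1, ..., 1), the remaining entries all 1; built by recursive splatting internally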

Long story short, can anybody think of a way to

  1. write an _ind2sub-like routine without hitting the splatting limits
  2. or change the limits for a limited section of code?

Many thanks in advance, and please let me know if this is better suited for Internals & Design and I’ll find a MWE and post it there!

Chuckling because I didn’t bother to check that you were doing lazy arrays, so I immediately tried to make a 33-axis Array and hit an OutOfMemoryError. Limiting every axis to length 1 works though, and I can corroborate what you’re seeing. getindex is type-stable if the index arguments are a splatted sequence of integers, getindex(A, fill(1, 33)...) (the splatting itself allocates, though), but not if the argument is a CartesianIndex with length > 32, specifically getindex(A, first(CartesianIndices(A))). With A::Array, I’m seeing the compiler give up on inferring the concrete integer tuple type returned by Base.to_indices; it only gets Tuple.
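
Concretely, the check is roughly this (reconstructed with Base.return_types so the splatted-integer signature is easy to write out; expected results in the comments):

A = zeros(ntuple(_ -> 1, 33)...)   # 33 axes, each of length 1
Base.return_types(getindex, (typeof(A), ntuple(_ -> Int, 33)...))   # Float64: the integer-splat call infers
Base.return_types(getindex, (typeof(A), CartesianIndex{33}))        # not concrete, per the to_indices issue above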

Yeah, that’s precisely why this isn’t an optimized path here.

The issue seems to be specifically with map over a long Tuple, whereas a broadcast operation is type-inferred. Using broadcast in CartesianIndices construction and indexing may improve inference in this case (although I’m unsure whether that’s a good idea in general).
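
For example, a quick way to see the difference on a long homogeneous tuple (my own check of the above, with the expected results in the comments):

t = ntuple(_ -> 2, Val(64))
Base.return_types(x -> map(i -> i + 1, x), (typeof(t),))   # only Tuple: the length is lost
Base.return_types(x -> x .+ 1, (typeof(t),))               # NTuple{64, Int64}: broadcast keeps the length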

Indexing assumes homogeneous integer tuples, right? Could SArrays potentially get around this?

I was going to respond “surely not, an SVector is just an NTuple”, but as it turns out…

julia> function f2(x)
       return map(i -> i + 1, x)
       end
f2 (generic function with 1 method)

julia> @code_warntype f2(ntuple(_ -> 0, Val(64)))
MethodInstance for f2(::NTuple{64, Int64})
  from f2(x) @ Main REPL[64]:1
Arguments
  #self#::Core.Const(f2)
  x::NTuple{64, Int64}
Locals
  #13::var"#13#14"
Body::Tuple
1 ─      (#13 = %new(Main.:(var"#13#14")))
│   %2 = #13::Core.Const(var"#13#14"())
│   %3 = Main.map(%2, x)::Tuple
└──      return %3


julia> using StaticArrays

julia> @code_warntype f2(@SVector(zeros(Int, 64)))
MethodInstance for f2(::SVector{64, Int64})
  from f2(x) @ Main REPL[64]:1
Arguments
  #self#::Core.Const(f2)
  x::SVector{64, Int64}
Locals
  #13::var"#13#14"
Body::SVector{64, Int64}
1 ─      (#13 = %new(Main.:(var"#13#14")))
│   %2 = #13::Core.Const(var"#13#14"())
│   %3 = Main.map(%2, x)::SVector{64, Int64}
└──      return %3

Apparently map works better for an SVector than for an NTuple, but an SVector dispatches as an AbstractVector rather than a Tuple, so I can’t use one for CartesianIndices.

BUT:
I read up on generated functions, and this lets me circumvent the splatting limit in ind2sub for linear indices:

@generated function ind2sub(t::NTuple{N,Int}, i::Int) where {N}
    # Build an unrolled sequence of mod/div steps, one per dimension,
    # storing each subscript in a fresh variable M_1, ..., M_N.
    ex = :(i -= 1)                                   # switch to 0-based arithmetic
    for j in 1:N
        Mj = Base.Cartesian.inlineanonymous(:M, j)   # the symbol M_j
        ex = quote
            $ex
            $Mj = mod(i, t[$j]) + 1                  # subscript along dimension j, back to 1-based
            i = i ÷ t[$j]
        end
    end
    Mjs = [Base.Cartesian.inlineanonymous(:M, j) for j in 1:N]
    ex = quote
        $ex
        return $(Expr(:tuple, Mjs...))               # assemble (M_1, ..., M_N) without splatting at runtime
    end
    return ex
end

This manually unrolls ind2sub and is reasonably fast! I’m not sure how much faster/slower it is than cartesian indexing though.
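
A quick spot check (mine) against the CartesianIndices convention (column-major, 1-based):

dims = ntuple(_ -> 2, 40)
ind2sub(dims, 5)                                       # (1, 1, 2, 1, 1, ..., 1)
ind2sub(dims, 5) == Tuple(CartesianIndices(dims)[5])   # true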

The map methods in tuple.jl aggressively inline the result tuple, (f(t[1]), f(t[2]), f(t[3]), ...), up to 31 elements; beyond that there is a method for the alias Base.Any32 that just makes a Vector and splats it into a tuple, which does not preserve the length in the inferred type. The reason is to “avoid code blowup”, presumably meaning compiling progressively longer recursively unrolled code for every concrete Tuple type.
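
The cutoff is easy to see from inference (my spot check; 31 elements takes the unrolled path, 32 hits the Any32 fallback):

Base.return_types(t -> map(i -> i + 1, t), (NTuple{31, Int},))   # NTuple{31, Int64}
Base.return_types(t -> map(i -> i + 1, t), (NTuple{32, Int},))   # Tuple: the length is gone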

StaticArrays’ mapreduce.jl, on the other hand, has a generated function _map that fixes the size, but it is only dispatched to if one of the first two array arguments is a StaticArray. For example, map((a, b, i) -> i+1, zeros(Int, 64), zeros(Int, 64), @SVector(zeros(Int, 64))) goes through Base.map and makes a Vector.

I wonder if all this work would even be needed if we could make an undef NTuple and initialize it in a loop that doesn’t get unrolled, exposed to us as an unsafe macro block that disallows assignments to other variables so there’s no capability for actual mutability. Or perhaps we could use Accessors.jl to semantically construct new tuples with progressively more initialized elements, and have the compiler turn that into a single loop over the sequence.
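
Roughly the pattern I have in mind, with Base.setindex standing in for the Accessors-style non-mutating update (just an illustration of the semantics, not a proposed API):

function build_tuple()
    t = ntuple(_ -> 0, Val(64))
    for k in 1:64
        t = Base.setindex(t, k, k)   # today this rebuilds the whole 64-tuple on every iteration
    end
    return t   # the hope: the compiler could lower this to a single initialization loop over one buffer
end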