Taking the array indexing interface seriously

I’m doing some work standardizing array interfaces for the spatial data ecosystem in
HDF5, ArchGDAL and in GeoData.

It’s not uncommon for widely use packages to have non-standard indexing inputs and outputs. HDF5.jl currently doesn’t drop dims indexed with Int, so it returns unexpected array dimensions and forces you to write a bunch of special casing code. But the required indexing behavior isn’t so easy to track down.

I keep finding weird inputs that are actually allowed in Base, like:

julia> [1, 2][2, 1, 1, 1]
2

It would be great to have a more comprehensive list of exactly what array dimensions should be returned given all allowed inputs, so that packages providing arrays or array-like data can be more easily standardized. Maybe as part of the interfaces manual?

3 Likes

You are running into getindex quietly ignoring an arbitrary number of excess 1s.

It is a wart, but it is somewhat deeply tied into a lot of things, and unlikely to go away until 2.0.

1 Like

Yeah. But its not clear what indexing behaviours should be implemented and which are artefacts that dont really matter. Thats what needs to be clarified.

You may know this already, but anything written to the https://docs.julialang.org/en/v1/manual/interfaces/#man-interface-array-1 interface gets all the indexing behavior for free—you don’t have to implement complex behaviors.

HDF5 was started long before the array interface was standardized, so the package may just need some updating. EDIT: I should acknowledge that efficient disk access patterns are different from efficient RAM access patterns, so indeed HDF5 may need some special “chunking” methods. However, that should not apply to the majority of AbstractArray types.

4 Likes

Indeed: one also gets the getindex tail behavior for free:

julia> struct MyArray{N} <: AbstractArray{Int,N}
       _size::NTuple{N,Int}
       end

julia> Base.size(m::MyArray) = m._size

julia> Base.getindex(m::MyArray{N}, I::Vararg{Int,N}) where {N} = sum(I)

julia> A = MyArray((2, 3, 4))
2×3×4 MyArray{3}:
[:, :, 1] =
A[1,2,3]
 3  4  5
 4  5  6

[:, :, 2] =
 4  5  6
 5  6  7

[:, :, 3] =
 5  6  7
 6  7  8

[:, :, 4] =
 6  7  8
 7  8  9

julia> A[1,2,3]
6

julia> A[1,2,3,1]
6

Thanks. I guess I’m mostly discussing the case where the object is not <: AbstractArray{T,N}, but implements rudimentary array-like indexing, as in HDF5Dataset. This can be hard to avoid when loading from disk or some C interface, as T and N aren’t necessarily known or fixed when the indexed object is constructed.

The answer might be that these objects need a conversion step to an object that is <: AbstractArray.

If the object is not an <:AbstractArray, then you are not bound by the rules of the latter — the implementation can do arbitrary things. Moreover, a lot of methods that require an <: AbtractArray will not accept these objects (as arrays).

I am not familiar with the internals of the above packages, but either a conversion wrapper or possibly direct support for the array interface could be beneficial.

1 Like

I am not familiar with the internals of the above packages, but either a conversion wrapper or possibly direct support for the array interface could be beneficial.

As Tim Holy mentioned, the AbstractArray interface is optimized for RAM access and not for disk access. If you tried, for example, to make a HDF5 variable a subtype of AbstractArray, this would give you a lot of behavior for free, but it would be very inefficient. A simple call to show in the REPL can take ages, because the AbstractArray interface reads every shown value through a separate call to the C interface, while the values might be compressed in chunks of a some GB size. The same is true for Colon-indexing etc.

I once suggested this https://github.com/meggart/ChunkedArrayBase.jl but it did not get a big amount of traction in other packages so far. https://github.com/JuliaIO/HDF5.jl/issues/557

I know that theoretically they can return anything from getindex, but I’m saying that in practice they probably shouldn’t.

Returning ie. a non-standard dimension array when using familiar indexing syntax is a horror show for the ecosystem. A package shouldn’t use getindex if you are going to do this, a different method should be considered, that doesn’t suggest a common set of behaviors when passed Int/AbstractArray args.

@fabiangans For sure, not being in <: AbstractArray is often necessary. But if so we should still treat getindex as a standard interface. Most properties of the returned array are orthogonal to the performance concerns of actually loading it - reshape doesn’t touch the underlying data.

There are issues like this for HDF5, and the fix isn’t controversial.

But it’s not really clear that these changes should be the accepted standard for getindex/setindex! of array like objects outside of AbstractArray.

I think the best way to deal with these issues is to implement these methods in a more efficient way. There is nothing that precludes this in the <: AbstractArray interface: the list of methods there are but a lower bound. Several types in Base specialize methods for efficiency (eg searchsortedfirst for ranges).

In particular, Base.show should never print a GB of data, the bottleneck for that will probably be IO with the IDE. It should respect print limits of the IOContext.

Looking at this again, there really is quite a lot of documentation about array behavior at https://docs.julialang.org/en/v1/manual/arrays/. When that isn’t enough, stepping into a getindex call with the debugger can help reveal the internal mechanisms.

2 Likes

Edit: Ok I didn’t read that part very well! Looks like mostly that is described. Although quirks like the one in my first post are not.

But it’s not clear, as @Tamas_Papp suggests, that array-like objects should even treat getindex and setindex! as a standard interface with predictable behavior.’

I’m mostly hoping that that can be the case, and in future situations like indexing in HDF5.jl don’t happen.

Generally I agree that types should not pun on functions from another context. Eg an FTP client and a drawing library should not use the same connect method.

But getindex and setindex! is somewhat of a special case since they have a nice custom syntax with []. So a lot of packages which want to use this syntax define methods without making their types <:AbstractArray, eg DataFrames is a prominent example. I think this is fine as long as the usage is well-documented, again DataFrames.jl is a good example of how to do this well.

1 Like

For sure, it has a lot of uses. But if the indexing looks identical to array indexing and returns an array - it really should stick to the array interface, even if it isn’t an Array itself. This includes HDF5Dataset and many similar Dataset objects that are loaded from disk in other packages.

https://docs.julialang.org/en/v1/manual/arrays/#Omitted-and-extra-indices-1 talks about “extra indices”

I really do suspect that HDF5 is a special case due to the peculiarities of disk-access patterns and its heritage dating from the Julia 0.0.0 days.

3 Likes

Right. HDF5.jl doesn’t support omitted or extra indices either…

Seems like a real <: AbstractArray object would be needed to do these things properly - maybe that’s the answer to this. Just make a wrapper that is actually <: AbstractArray whenever possible.

@fabiangans thinking about what you said a separate interface for disk-based arrays is really something that is missing here.

An AbstractDiskArray that follows AbstractArray indexing correctly, but defines more sensible baviors for disk-backed array-like data - like avoiding the issues with show or broadcast. Traits could be used for chunking? There are quite a few formats that could benefit from this. It would definitely make writing GeoData.jl a lot easier.

BTW I think some of the issues mentioned for HDF5.jl may also apply to Zarr.jl - the indexing doesn’t exactly match AbstractArray

AbstractArrays have a handful of special behaviors, but it really just boils down to these four things:

  • Extensible support for converting the passed indices to a single location or array of locations with to_indices/to_index for each dimension. This is what handles Integer conversions, CartesianIndex, logical indexing (AbstractArray{Bool}s), and indexing by :. Calling to_indices doesn’t actually require ::AbstractArray arguments, but it only gets called by default for AbstractArrays, meaning you can easily opt-in to these behaviors.
  • AbstractArrays of indices get expanded out into the Cartesian product of single-location indices, taking into account the dimensionality of each index (we colloquially call this APL or dimsum indexing).
  • Support for linear indexing.
  • Support for omitted and trailing indices.

All should be documented in the arrays chapter. The AbstractArray interface doesn’t explicitly cover all of these simply because if you <: AbstractArray you don’t need to worry about them — you get them all for free!

There’s a tension for how array-like non-AbstractArrays should be.

6 Likes

Ok that is a great summary. Maybe I’ll integrate that in the introduction paragraph in the array docs.

Using to_indices/to_index is something that isn’t happening in array-like data packages. HDF5.jl, Zarr.jl, my pull request to ArchGDAL.jl (lol), probably a few others.

Some other disk based array packages are slightly altering the indexing syntax although they are <: AbstractArray- ASDF, NetCDF, GeoArrays.

1 Like

Thanks everyone, this should lead to proper indexing for at least HDF5 https://github.com/JuliaIO/HDF5.jl/pull/603 and ArchGDAL.

The key base methods seem to be calling to_index(A, I) which then requires the method eachindex(::LinearIndices, A::MyArrayLike) and setindex_shape_check(value, I...) for the value array in setindex!.