Taking the array indexing interface seriously

Raf · December 9, 2019, 4:42am

I’m doing some work standardizing array interfaces for the spatial data ecosystem in
HDF5, ArchGDAL and in GeoData.

It’s not uncommon for widely use packages to have non-standard indexing inputs and outputs. HDF5.jl currently doesn’t drop dims indexed with Int, so it returns unexpected array dimensions and forces you to write a bunch of special casing code. But the required indexing behavior isn’t so easy to track down.

I keep finding weird inputs that are actually allowed in Base, like:

julia> [1, 2][2, 1, 1, 1]
2

It would be great to have a more comprehensive list of exactly what array dimensions should be returned given all allowed inputs, so that packages providing arrays or array-like data can be more easily standardized. Maybe as part of the interfaces manual?

Tamas_Papp · December 9, 2019, 7:51am

You are running into getindex quietly ignoring an arbitrary number of excess 1s.

It is a wart, but it is somewhat deeply tied into a lot of things, and unlikely to go away until 2.0.

Raf · December 9, 2019, 8:02am

Yeah. But its not clear what indexing behaviours should be implemented and which are artefacts that dont really matter. Thats what needs to be clarified.

tim.holy · December 9, 2019, 9:58am

You may know this already, but anything written to the Interfaces · The Julia Language interface gets all the indexing behavior for free—you don’t have to implement complex behaviors.

HDF5 was started long before the array interface was standardized, so the package may just need some updating. EDIT: I should acknowledge that efficient disk access patterns are different from efficient RAM access patterns, so indeed HDF5 may need some special “chunking” methods. However, that should not apply to the majority of AbstractArray types.

Tamas_Papp · December 9, 2019, 10:10am

Indeed: one also gets the getindex tail behavior for free:

julia> struct MyArray{N} <: AbstractArray{Int,N}
       _size::NTuple{N,Int}
       end

julia> Base.size(m::MyArray) = m._size

julia> Base.getindex(m::MyArray{N}, I::Vararg{Int,N}) where {N} = sum(I)

julia> A = MyArray((2, 3, 4))
2×3×4 MyArray{3}:
[:, :, 1] =
A[1,2,3]
 3  4  5
 4  5  6

[:, :, 2] =
 4  5  6
 5  6  7

[:, :, 3] =
 5  6  7
 6  7  8

[:, :, 4] =
 6  7  8
 7  8  9

julia> A[1,2,3]
6

julia> A[1,2,3,1]
6

Raf · December 9, 2019, 11:07am

Thanks. I guess I’m mostly discussing the case where the object is not <: AbstractArray{T,N}, but implements rudimentary array-like indexing, as in HDF5Dataset. This can be hard to avoid when loading from disk or some C interface, as T and N aren’t necessarily known or fixed when the indexed object is constructed.

The answer might be that these objects need a conversion step to an object that is <: AbstractArray.

Tamas_Papp · December 9, 2019, 11:42am

If the object is not an <:AbstractArray, then you are not bound by the rules of the latter — the implementation can do arbitrary things. Moreover, a lot of methods that require an <: AbtractArray will not accept these objects (as arrays).

I am not familiar with the internals of the above packages, but either a conversion wrapper or possibly direct support for the array interface could be beneficial.

fabiangans · December 9, 2019, 11:55am

I am not familiar with the internals of the above packages, but either a conversion wrapper or possibly direct support for the array interface could be beneficial.

As Tim Holy mentioned, the AbstractArray interface is optimized for RAM access and not for disk access. If you tried, for example, to make a HDF5 variable a subtype of AbstractArray, this would give you a lot of behavior for free, but it would be very inefficient. A simple call to show in the REPL can take ages, because the AbstractArray interface reads every shown value through a separate call to the C interface, while the values might be compressed in chunks of a some GB size. The same is true for Colon-indexing etc.

I once suggested this GitHub - meggart/ChunkedArrayBase.jl but it did not get a big amount of traction in other packages so far. Common interface for chunked array stores? · Issue #557 · JuliaIO/HDF5.jl · GitHub

Raf · December 9, 2019, 12:10pm

I know that theoretically they can return anything from getindex, but I’m saying that in practice they probably shouldn’t.

Returning ie. a non-standard dimension array when using familiar indexing syntax is a horror show for the ecosystem. A package shouldn’t use getindex if you are going to do this, a different method should be considered, that doesn’t suggest a common set of behaviors when passed Int/AbstractArray args.

@fabiangans For sure, not being in <: AbstractArray is often necessary. But if so we should still treat getindex as a standard interface. Most properties of the returned array are orthogonal to the performance concerns of actually loading it - reshape doesn’t touch the underlying data.

There are issues like this for HDF5, and the fix isn’t controversial.
https://github.com/JuliaIO/HDF5.jl/issues/551

But it’s not really clear that these changes should be the accepted standard for getindex/setindex! of array like objects outside of AbstractArray.

Tamas_Papp · December 9, 2019, 12:11pm

I think the best way to deal with these issues is to implement these methods in a more efficient way. There is nothing that precludes this in the <: AbstractArray interface: the list of methods there are but a lower bound. Several types in Base specialize methods for efficiency (eg searchsortedfirst for ranges).

In particular, Base.show should never print a GB of data, the bottleneck for that will probably be IO with the IDE. It should respect print limits of the IOContext.

tim.holy · December 9, 2019, 12:41pm

Looking at this again, there really is quite a lot of documentation about array behavior at Multi-dimensional Arrays · The Julia Language. When that isn’t enough, stepping into a getindex call with the debugger can help reveal the internal mechanisms.

Raf · December 9, 2019, 1:05pm

Edit: Ok I didn’t read that part very well! Looks like mostly that is described. Although quirks like the one in my first post are not.

But it’s not clear, as @Tamas_Papp suggests, that array-like objects should even treat getindex and setindex! as a standard interface with predictable behavior.’

I’m mostly hoping that that can be the case, and in future situations like indexing in HDF5.jl don’t happen.

Tamas_Papp · December 9, 2019, 1:25pm

Generally I agree that types should not pun on functions from another context. Eg an FTP client and a drawing library should not use the same connect method.

But getindex and setindex! is somewhat of a special case since they have a nice custom syntax with []. So a lot of packages which want to use this syntax define methods without making their types <:AbstractArray, eg DataFrames is a prominent example. I think this is fine as long as the usage is well-documented, again DataFrames.jl is a good example of how to do this well.

Raf · December 9, 2019, 1:28pm

For sure, it has a lot of uses. But if the indexing looks identical to array indexing and returns an array - it really should stick to the array interface, even if it isn’t an Array itself. This includes HDF5Dataset and many similar Dataset objects that are loaded from disk in other packages.

tim.holy · December 9, 2019, 1:29pm

Single- and multi-dimensional Arrays · The Julia Language talks about “extra indices”

I really do suspect that HDF5 is a special case due to the peculiarities of disk-access patterns and its heritage dating from the Julia 0.0.0 days.

Raf · December 9, 2019, 1:47pm

Right. HDF5.jl doesn’t support omitted or extra indices either…

Seems like a real <: AbstractArray object would be needed to do these things properly - maybe that’s the answer to this. Just make a wrapper that is actually <: AbstractArray whenever possible.

Raf · December 9, 2019, 2:02pm

@fabiangans thinking about what you said a separate interface for disk-based arrays is really something that is missing here.

An AbstractDiskArray that follows AbstractArray indexing correctly, but defines more sensible baviors for disk-backed array-like data - like avoiding the issues with show or broadcast. Traits could be used for chunking? There are quite a few formats that could benefit from this. It would definitely make writing GeoData.jl a lot easier.

BTW I think some of the issues mentioned for HDF5.jl may also apply to Zarr.jl - the indexing doesn’t exactly match AbstractArray

mbauman · December 9, 2019, 5:16pm

AbstractArrays have a handful of special behaviors, but it really just boils down to these four things:

Extensible support for converting the passed indices to a single location or array of locations with to_indices/to_index for each dimension. This is what handles Integer conversions, CartesianIndex, logical indexing (AbstractArray{Bool}s), and indexing by :. Calling to_indices doesn’t actually require ::AbstractArray arguments, but it only gets called by default for AbstractArrays, meaning you can easily opt-in to these behaviors.
AbstractArrays of indices get expanded out into the Cartesian product of single-location indices, taking into account the dimensionality of each index (we colloquially call this APL or dimsum indexing).
Support for linear indexing.
Support for omitted and trailing indices.

All should be documented in the arrays chapter. The AbstractArray interface doesn’t explicitly cover all of these simply because if you <: AbstractArray you don’t need to worry about them — you get them all for free!

There’s a tension for how array-like non-AbstractArrays should be.

Raf · December 9, 2019, 10:37pm

Ok that is a great summary. Maybe I’ll integrate that in the introduction paragraph in the array docs.

Using to_indices/to_index is something that isn’t happening in array-like data packages. HDF5.jl, Zarr.jl, my pull request to ArchGDAL.jl (lol), probably a few others.

Some other disk based array packages are slightly altering the indexing syntax although they are <: AbstractArray- ASDF, NetCDF, GeoArrays.

Raf · December 10, 2019, 1:38am

Thanks everyone, this should lead to proper indexing for at least HDF5 https://github.com/JuliaIO/HDF5.jl/pull/603 and ArchGDAL.

The key base methods seem to be calling to_index(A, I) which then requires the method eachindex(::LinearIndices, A::MyArrayLike) and setindex_shape_check(value, I...) for the value array in setindex!.

Topic		Replies	Views
Proposal: working with larger than memory data in hdf5 format using HDF5Arrays (implementation of DiskArrays.jl for HDF5) Data hdf5	11	1730	November 4, 2020
Is OffsetArrays.jl a poison pill? Internals & Design	63	5104	December 5, 2022
Does `AbstractArray` itself need to support indexing beyond 1-based? Internals & Design indexing	15	1492	July 5, 2022
Interpolate over large HDF5 arrays General Usage hdf5 , interpolations	13	215	March 23, 2025
`eachindex(x)` replacement for `1:length(x)-1` and similar General Usage	27	2897	May 24, 2022

Taking the array indexing interface seriously

Related topics