Taking the array indexing interface seriously

Edit: Ok I didn’t read that part very well! Looks like mostly that is described. Although quirks like the one in my first post are not.

But it’s not clear, as @Tamas_Papp suggests, that array-like objects should even treat getindex and setindex! as a standard interface with predictable behavior.’

I’m mostly hoping that that can be the case, and in future situations like indexing in HDF5.jl don’t happen.

Generally I agree that types should not pun on functions from another context. Eg an FTP client and a drawing library should not use the same connect method.

But getindex and setindex! is somewhat of a special case since they have a nice custom syntax with []. So a lot of packages which want to use this syntax define methods without making their types <:AbstractArray, eg DataFrames is a prominent example. I think this is fine as long as the usage is well-documented, again DataFrames.jl is a good example of how to do this well.

1 Like

For sure, it has a lot of uses. But if the indexing looks identical to array indexing and returns an array - it really should stick to the array interface, even if it isn’t an Array itself. This includes HDF5Dataset and many similar Dataset objects that are loaded from disk in other packages.

https://docs.julialang.org/en/v1/manual/arrays/#Omitted-and-extra-indices-1 talks about “extra indices”

I really do suspect that HDF5 is a special case due to the peculiarities of disk-access patterns and its heritage dating from the Julia 0.0.0 days.

3 Likes

Right. HDF5.jl doesn’t support omitted or extra indices either…

Seems like a real <: AbstractArray object would be needed to do these things properly - maybe that’s the answer to this. Just make a wrapper that is actually <: AbstractArray whenever possible.

@fabiangans thinking about what you said a separate interface for disk-based arrays is really something that is missing here.

An AbstractDiskArray that follows AbstractArray indexing correctly, but defines more sensible baviors for disk-backed array-like data - like avoiding the issues with show or broadcast. Traits could be used for chunking? There are quite a few formats that could benefit from this. It would definitely make writing GeoData.jl a lot easier.

BTW I think some of the issues mentioned for HDF5.jl may also apply to Zarr.jl - the indexing doesn’t exactly match AbstractArray

AbstractArrays have a handful of special behaviors, but it really just boils down to these four things:

  • Extensible support for converting the passed indices to a single location or array of locations with to_indices/to_index for each dimension. This is what handles Integer conversions, CartesianIndex, logical indexing (AbstractArray{Bool}s), and indexing by :. Calling to_indices doesn’t actually require ::AbstractArray arguments, but it only gets called by default for AbstractArrays, meaning you can easily opt-in to these behaviors.
  • AbstractArrays of indices get expanded out into the Cartesian product of single-location indices, taking into account the dimensionality of each index (we colloquially call this APL or dimsum indexing).
  • Support for linear indexing.
  • Support for omitted and trailing indices.

All should be documented in the arrays chapter. The AbstractArray interface doesn’t explicitly cover all of these simply because if you <: AbstractArray you don’t need to worry about them — you get them all for free!

There’s a tension for how array-like non-AbstractArrays should be.

9 Likes

Ok that is a great summary. Maybe I’ll integrate that in the introduction paragraph in the array docs.

Using to_indices/to_index is something that isn’t happening in array-like data packages. HDF5.jl, Zarr.jl, my pull request to ArchGDAL.jl (lol), probably a few others.

Some other disk based array packages are slightly altering the indexing syntax although they are <: AbstractArray- ASDF, NetCDF, GeoArrays.

1 Like

Thanks everyone, this should lead to proper indexing for at least HDF5 https://github.com/JuliaIO/HDF5.jl/pull/603 and ArchGDAL.

The key base methods seem to be calling to_index(A, I) which then requires the method eachindex(::LinearIndices, A::MyArrayLike) and setindex_shape_check(value, I...) for the value array in setindex!.

Just looked at your PR, this looks like a very good solution. I will try to adapt this for NetCDF and Zarr so that we get identical indexing behaviors across these packages.

I think this sounds like a reasonable idea. Your PR for HDF5 solves the problem of pure indexing into these arrays, but not the broadcast and other problems that you get when doing operations on these. Of course the next step would be to customize broadcasting on these arrays, but still a user calling stuff like sum, reduce, and basically everything you want to do with and array would face terrible slowness.

So I agree that an AbstractDiscArray would be a nice array type, where we can define relatively efficient default operations for broadcasting, reductions, and other operations.

Absolutely agree. We should put together a DiskArrays package at some stage. Maybe it could leverage ChunkedArrayBase.jl for reduce/broadcast efficiency.

1 Like

I think this is something that dagger is/was meant to help with cc @shashi

can you explain what aspect of dagger can help here?

For everyone who was involved in this thread, but is not following the JuliaGeo community, I want to announce that we started https://github.com/meggart/DiskArrays.jl It implements an AbstractDiskArray type which subtypes AbstractArray and replaces Base methods that would be inefficient in random access.

The plan is then to use this package as a dependency for different disk array packages like NetCDF, Zarr, ArchGDAL or HDF5 to improve the user-experience when they use these arrays and to take away maintenance burden from package authors by having the Abstract Array interface in a single place.

One question that came up is which organization this DiskArrays should belong to. We might host it under JuliaGeo, but since these data formats are not only Geo-format specific maybe JuliaArrays would be a better option. So, any feedback on the package itself and the organization choice would be great.

9 Likes

I think JuliaArrays would be more appropriate. I‘d also encourage you to make a separate announcement thread for DiskArrays showing it off to the wider Julia community.

I have a question in the meantime though: could you clarify for me the advantages / disadvantage of this relative to the mmaped arrays from the standard library?

1 Like

As an example, I think this would allow nice access to HDF5 files that uses chunked (and compressed) data layouts. (HDF5 readmmap is only allowed for contiguous layout)

2 Likes

That sounds great and would reduce the implementation I’m doing right now.

One question: Are you planing to support a “intelligent” StepRanges? When sampling an array with a large step it’s not clear if there aren’t access patterns that would be more efficient. eg. Instead of loading all chunks from the file and exhausting the cache, single reads might be better. A simple start could be to fall back to single reads if step > chunksize.

1 Like

Exactly. In adition, although the package is named “DiskArrays”, it is supposed to play nicely together with cloud-backed data (e.g. through Zarr.jl), where chunks are represented as objects in an object storage and accessing them has significant overhead through http transfer.

3 Likes

I have definitely thought about this and it would be nice to implement something efficient and not too complex here. On the other hand, at least NetCDF and HDF5 already provide special methods to read data that way (e.g. nc_get_vars_ family in NetCDF) where you can specify a stride along every dimension, so it would still be good to have this access pattern be customisable by the backend and provide a simple fallback like the one you described.

Feel free to start an issue in the package and suggest what exactly you had in mind.

2 Likes