Taking the array indexing interface seriously

fabiangans · December 10, 2019, 8:25am

Just looked at your PR, this looks like a very good solution. I will try to adapt this for NetCDF and Zarr so that we get identical indexing behaviors across these packages.

fabiangans · December 10, 2019, 8:34am

I think this sounds like a reasonable idea. Your PR for HDF5 solves the problem of plain indexing into these arrays, but not broadcasting and the other problems that arise when operating on them. The next step would of course be to customize broadcasting for these arrays, but a user calling sum, reduce, or basically anything else you would want to do with an array would still face terrible slowness.

So I agree that an AbstractDiskArray would be a nice array type, for which we can define relatively efficient default operations for broadcasting, reductions, and other operations.
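To illustrate the point about reductions, here is a minimal sketch in plain Julia. `CountingArray`, `read_chunk`, and `chunkwise_sum` are hypothetical names that do not come from any package; the struct wraps an in-memory array and only counts how often the "backend" is touched, under the assumption that each call corresponds to one disk read.

```julia
# Hypothetical stand-in for a disk-backed array. It keeps the data in memory
# and counts every access, so the cost of different access patterns is visible.
mutable struct CountingArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
    chunksize::NTuple{N,Int}   # assumed chunk shape of the file format
    reads::Int                 # number of backend accesses so far
end

Base.size(a::CountingArray) = size(a.data)

# Scalar indexing: every element access counts as one read.
function Base.getindex(a::CountingArray{T,N}, i::Vararg{Int,N}) where {T,N}
    a.reads += 1
    return a.data[i...]
end

# Hypothetical block read: a whole hyperrectangle counts as one read.
function read_chunk(a::CountingArray, ranges...)
    a.reads += 1
    return a.data[ranges...]
end

# Sum the array one chunk at a time instead of one element at a time.
function chunkwise_sum(a::CountingArray{T,N}) where {T,N}
    total = zero(T)
    # First index of every chunk along each dimension.
    starts = map((n, c) -> 1:c:n, size(a), a.chunksize)
    for start in Iterators.product(starts...)
        # Clip the last chunk along each dimension to the array bounds.
        ranges = map((s, c, n) -> s:min(s + c - 1, n), start, a.chunksize, size(a))
        total += sum(read_chunk(a, ranges...))
    end
    return total
end

a = CountingArray(reshape(collect(1:600), 20, 30), (8, 8), 0)
```

With this 20×30 array and 8×8 chunks, the generic `sum(a)` goes through scalar `getindex` for every element (600 reads), while `chunkwise_sum(a)` issues 12 block reads. A shared abstract type could provide defaults of the second kind once, instead of every package reimplementing them.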

Raf · December 10, 2019, 10:47am

Absolutely agree. We should put together a DiskArrays package at some stage. Maybe it could leverage ChunkedArrayBase.jl for reduce/broadcast efficiency.

Ratingulate · December 10, 2019, 2:00pm

I think this is something that Dagger is/was meant to help with. cc @shashi

musm · December 11, 2019, 5:56pm

Can you explain which aspect of Dagger can help here?

fabiangans · January 22, 2020, 1:48pm

For everyone who was involved in this thread but is not following the JuliaGeo community, I want to announce that we started https://github.com/meggart/DiskArrays.jl. It implements an AbstractDiskArray type that subtypes AbstractArray and replaces the Base methods that would be inefficient because they rely on random access.

The plan is then to use this package as a dependency for different disk array packages like NetCDF, Zarr, ArchGDAL or HDF5, to improve the user experience when working with these arrays and to take maintenance burden away from package authors by having the AbstractArray interface in a single place.

One question that came up is which organization DiskArrays should belong to. We might host it under JuliaGeo, but since these data formats are not specific to geo data, maybe JuliaArrays would be a better option. So any feedback on the package itself and on the choice of organization would be great.

Mason · January 23, 2020, 1:46am

I think JuliaArrays would be more appropriate. I'd also encourage you to make a separate announcement thread for DiskArrays showing it off to the wider Julia community.

I have a question in the meantime though: could you clarify for me the advantages and disadvantages of this relative to the mmapped arrays from the standard library?

laborg · January 23, 2020, 5:31am

As an example, I think this would allow nice access to HDF5 files that use chunked (and compressed) data layouts. (HDF5 readmmap is only allowed for contiguous layouts.)

laborg · January 23, 2020, 5:43am

That sounds great and would reduce the implementation work I’m doing right now.

One question: Are you planning to support “intelligent” StepRanges? When sampling an array with a large step, loading whole chunks is not necessarily the most efficient access pattern. E.g., instead of loading all chunks from the file and exhausting the cache, single reads might be better. A simple start could be to fall back to single reads if step > chunksize.
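A rough sketch of that fallback rule along a single dimension, in plain Julia. `chunk_of` and `plan_strided_read` are hypothetical helpers and not part of DiskArrays or any backend; the sketch assumes equally sized, full chunks and only estimates how many requests and elements each strategy would involve.

```julia
# 1-based index of the chunk that holds element i, for chunks of length chunksize.
chunk_of(i, chunksize) = (i - 1) ÷ chunksize + 1

# Decide how to serve a strided read along one dimension.
# Assumption: a chunk read always loads a full chunk of `chunksize` elements.
function plan_strided_read(r::StepRange{Int,Int}, chunksize::Int)
    if abs(step(r)) > chunksize
        # Every requested element lies in a different chunk, so loading whole
        # chunks would fetch chunksize elements for each one that is needed.
        return (strategy = :single_reads, requests = length(r),
                elements_loaded = length(r))
    end
    # Otherwise count the distinct chunks that the range touches.
    nchunks = length(unique(chunk_of(i, chunksize) for i in r))
    return (strategy = :chunk_reads, requests = nchunks,
            elements_loaded = nchunks * chunksize)
end

sparse = plan_strided_read(1:250:1000, 100)  # step larger than the chunk size
dense  = plan_strided_read(1:10:1000, 100)   # step smaller than the chunk size
```

Here `sparse.strategy` is `:single_reads` with 4 requests, and `dense.strategy` is `:chunk_reads` with 10 requests. Whether single reads actually win depends on the per-request overhead of the backend, which this sketch does not model.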

fabiangans · January 23, 2020, 8:55am

Exactly. In addition, although the package is named “DiskArrays”, it is supposed to play nicely with cloud-backed data (e.g. through Zarr.jl), where chunks are stored as objects in an object store and accessing them carries significant overhead from the HTTP transfer.

fabiangans · January 23, 2020, 9:06am

I have definitely thought about this and it would be nice to implement something efficient and not too complex here. On the other hand, at least NetCDF and HDF5 already provide special methods to read data that way (e.g. nc_get_vars_ family in NetCDF) where you can specify a stride along every dimension, so it would still be good to have this access pattern be customisable by the backend and provide a simple fallback like the one you described.

Feel free to start an issue in the package and suggest what exactly you had in mind.

fabiangans · February 3, 2020, 3:16pm

The package is registered now and I just announced it here: [ANN] DiskArrays.jl
