Just looked at your PR, and this looks like a very good solution. I will try to adapt this for NetCDF and Zarr so that we get identical indexing behavior across these packages.
I think this sounds like a reasonable idea. Your PR for HDF5 solves the problem of pure indexing into these arrays, but not the broadcasting and other problems you get when doing operations on them. Of course, the next step would be to customize broadcasting for these arrays, but a user calling things like sum, reduce, and basically everything else you want to do with an array would still face terrible slowness.
So I agree that an AbstractDiskArray would be a nice array type, where we can define relatively efficient default implementations of broadcasting, reductions, and other operations.
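To illustrate the point, here is a toy sketch of the difference (eachchunk is an assumed helper that yields the index ranges of each stored chunk; it is not an existing API):

```julia
# What the generic Base fallbacks effectively do: one getindex per element,
# which for a disk-backed array can mean one chunk read per element.
naive_sum(A) = sum(A[i] for i in eachindex(A))

# The kind of default an AbstractDiskArray could provide instead:
# read every chunk exactly once and reduce it while it is in memory.
function chunked_sum(A)
    s = zero(eltype(A))
    for ranges in eachchunk(A)   # assumed: iterates tuples of index ranges, one per chunk
        s += sum(A[ranges...])   # one bulk read per chunk
    end
    return s
end
```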
Absolutely agree. We should put together a DiskArrays package at some stage. Maybe it could leverage ChunkedArrayBase.jl for reduce/broadcast efficiency.
Can you explain what aspect of Dagger can help here?
For everyone who was involved in this thread but is not following the JuliaGeo community, I want to announce that we started https://github.com/meggart/DiskArrays.jl. It implements an AbstractDiskArray type which subtypes AbstractArray and replaces Base methods that would be inefficient for random access.
The plan is then to use this package as a dependency for different disk array packages like NetCDF, Zarr, ArchGDAL, or HDF5 to improve the user experience with these arrays and to take away maintenance burden from package authors by having the AbstractArray interface in a single place.
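As a rough illustration of how a backend would plug in (the interface names below are a sketch, not a fixed API), a package would mainly supply its size and a bulk block read, and inherit indexing, broadcasting, and reductions from generic methods:

```julia
using DiskArrays  # assuming it provides AbstractDiskArray and a readblock! hook

# Hypothetical backend type: the indexing/broadcast/reduction machinery comes
# from generic methods defined once in DiskArrays.
struct MyFileArray{T,N} <: DiskArrays.AbstractDiskArray{T,N}
    path::String
    dims::NTuple{N,Int}
end

Base.size(A::MyFileArray) = A.dims

# Read the hyper-rectangle given by `ranges` into the preallocated buffer
# `aout` in one bulk operation; the actual NetCDF/Zarr/HDF5 call would go here.
function DiskArrays.readblock!(A::MyFileArray{T,N}, aout, ranges::Vararg{AbstractUnitRange,N}) where {T,N}
    fill!(aout, zero(T))   # placeholder instead of a real file read
    return aout
end
```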
One question that came up is which organization DiskArrays should belong to. We might host it under JuliaGeo, but since these data formats are not Geo-specific, maybe JuliaArrays would be a better option. So any feedback on the package itself and on the organization choice would be great.
I think JuliaArrays would be more appropriate. I'd also encourage you to make a separate announcement thread for DiskArrays showing it off to the wider Julia community.
I have a question in the meantime though: could you clarify for me the advantages/disadvantages of this relative to the memory-mapped arrays from the standard library?
As an example, I think this would allow nice access to HDF5 files that use chunked (and compressed) data layouts. (HDF5 readmmap is only allowed for contiguous layouts.)
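A minimal sketch of the contrast (file name and size are made up): memory mapping only helps when the bytes on disk are the array verbatim, whereas chunked/compressed data needs explicit chunk reads behind getindex.

```julia
using Mmap

# Contiguous, uncompressed data: the OS pages it in lazily, no explicit reads.
io = open("data.bin")                      # hypothetical raw file of Float64s
A = Mmap.mmap(io, Vector{Float64}, 1_000)
sum(A)

# A chunked and/or compressed HDF5 dataset cannot be mapped like this, because
# the stored bytes are not the in-memory layout; each chunk has to be read
# (and decompressed) explicitly, which is the access pattern a disk-array
# type hides behind getindex.
```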
That sounds great and would reduce the implementation I’m doing right now.
One question: Are you planning to support “intelligent” StepRanges? When sampling an array with a large step, it’s not clear that there aren’t access patterns that would be more efficient. E.g. instead of loading all chunks from the file and exhausting the cache, single reads might be better. A simple start could be to fall back to single reads if step > chunksize.
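A rough one-dimensional sketch of what I mean (names are hypothetical):

```julia
# Once the step exceeds the chunk length, every requested element sits in a
# different chunk anyway, so per-element reads avoid dragging whole chunks
# through the cache.
function read_strided(A::AbstractArray, r::StepRange, chunklen::Integer)
    if step(r) > chunklen
        return [A[i] for i in r]        # one small read per element
    else
        block = A[first(r):last(r)]     # one bulk read covering the range
        return block[1:step(r):end]     # subsample in memory
    end
end
```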
Exactly. In addition, although the package is named “DiskArrays”, it is supposed to play nicely with cloud-backed data (e.g. through Zarr.jl), where chunks are represented as objects in an object store and accessing them has significant overhead due to HTTP transfer.
I have definitely thought about this, and it would be nice to implement something efficient and not too complex here. On the other hand, at least NetCDF and HDF5 already provide special methods to read data that way (e.g. the nc_get_vars_ family in NetCDF), where you can specify a stride along every dimension, so it would still be good to have this access pattern be customisable by the backend and to provide a simple fallback like the one you described.
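For example, something along these lines (all names hypothetical, not an existing API) would let a backend swap in its native strided read while keeping a simple generic fallback:

```julia
abstract type AbstractDiskVariable end

# Generic fallback: read the covering block once, then subsample in memory.
function strided_read(A::AbstractDiskVariable, r::StepRange)
    block = readblock(A, first(r):last(r))   # assumed bulk reader
    return block[1:step(r):end]
end

struct NetCDFLikeVariable <: AbstractDiskVariable end

# Backend override: forward start/count/stride to the library's own strided
# getter (conceptually the nc_get_vars_ family in the NetCDF C API).
function strided_read(A::NetCDFLikeVariable, r::StepRange)
    return native_strided_read(A, first(r), length(r), step(r))  # assumed wrapper
end
```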
Feel free to start an issue in the package and suggest what exactly you had in mind.
The package is registered now and I just announced it here: [ANN] DiskArrays.jl