As already discussed here, I worked on a package that makes it easier to use `AbstractArray`s that map to chunked data on disk. The package provides an `AbstractDiskArray` type which you can subtype to get efficient indexing for common access patterns, efficient reductions, and lazy broadcasting for free. Currently this is only implemented for Zarr.jl, but PRs for NetCDF.jl and ArchGDAL.jl are pending and others are planned.
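For illustration, here is a minimal toy backend that just wraps an in-memory `Array`. The type and its chunk size are made up for this sketch; the methods (`readblock!`, `haschunks`, `eachchunk`) follow the package's documented interface, but see the docs for the authoritative version:

```julia
using DiskArrays
using DiskArrays: AbstractDiskArray

# Toy backend: wraps an in-memory Array and pretends block reads are
# expensive disk accesses. Made up for illustration only.
struct FakeDiskArray{T,N} <: AbstractDiskArray{T,N}
    data::Array{T,N}
    chunksize::NTuple{N,Int}
end

Base.size(a::FakeDiskArray) = size(a.data)

# Copy the requested index ranges into the preallocated output block;
# a real backend would read and decompress a chunk of data here.
function DiskArrays.readblock!(a::FakeDiskArray, aout, r::AbstractUnitRange...)
    aout .= view(a.data, r...)
    return nothing
end

# Describe the chunk grid so that reductions and broadcasting can
# iterate over the data one chunk at a time.
DiskArrays.haschunks(::FakeDiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(a::FakeDiskArray) = DiskArrays.GridChunks(a, a.chunksize)
```

With just these methods, indexing, reductions, and broadcasting on a `FakeDiskArray` should all go through the chunk-aware code paths described below.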
Example
We open a Zarr array on GCS. Note that accessing a single element of the array takes about 18 seconds on my machine (limited by my internet connection), because a whole chunk of data has to be transferred and decompressed to access a single value. So instead of materializing the data, we work with a lazy reference to the array and use normal broadcasting syntax to convert to Celsius and multiply by latitudinal weights:
```julia
using Zarr, Statistics
g = zopen("gs://cmip6/CMIP/NCAR/CESM2/historical/r9i1p1f1/Amon/tas/gn/", consolidated=true)
latweights = reshape(cosd.(g["lat"])[:], 1, 192, 1)
t_celsius = g["tas"] .- 273.15
t_w = t_celsius .* latweights
```

```
Disk Array with size 288 x 192 x 1980
```
Nothing has been computed yet, and the two broadcast operations will be fused. Next we compute the mean over the spatial dimensions:
```julia
mean(t_w, dims=(1, 2)) ./ mean(latweights)
```
This iterates over the array one chunk at a time, making sure that every chunk is accessed and transferred only once. Memory usage therefore stays on the order of the size of a single chunk, and the whole computation takes about a minute, which is roughly the time it takes to read the whole array.
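To make that access pattern concrete, here is a hand-rolled sketch of what such a chunk-wise reduction conceptually does. This is not the package's implementation, just the idea, assuming (as in the package's interface) that `eachchunk` iterates over tuples of chunk index ranges:

```julia
using DiskArrays: eachchunk

# Conceptual chunk-wise mean: each chunk is transferred and decompressed
# exactly once, and only one block is held in memory at a time.
function chunkmean(a)
    s, n = 0.0, 0
    for chunk in eachchunk(a)
        block = a[chunk...]   # one read per chunk
        s += sum(block)
        n += length(block)
    end
    return s / n
end
```

The package's actual reduction machinery is more general (it handles `dims` keywords and arbitrary reduction operators), but the access pattern is the same.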
Implementation
I tried to re-use as much of Julia Base's functionality as possible to implement the custom reductions and broadcasting. So basically every function that falls back to `mapreduce` or `mapfoldl` (which includes `maximum`, `minimum`, `any`, `all`, `sum`, …) gets an efficient access pattern.
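So, for example, all of the following should take the chunk-aware path on the arrays from the example above (a usage sketch, not benchmarked):

```julia
maximum(t_celsius)             # falls back to mapreduce
sum(abs2, t_w)                 # mapreduce with a mapped function
any(x -> x > 40.0, t_celsius)  # also covered by the fallback machinery
```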
For broadcasting, I just wrap the broadcasted expression into a new `BroadcastDiskArray` type, which again subtypes `AbstractDiskArray` to benefit from the changed `getindex` and `mapreduce` behavior.
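Here is a stripped-down, self-contained sketch of how such a mechanism can be built on Julia's broadcasting interface, reusing the toy `FakeDiskArray` from above. All type names (`ChunkStyle`, `LazyDiskBroadcast`) are made up for illustration; this is not the package's actual code:

```julia
import Base.Broadcast: AbstractArrayStyle, Broadcasted, DefaultArrayStyle

# A custom broadcast style marks expressions involving our toy disk array.
struct ChunkStyle{N} <: AbstractArrayStyle{N} end
ChunkStyle{N}(::Val{M}) where {N,M} = ChunkStyle{M}()  # for combining with other styles

Base.Broadcast.BroadcastStyle(::Type{<:FakeDiskArray{<:Any,N}}) where {N} = ChunkStyle{N}()

# Lazy wrapper holding the unevaluated, fused Broadcasted object.
# (The real BroadcastDiskArray subtypes AbstractDiskArray instead of
# AbstractArray, so the chunked getindex/mapreduce apply to it as well.)
struct LazyDiskBroadcast{T,N,B} <: AbstractArray{T,N}
    bc::B
end
Base.size(a::LazyDiskBroadcast) = map(length, axes(a.bc))
Base.getindex(a::LazyDiskBroadcast, I::Integer...) = a.bc[CartesianIndex(I)]

# Broadcasting normally ends in `copy(::Broadcasted)`; returning the
# wrapper here instead of an eagerly computed Array is what keeps
# the expression lazy.
function Base.copy(bc::Broadcasted{ChunkStyle{N}}) where {N}
    bci = Base.Broadcast.instantiate(bc)
    T = Base.Broadcast.combine_eltypes(bc.f, bc.args)
    return LazyDiskBroadcast{T,N,typeof(bci)}(bci)
end
```

With this in place, something like `fa .- 273.15` on a `FakeDiskArray` returns a `LazyDiskBroadcast` immediately, and nothing is evaluated until you index or reduce it.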
I have to admit it was a tough experience to hook into the `mapreduce` and broadcast code, but in the end I was surprised to get the behavior described above with only ~140 LOC.