As already discussed here, I worked on a package that makes it easier to use `AbstractArray`s that map to chunked data on disk. The package provides an `AbstractDiskArray` type which you can subtype to get efficient indexing for common access patterns, efficient reductions, and lazy broadcasting for free. Currently this is only implemented for Zarr.jl, but PRs for NetCDF.jl and ArchGDAL.jl are pending and others are planned.
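For illustration, here is a minimal toy backend that just wraps an in-memory `Array`. The type and its chunk size are made up for this sketch; the methods (`readblock!`, `haschunks`, `eachchunk`) follow the package's documented interface, but see the docs for the authoritative version:

```julia
using DiskArrays
using DiskArrays: AbstractDiskArray

# Toy backend: wraps an in-memory Array and pretends block reads are
# expensive disk accesses. Made up for illustration only.
struct FakeDiskArray{T,N} <: AbstractDiskArray{T,N}
    data::Array{T,N}
    chunksize::NTuple{N,Int}
end

Base.size(a::FakeDiskArray) = size(a.data)

# Copy the requested index ranges into the preallocated output block;
# a real backend would read and decompress a chunk of data here.
function DiskArrays.readblock!(a::FakeDiskArray, aout, r::AbstractUnitRange...)
    aout .= view(a.data, r...)
    return nothing
end

# Describe the chunk grid so that reductions and broadcasting can
# iterate over the data one chunk at a time.
DiskArrays.haschunks(::FakeDiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(a::FakeDiskArray) = DiskArrays.GridChunks(a, a.chunksize)
```

With just these methods, indexing, reductions, and broadcasting on a `FakeDiskArray` should all go through the chunk-aware code paths described below.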
Example
We open a Zarr array on GCS. Note that accessing a single element of the array takes about 18 seconds on my machine (limited by my internet connection), because a whole chunk of data has to be transferred and decompressed to access a single value. So instead of materializing the data, we work with a lazy reference to the array and use normal broadcasting syntax to convert to Celsius and multiply by latitudinal weights:
```julia
using Zarr, Statistics
g = zopen("gs://cmip6/CMIP/NCAR/CESM2/historical/r9i1p1f1/Amon/tas/gn/", consolidated=true)
latweights = reshape(cosd.(g["lat"])[:], 1, 192, 1)
t_celsius = g["tas"] .- 273.15
t_w = t_celsius .* latweights
```

```
Disk Array with size 288 x 192 x 1980
```
Nothing has been computed yet, and the two broadcast operations will be fused. Next we compute the mean over the spatial dimensions:
```julia
mean(t_w, dims=(1, 2)) ./ mean(latweights)
```
This iterates over the array one chunk at a time, making sure that every chunk is accessed and transferred only once. Memory usage therefore stays on the order of the size of a single chunk, and the whole computation takes about a minute, which is roughly the time it takes to read the whole array.
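To make that access pattern concrete, here is a hand-rolled sketch of what such a chunk-wise reduction conceptually does. This is not the package's implementation, just the idea, assuming (as in the package's interface) that `eachchunk` iterates over tuples of chunk index ranges:

```julia
using DiskArrays: eachchunk

# Conceptual chunk-wise mean: each chunk is transferred and decompressed
# exactly once, and only one block is held in memory at a time.
function chunkmean(a)
    s, n = 0.0, 0
    for chunk in eachchunk(a)
        block = a[chunk...]   # one read per chunk
        s += sum(block)
        n += length(block)
    end
    return s / n
end
```

The package's actual reduction machinery is more general (it handles `dims` keywords and arbitrary reduction operators), but the access pattern is the same.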
Implementation
I tried to re-use as much of Julia Base's functionality as possible to implement the custom reductions and broadcasting. So basically every function that falls back to `mapreduce` or `mapfoldl` (which includes `maximum`, `minimum`, `any`, `all`, `sum`, …) gets an efficient access pattern.
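So, for example, all of the following should take the chunk-aware path on the arrays from the example above (a usage sketch, not benchmarked):

```julia
maximum(t_celsius)             # falls back to mapreduce
sum(abs2, t_w)                 # mapreduce with a mapped function
any(x -> x > 40.0, t_celsius)  # also covered by the fallback machinery
```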
For broadcasting, I just wrap the broadcasted expression into a new `BroadcastDiskArray` type, which again subtypes `AbstractDiskArray` to benefit from the changed `getindex` and `mapreduce` behavior.
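Here is a stripped-down, self-contained sketch of how such a mechanism can be built on Julia's broadcasting interface, reusing the toy `FakeDiskArray` from above. All type names (`ChunkStyle`, `LazyDiskBroadcast`) are made up for illustration; this is not the package's actual code:

```julia
import Base.Broadcast: AbstractArrayStyle, Broadcasted, DefaultArrayStyle

# A custom broadcast style marks expressions involving our toy disk array.
struct ChunkStyle{N} <: AbstractArrayStyle{N} end
ChunkStyle{N}(::Val{M}) where {N,M} = ChunkStyle{M}()  # for combining with other styles

Base.Broadcast.BroadcastStyle(::Type{<:FakeDiskArray{<:Any,N}}) where {N} = ChunkStyle{N}()

# Lazy wrapper holding the unevaluated, fused Broadcasted object.
# (The real BroadcastDiskArray subtypes AbstractDiskArray instead of
# AbstractArray, so the chunked getindex/mapreduce apply to it as well.)
struct LazyDiskBroadcast{T,N,B} <: AbstractArray{T,N}
    bc::B
end
Base.size(a::LazyDiskBroadcast) = map(length, axes(a.bc))
Base.getindex(a::LazyDiskBroadcast, I::Integer...) = a.bc[CartesianIndex(I)]

# Broadcasting normally ends in `copy(::Broadcasted)`; returning the
# wrapper here instead of an eagerly computed Array is what keeps
# the expression lazy.
function Base.copy(bc::Broadcasted{ChunkStyle{N}}) where {N}
    bci = Base.Broadcast.instantiate(bc)
    T = Base.Broadcast.combine_eltypes(bc.f, bc.args)
    return LazyDiskBroadcast{T,N,typeof(bci)}(bci)
end
```

With this in place, something like `fa .- 273.15` on a `FakeDiskArray` returns a `LazyDiskBroadcast` immediately, and nothing is evaluated until you index or reduce it.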
I have to admit it was a tough experience to hook into the `mapreduce` and broadcast code, but in the end I was surprised to get the behavior described above with only ~140 LOC.