In case you revisit the storage format at some point and want to give Zarr a try because you need simpler parallel read/write or object store support: we have now implemented the VLenArray filter for Julia. Here is a small MWE workflow with the latest Zarr.jl version:
First we create some artificial ragged-array data and prepare an output dataset.
using Zarr
#Generate a random array of variable-length vectors
data = [rand(Float64, rand(2:20)) for i=1:100, j=1:80];
#Create dataset on disk, pick a chunk size
s = (100,80)
cs = (30,25)
indata = zcreate(Vector{Float64}, s...,
    chunks = cs,
    compressor = Zarr.BloscCompressor(clevel=3),
    path = "myinputs.zarr",
)
#Write data to disk
indata[:,:] = data;
#Create the output array; it is important here that the chunk sizes match the input
outdata = zcreate(Float64, s..., chunks = cs, compressor=Zarr.NoCompressor(), path = "myoutputs.zarr")
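Since the lock-free parallel write below relies on the two chunk grids lining up, here is an optional sanity check (just a sketch, using the same eachchunk iterator the parallel step maps over):
using DiskArrays: eachchunk
#The chunk grids of input and output should be identical, so every write covers whole chunks
collect(eachchunk(indata)) == collect(eachchunk(outdata))   #should return true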
Then we launch a computation on multiple workers by mapping over the chunks of the data. Here we simply compute the mean of every vector in the array. No locking is necessary as long as you don't write across chunk boundaries:
using Distributed
using DiskArrays: eachchunk
addprocs(3)
@everywhere using Zarr, Statistics
pmap(eachchunk(indata)) do cinds
    #Each worker opens the arrays itself and writes exactly one whole chunk
    zin = zopen("myinputs.zarr")
    zout = zopen("myoutputs.zarr", "w")
    zout[cinds] = mean.(zin[cinds])
    nothing
end
rmprocs(workers())
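If you want to test the chunk-wise logic before spinning up workers, a minimal serial sketch of the same loop (same functions, just without Distributed) would be:
using Zarr, Statistics
using DiskArrays: eachchunk
zin = zopen("myinputs.zarr")
zout = zopen("myoutputs.zarr", "w")
for cinds in eachchunk(zin)
    #Compute the mean of every vector in this chunk and write it back
    zout[cinds] = mean.(zin[cinds])
end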
And you can access the results from Julia:
zopen("myoutputs.zarr")[1:2,1:2]
2×2 Array{Float64,2}:
0.400068 0.576094
0.507728 0.436327
or from Python, so you are not stranded on a Julia island as you would be with JLD:
using PyCall
py"""
import zarr
zin = zarr.open('myinputs.zarr')
zout = zarr.open('myoutputs.zarr')
"""
py"zin[0,0], zout[0,0]"
(PyObject array([0.18110972, 0.7273475 , 0.12056501, 0.44523791, 0.2457608 ,
0.91214274, 0.16907038, 0.83184576, 0.22446035, 0.0884847 ,
0.64998661, 0.20480489]), 0.400068030073534)
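Back in Julia you can do a quick consistency check (a sketch, assuming data from the first step is still in your session):
using Statistics
#The stored results should match the means computed directly in memory
all(zopen("myoutputs.zarr")[:,:] .≈ mean.(data))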