A Julia-compatible alternative to zarr


#1

Dear Julia community,

we are looking for an object-storage-compatible file format for quite large dense N-dimensional arrays. Our project partners suggested the zarr format (http://zarr.readthedocs.io/en/stable/), but there seems to be only a Python implementation. I am a bit afraid of the overhead that PyCall would add for every read/write, so my question is whether you know of alternative formats that already have a Julia wrapper, or at least a C API that could be wrapped.

Thanks in advance for your suggestions!


#2

I would just use a large file and mmap a Vector{T}, which I would then reshape to an n-dimensional Array. This is fast and efficient, and rather seamless in Julia (you can store metadata, e.g. the actual dimensions, in a separate file). I assume that Python has similar capabilities, so you can interoperate.

Also, if T is a widely used conventional bits type, e.g. Int32 or Float64, you should be able to use the same file from other languages.
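A minimal sketch of that approach (the file name is hypothetical, and the dimensions would live in your separate metadata file):

```julia
using Mmap

# Write a flat binary file, then memory-map it back as a Vector{Float64}
# and reshape to the original dimensions.
dims = (100, 50)
open("data.bin", "w") do io
    write(io, rand(Float64, dims))
end

io = open("data.bin")
A = reshape(Mmap.mmap(io, Vector{Float64}, prod(dims)), dims)
close(io)  # on POSIX systems the mapping stays valid after closing the stream

size(A)  # (100, 50)
```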


#3

Thanks,

yes, T will be a conventional bits type. I forgot to mention that we would like to store the data in compressed chunks, so the Mmap solution you suggested would become a bit more complicated. I know I could use a compression library directly and then only take care that the chunks are collected into an AbstractArray, probably by implementing the AbstractArray interface.

However, I thought that before inventing a custom file/metadata format, one might be able to reuse something that already exists.
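To illustrate the AbstractArray idea, here is a minimal hypothetical sketch (uncompressed chunks for brevity; in practice each chunk would be stored compressed and decompressed on access):

```julia
# Minimal sketch: a 1-D array backed by fixed-size chunks.
struct ChunkedVector{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
    chunklen::Int
    len::Int
end

function ChunkedVector(v::Vector{T}, chunklen::Int) where {T}
    chunks = [v[i:min(i + chunklen - 1, length(v))] for i in 1:chunklen:length(v)]
    ChunkedVector{T}(chunks, chunklen, length(v))
end

Base.size(c::ChunkedVector) = (c.len,)

function Base.getindex(c::ChunkedVector, i::Int)
    ci, off = divrem(i - 1, c.chunklen)  # which chunk, offset within it
    c.chunks[ci + 1][off + 1]
end

cv = ChunkedVector(collect(1.0:10.0), 4)
cv[7]  # 7.0
```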


#4

Compression is a huge hassle and will slow down access by orders of magnitude. If the data compresses so well that it is seemingly worthwhile, I would consider some recoding/bit packing, basically anything to make it more compact.

Just to give an example, frequently I find that I can easily repack dates into 16 bits using a custom epoch, so I wrote a small library
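The date-repacking idea can be sketched like this (the epoch is a hypothetical choice; an Int16 day offset covers roughly ±89 years around it):

```julia
using Dates

const EPOCH = Date(1950, 1, 1)  # hypothetical custom epoch

# Pack a Date into 16 bits as a day offset from EPOCH.
pack(d::Date) = Int16(Dates.value(d - EPOCH))
unpack(x::Int16) = EPOCH + Day(x)

d = Date(2018, 6, 1)
unpack(pack(d)) == d  # true
```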

I find mmap so convenient and fast that it is worth the extra steps.


#5

The reason compression works so well is that we have gridded geodata where most of the datasets are land-only and have some gaps, so approximately 70-90% of the values are missing. The missing ratio is thus in a range where it does not yet make sense to switch to a sparse data structure, but compression greatly reduces the disk space used, even at the lowest compression level.


#6

It would be good to test whether it is indeed much slower over PyCall for your typical usage.
I don’t know of an existing alternative, but the components seem to be there. Zarr uses Blosc for compression, and we have libraries to deal with chunking as well, such as DistributedArrays and Blocks.
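A quick round-trip with Blosc.jl (assuming the package is installed) would let you check the compression ratio for data with many missing values:

```julia
using Blosc  # assumes the Blosc.jl package is installed

# Data dominated by a repeated fill value compresses very well.
x = fill(NaN, 1_000_000)
x[1:100_000] .= rand(100_000)

buf = Blosc.compress(x)            # Vector{UInt8}
y = Blosc.decompress(Float64, buf)

ratio = sizeof(x) / length(buf)    # compression ratio
isequal(x, y)  # true (isequal treats NaN as equal to NaN)
```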


#7

Would HDF5 suit your needs (it supports compression and chunking)?
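For reference, a chunked and compressed dataset with HDF5.jl might look roughly like this (keyword names have varied between HDF5.jl versions, so treat this as a sketch):

```julia
using HDF5  # assumes the HDF5.jl package is installed

A = rand(Float64, 1000, 1000)
h5open("example.h5", "w") do f
    # chunked dataset with gzip (deflate) compression level 3
    dset = create_dataset(f, "data", Float64, (1000, 1000);
                          chunk=(100, 100), deflate=3)
    write(dset, A)
end

B = h5read("example.h5", "data")
A == B  # true
```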


#8

Seconded. I can’t see any reason in this thread why you couldn’t use HDF5, and it has the bonus that it is widely used: most major languages have a wrapper package around the C HDF5 library.

One thing to beware of, though: last time I played with HDF5, if you had a huge number of reads and writes in a single session, i.e. in the hundreds of thousands, then the quit() command at the end of the Julia session took quite a long time to run. I never did manage to work out why.


#9

The point is that we want to store the data in the cloud as object storage (e.g. on S3) and do parallel reads and writes to single files. There is HDF Cloud (https://www.hdfgroup.org/solutions/hdf-cloud/), but it does not seem to be open and would require us to run an extra service, so we are looking for something that works out of the box there. I have to admit that I am not the expert on these storage technologies, but our project partners said HDF5 would not be an option.

I think I will try PyCall with zarr for our use cases first and see how it performs.
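A first PyCall experiment might look roughly like this (assumes PyCall.jl and the Python zarr package are installed; `set!`/`get` with `Ellipsis` stand in for Python's `z[...]` indexing):

```julia
using PyCall  # assumes PyCall.jl and the Python `zarr` package are installed

zarr = pyimport("zarr")
z = zarr.open("example.zarr", mode="w", shape=(100, 100),
              chunks=(10, 10), dtype="f8")

A = rand(100, 100)
set!(z, pybuiltin("Ellipsis"), A)  # z[...] = A
B = convert(Matrix{Float64}, get(z, PyAny, pybuiltin("Ellipsis")))  # read back
```

Timing the read/write calls on realistically sized chunks should show whether the PyCall overhead actually matters compared to the I/O itself.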


#10

Ah. I didn’t realise you were after a cloud solution. Given that the object is stored in the cloud, my guess is the bottleneck will be upload/download speed, rather than PyCall, so hopefully that should work out fine. But, as you say, best to test it out first.

Cheers,

Colin


#11

Just being curious: how large is the data? E.g. compressed/uncompressed sizes in GB, in total and for a typical array.


#12

@fabiangans Why the insistence on parallel I/O? Do you really need that? As Tamas Papp says, please give some idea of your data sizes.
I was going to start wittering about Ceph storage here, as the block diagram for HDF Cloud looks like it should work with Ceph also. But I don’t have anything meaningful to say.
HDF Cloud does look very interesting though!

However, have you looked at ADIOS? I came across it recently in connection with parallel I/O from the OpenFOAM CFD code.



#13

The whole uncompressed dataset is about 1 TB at the moment; it may grow by a factor of 2-3, but not by orders of magnitude. The compressed size is about 100 GB.

So far the dataset is stored in a set of NetCDF files (which are internally HDF5 files), where a single file is 1.6 GB uncompressed and about 160 MB compressed.

One reason we want to go for object storage is that it is simply cheaper at most cloud providers.


#14

I cannot add anything useful here, other than saying that you should factor in the cost of moving the data into or out of your chosen cloud platform.
Some Googling leads to this interesting blog post:
http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

I have no idea what Cloud Optimized GeoTIFF is, but it might interest you.

That is a damn interesting blog post, BTW; there are two similar standards, zarr and N5 (Not HDF5!).
Should we be looking at a native Julia Z5 (https://github.com/constantinpape/z5)?


#15

Thanks for sharing this blog post and the link to Z5, I agree it is great. Cloud Optimized GeoTIFF was mentioned in our discussions and I will check if it is flexible enough. I think a native Julia Z5 would be great to see in the future, but I won’t have the time to seriously start such an initiative.