A Julia-compatible alternative to zarr


#1

Dear Julia community,

we are looking for an object-storage-compatible file format for quite large dense N-dimensional arrays. Our project partners suggested the zarr format (http://zarr.readthedocs.io/en/stable/), but there seems to be only a Python implementation. I am a bit afraid of the overhead that PyCall would add for every read/write, so my question is whether you know of alternative formats that already have a Julia wrapper, or at least a C API that could be wrapped.

Thanks in advance for your suggestions!


#2

I would just use a large file and mmap a Vector{T}, which I would then reshape to an n-dimensional Array. This is fast and efficient, and rather seamless in Julia (you can store metadata, e.g. the actual dimensions, in a separate file). I assume that Python has similar capabilities, so you can interoperate.

Also, if T is a widely used conventional bits type, e.g. Int32 or Float64, you should be able to use the same file from other languages.
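A minimal sketch of that approach (the file name is hypothetical, and the dimensions would live in your separate metadata file):

```julia
using Mmap

# Write a flat binary file, then memory-map it back as a Vector{Float64}
# and reshape to the original dimensions.
dims = (100, 50)
open("data.bin", "w") do io
    write(io, rand(Float64, dims))
end

io = open("data.bin")
A = reshape(Mmap.mmap(io, Vector{Float64}, prod(dims)), dims)
close(io)  # on POSIX systems the mapping stays valid after closing the stream

size(A)  # (100, 50)
```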


#3

Thanks,

yes, T will be a conventional bits type. I forgot to mention that we would like to store the data in compressed chunks, so the Mmap solution you suggested would become a bit more complicated. I know I could use a compression library directly and then only take care that the chunks are collected into an AbstractArray, probably by implementing the AbstractArray interface.

However, I thought that before inventing a custom file/metadata format, one might be able to reuse something that already exists.
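To illustrate the AbstractArray idea, here is a minimal hypothetical sketch (uncompressed chunks for brevity; in practice each chunk would be stored compressed and decompressed on access):

```julia
# Minimal sketch: a 1-D array backed by fixed-size chunks.
struct ChunkedVector{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
    chunklen::Int
    len::Int
end

function ChunkedVector(v::Vector{T}, chunklen::Int) where {T}
    chunks = [v[i:min(i + chunklen - 1, length(v))] for i in 1:chunklen:length(v)]
    ChunkedVector{T}(chunks, chunklen, length(v))
end

Base.size(c::ChunkedVector) = (c.len,)

function Base.getindex(c::ChunkedVector, i::Int)
    ci, off = divrem(i - 1, c.chunklen)  # which chunk, offset within it
    c.chunks[ci + 1][off + 1]
end

cv = ChunkedVector(collect(1.0:10.0), 4)
cv[7]  # 7.0
```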


#4

Compression is a huge hassle and will slow down access by orders of magnitude. If the data compresses so well that it is seemingly worthwhile, I would consider some recoding/bit packing, basically anything to make it more compact.

Just to give an example, frequently I find that I can easily repack dates into 16 bits using a custom epoch, so I wrote a small library
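The date-repacking idea can be sketched like this (the epoch is a hypothetical choice; an Int16 day offset covers roughly ±89 years around it):

```julia
using Dates

const EPOCH = Date(1950, 1, 1)  # hypothetical custom epoch

# Pack a Date into 16 bits as a day offset from EPOCH.
pack(d::Date) = Int16(Dates.value(d - EPOCH))
unpack(x::Int16) = EPOCH + Day(x)

d = Date(2018, 6, 1)
unpack(pack(d)) == d  # true
```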

I find mmap so convenient and fast that it is worth the extra steps.


#5

The reason compression works so well is that we have gridded geodata where most of the datasets are land-only and have some gaps, so approximately 70-90% of the values are missing. The missing ratio is thus in a range where it does not yet make sense to switch to a sparse data structure, but compression greatly reduces the disk space used, even at the lowest compression level.


#6

It would be good to test whether it is indeed much slower over PyCall for your typical usage.
I don’t know of an existing alternative, but the components seem to be there. Zarr uses Blosc for compression, and we have libraries to deal with chunking as well, such as DistributedArrays and Blocks.
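A quick round-trip with Blosc.jl (assuming the package is installed) would let you check the compression ratio for data with many missing values:

```julia
using Blosc  # assumes the Blosc.jl package is installed

# Data dominated by a repeated fill value compresses very well.
x = fill(NaN, 1_000_000)
x[1:100_000] .= rand(100_000)

buf = Blosc.compress(x)            # Vector{UInt8}
y = Blosc.decompress(Float64, buf)

ratio = sizeof(x) / length(buf)    # compression ratio
isequal(x, y)  # true (isequal treats NaN as equal to NaN)
```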


#7

Would HDF5 suit your needs (it supports compression and chunking)?
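For reference, a chunked and compressed dataset with HDF5.jl might look roughly like this (keyword names have varied between HDF5.jl versions, so treat this as a sketch):

```julia
using HDF5  # assumes the HDF5.jl package is installed

A = rand(Float64, 1000, 1000)
h5open("example.h5", "w") do f
    # chunked dataset with gzip (deflate) compression level 3
    dset = create_dataset(f, "data", Float64, (1000, 1000);
                          chunk=(100, 100), deflate=3)
    write(dset, A)
end

B = h5read("example.h5", "data")
A == B  # true
```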


#8

Seconded. I can’t see any reason in this thread why you couldn’t use HDF5, and it has the bonus that it is widely used: most major languages have a wrapper package around the C HDF5 library.

One thing to beware of, though: last time I played with HDF5, if you had a huge number of reads and writes in a single session, i.e. in the hundreds of thousands, then the quit() command at the end of the Julia session took quite a long time to run. I never did manage to work out why.


#9

The point is that we want to store the data in the cloud as object storage (e.g. on S3) and do parallel reads and writes to single files. There is HDF Cloud (https://www.hdfgroup.org/solutions/hdf-cloud/), but it does not seem to be open and would require us to run an extra service, so we are looking for something that works out of the box there. I have to admit that I am not the expert on these storage technologies, but our project partners said HDF5 would not be an option.

I think I will try PyCall with zarr for our use cases first and see how it performs.
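A first PyCall experiment might look roughly like this (assumes PyCall.jl and the Python zarr package are installed; `set!`/`get` with `Ellipsis` stand in for Python's `z[...]` indexing):

```julia
using PyCall  # assumes PyCall.jl and the Python `zarr` package are installed

zarr = pyimport("zarr")
z = zarr.open("example.zarr", mode="w", shape=(100, 100),
              chunks=(10, 10), dtype="f8")

A = rand(100, 100)
set!(z, pybuiltin("Ellipsis"), A)  # z[...] = A
B = convert(Matrix{Float64}, get(z, PyAny, pybuiltin("Ellipsis")))  # read back
```

Timing the read/write calls on realistically sized chunks should show whether the PyCall overhead actually matters compared to the I/O itself.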


#10

Ah. I didn’t realise you were after a cloud solution. Given that the object is stored in the cloud, my guess is the bottleneck will be upload/download speed, rather than PyCall, so hopefully that should work out fine. But, as you say, best to test it out first.

Cheers,

Colin


#11

Just being curious: how large is the data? E.g. compressed/uncompressed sizes in GB, in total and for a typical array.


#12

@fabiangans Why the insistence on parallel I/O? Do you really need that? As Tamas Papp says, please give some idea of your data sizes.
I was going to start wittering about Ceph storage here, as the block diagram for HDF Cloud looks like it should work with Ceph also. But I don’t have anything meaningful to say.
HDF Cloud does look very interesting though!

However, have you looked at ADIOS? I came across it recently in connection with parallel I/O from the OpenFOAM CFD code.



#13

The whole uncompressed dataset is about 1 TB at the moment; it may grow by a factor of 2-3, but not by orders of magnitude. The compressed size is about 100 GB.

So far the dataset is stored in a set of NetCDF files (which are internally HDF5 files), where a single file is 1.6 GB uncompressed and about 160 MB compressed.

One reason we want to go for object storage is that it is simply cheaper at most cloud providers.


#14

I cannot add anything useful here, other than saying that you should factor in the cost of moving the data into or out of your chosen cloud platform.
Some Googling leads to this interesting blog post:
http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

I have no idea what Cloud Optimized GeoTIFF is, but it might interest you.

That is a damn interesting blog post, BTW; there are two similar standards, zarr and N5 (Not HDF5!).
Should we be looking at a native Julia Z5 (https://github.com/constantinpape/z5)?


#15

Thanks for sharing this blog post and the link to Z5, I agree it is great. Cloud Optimized GeoTIFF was mentioned in our discussions and I will check if it is flexible enough. I think a native Julia Z5 would be great to see in the future, but I won’t have the time to seriously start such an initiative.