Does anyone know if there is a Kerchunk-like package for Julia or if there is any work towards such a package? Kerchunk for Julia would be immensely useful.
I don’t think there is one yet. From his experience with kerchunk and fsspec, @lsterzinger may have something to say about how difficult it would be to implement. Apart from that, it could start from wrapping the appropriate C libraries, no? (We already have one part, which is Zarr.jl.)
In terms of creating kerchunk metadata files, it depends on whether Julia implementations of HDF/NetCDF/GRIB/others allow for scanning and retrieving byte ranges, compression info, etc. from remote files. With Kerchunk this is done via fsspec, which provides a unified filesystem interface to access files stored across a wide range of storage media. Then it’s just a matter of building that info into the Kerchunk/ReferenceFileSystem specification outlined in the kerchunk documentation (References specification).
The difficult part would be porting the fsspec reference filesystem, to which the proper metadata generated via the above is passed. This is what maps the filesystem paths of zarr chunks referenced in kerchunk metadata files into byte ranges in the binary data file, which is what actually allows for zarr-like access of the data. The documentation for this (and a link to source) is linked below.
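To make the key-to-byte-range mapping concrete, here is a minimal sketch in Python (standard library only, local files only). It assumes the version 1 layout of the references spec as I read it: each entry is either inline data as a string (optionally prefixed with `base64:`), `[url]` for a whole file, or `[url, offset, length]` for a byte range. The function name `resolve` and the demo file are made up for illustration; this is not the fsspec implementation.

```python
import base64
import json
import os
import tempfile


def resolve(refs, key):
    """Return the bytes that a references entry points to (local files only)."""
    entry = refs[key]
    if isinstance(entry, str):
        # Inline data: plain text, or base64 when prefixed with "base64:".
        if entry.startswith("base64:"):
            return base64.b64decode(entry[len("base64:"):])
        return entry.encode()
    url = entry[0]
    with open(url, "rb") as f:
        if len(entry) == 1:
            # [url] means the whole file.
            return f.read()
        _, offset, length = entry
        f.seek(offset)
        return f.read(length)


# Demo: a 16-byte binary file holding two 8-byte "chunks".
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data.bin")
with open(path, "wb") as f:
    f.write(b"AAAAAAAABBBBBBBB")

references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),  # inline metadata
        "x/0": [path, 0, 8],                        # first chunk
        "x/1": [path, 8, 8],                        # second chunk
    },
}

chunk0 = resolve(references["refs"], "x/0")
chunk1 = resolve(references["refs"], "x/1")
zgroup = resolve(references["refs"], ".zgroup")
```

A port to another language would need the same three cases plus a way to do the ranged read against remote storage.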
I wonder if @meggart, author of many relevant packages, has any thoughts on this. My guess is that he has.
Never heard about this, but since HDF/NetCDF/GRIB were mentioned, I think the best place to put the pressure is the GDAL list (I mean gdal-dev@lists.osgeo.org). The rest (access from Julia) will come as a bonus.
Author of kerchunk here, just became aware of this thread. Will read and comment soon. I do not know how to Julia at all.
OK, so let me make a couple of comments.
- the references file spec, currently in JSON, is language-independent. Tools for creating them are written in python in the kerchunk repo, but there is no need for these once the references file is made. (certain specialised codecs excepted)
- zarr works in julia
- the reference filesystem (or its key-value interface) is really simple and anyone can implement this
- the hardest part is figuring out how the reference filesystem can call the other storage backends like local, S3 and GCS. This is what fsspec does for python, and I have no idea how these kinds of things happen in julia land.
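As a rough illustration of the last point, the backend problem boils down to dispatching a ranged read on the URL scheme. The sketch below is not how fsspec is implemented; `register`, `read_range`, `read_local` and `read_memory` are hypothetical names, and the `memory://` backend is only a stand-in for a real S3 or GCS client.

```python
import os
import tempfile
from urllib.parse import urlparse

# Hypothetical registry: URL scheme -> function(url, offset, length) -> bytes.
_READERS = {}


def register(scheme, reader):
    """Associate a URL scheme with a function that can read a byte range."""
    _READERS[scheme] = reader


def read_range(url, offset, length):
    """Dispatch a ranged read to the backend registered for the URL's scheme."""
    # Bare POSIX paths have no scheme and are treated as local files.
    scheme = urlparse(url).scheme or "file"
    try:
        reader = _READERS[scheme]
    except KeyError:
        raise ValueError(f"no backend registered for scheme {scheme!r}") from None
    return reader(url, offset, length)


def read_local(url, offset, length):
    """Local backend: seek to the offset and read `length` bytes."""
    path = urlparse(url).path if url.startswith("file://") else url
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an object store: a dict of URL -> bytes.
_BLOBS = {"memory://bucket/data.bin": b"0123456789"}


def read_memory(url, offset, length):
    """In-memory backend: slice the stored bytes."""
    return _BLOBS[url][offset:offset + length]


register("file", read_local)
register("memory", read_memory)

# Demo with one local file and one "remote" blob.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"abcdefghij")

local_bytes = read_range("file://" + path, 2, 4)
remote_bytes = read_range("memory://bucket/data.bin", 3, 3)
```

A real S3 or GCS backend would implement the same three-argument function with an HTTP range request, so the reference store itself would not need to know which storage it is talking to.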
Yes, referenceFS could be built into GDAL, which already has an internal concept of virtual file systems. However, a main aim of the kerchunk idea was not to need heavyweight monolithic libraries like that, but rather to offer asynchronous and parallel reading of the sort used by dask. Note that there is nothing geo or even array specific about kerchunk. You could use this on many different dataset layouts.
So it seems like what we need is a way to access kerchunked files. Is this what you meant, @alex-s-gardner? In that case, what we need is fsspec and not kerchunk itself, if I follow Lucas and Martin correctly.
@aramirezreyes we need the references file spec, the ability to build virtual datasets from multiple files, and have those virtual datasets play nicely with Zarr.
I have to admit that I really have not looked into kerchunk a lot yet, but the idea is really interesting.
I agree that it would be nice to have an fsspec-like package in Julia. I think a first step towards fsspec functionality has already been done with FilePathsBase which is an abstraction layer already implemented by both file systems and S3. However, what is still missing would be an interface function for reading byte ranges from these, but this should be possible to add if someone is interested.
An fsspec-like package would also be a big help for the development of Zarr.jl itself, because it would make the definition of quite a few backends obsolete since the provision of a filesystem-like interface to different storage backends would be done by that package then.
Then I think it would be quite simple to start from a python-generated kerchunk metadata file and import it into a small additional Zarr.jl backend to expose the data as chunks of a single Zarr array. A much larger step would be building these metadata files by scanning/retrieving byte ranges, since, as mentioned above, we would need some functionality built into upstream libraries, which would be a larger task.
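For a sense of what building such a metadata file involves in the simplest possible case, here is a Python sketch that describes a flat, uncompressed file of little-endian int32 values as a chunked Zarr v2 array. The function `build_references` and the array name `x` are made up for illustration. Real formats like HDF5 or GRIB need the upstream library to report chunk offsets and compression info, which is exactly the hard part.

```python
import json
import os
import struct
import tempfile

ITEMSIZE = 4  # bytes per little-endian int32 value


def build_references(path, n_items, chunk_items, name="x"):
    """Describe a flat int32 file as a 1-D chunked Zarr v2 array (version 1 refs)."""
    zarray = {
        "zarr_format": 2,
        "shape": [n_items],
        "chunks": [chunk_items],
        "dtype": "<i4",
        "compressor": None,  # raw bytes, no compression
        "filters": None,
        "fill_value": 0,
        "order": "C",
    }
    refs = {
        ".zgroup": json.dumps({"zarr_format": 2}),
        f"{name}/.zarray": json.dumps(zarray),
    }
    chunk_bytes = chunk_items * ITEMSIZE
    n_chunks = n_items // chunk_items  # assumes chunk_items divides n_items
    for i in range(n_chunks):
        # Chunk i of a 1-D array lives under the key "<name>/<i>".
        refs[f"{name}/{i}"] = [path, i * chunk_bytes, chunk_bytes]
    return {"version": 1, "refs": refs}


# Demo: eight int32 values, described as two chunks of four.
path = os.path.join(tempfile.mkdtemp(), "raw.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<8i", *range(8)))

references = build_references(path, n_items=8, chunk_items=4)

# Read the second chunk back through its reference entry.
url, offset, length = references["refs"]["x/1"]
with open(url, "rb") as f:
    f.seek(offset)
    second_chunk = struct.unpack("<4i", f.read(length))
```

Dumping `references` with `json.dumps` gives a file in the shape that a reference store on the reading side would consume.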
I hope it is OK to add here. I am amazed at the usefulness of data management/retrieval packages like Kerchunk and Zarr.
I have worked in scientific and engineering computing support for many years. In the early days to get some data you would literally have to ask for a dataset to be sent on round tapes (*)
Later on we had larger capacity disks - but engineers use ‘meaningful directory names’ resulting in inflexible naming schemes for directories which in effect are metadata by the back door. I always said this was an awful way to arrange data storage/access.
Now we see a new generation which expects to be able to download a myriad of datasets: climate, satellite, medical reference datasets… AND understand the format.
I hope this leads to some truly great progress in science and commerce. Less time wrangling with actually getting your hands on the data and more time on the analysis and modelling.
(*) tapes had to be kept overnight in the machine room before reading - due to humidity