Leveraging the GDAL VSI for “raw” access to NetCDF/HDF5 files inside a Zip archive?

Hello all, I have an interesting conundrum.

My situation is the following: I have some data which is distributed as a zip archive of .nc files. The files are rather large, so I would prefer to not unzip the archive to read them. Since I need to read these .nc files with GDAL, I can leverage the GDAL VSI to access "/vsizip/$(zip_file)/$(nc_file)" and read the raster bands without issues.
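For readers who have not used the VSI prefixes, here is a minimal sketch of how such paths are put together. The helper names `vsizip_path` and `netcdf_subdataset` are made up for illustration, as are the file and variable names, and the commented ArchGDAL usage assumes that package is the GDAL wrapper in use (the post does not say which one is).

```julia
# Build a GDAL virtual path that points at one entry inside a zip archive.
# With an absolute zip path this yields "/vsizip//abs/path.zip/entry",
# which is the form GDAL expects for absolute paths.
vsizip_path(zip_file::AbstractString, inner::AbstractString) =
    "/vsizip/$(zip_file)/$(inner)"

# Build the netCDF driver's subdataset name for one variable of that entry.
netcdf_subdataset(zip_file, nc_file, variable) =
    "NETCDF:\"$(vsizip_path(zip_file, nc_file))\":$(variable)"

# Hypothetical usage, assuming ArchGDAL.jl is installed:
#
#     using ArchGDAL
#     name = netcdf_subdataset("archive.zip", "file.nc", "some_variable")
#     ArchGDAL.read(name) do ds
#         band = ArchGDAL.getband(ds, 1)
#         data = ArchGDAL.read(band)   # raster values as a Julia array
#     end
```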

However, some of the data in these files is not exposed by GDAL, so I’m using the ZipFile and HDF5 packages to load each .nc file into memory and then process it, with something like

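    using ZipFile, HDF5

    # (helper name is illustrative; get_all_hdf5_metadata is my own function)
    function read_hdf5_metadata(zip_name, h5_name)
        r = ZipFile.Reader(zip_name)
        try
            for f in r.files
                f.name == h5_name || continue
                bytes = read(f)  # the whole entry is decompressed into memory here
                # h5open returns the value of the do-block, so return it directly
                return h5open(bytes, "r"; name="MEM:$(zip_name)/$(h5_name)") do fid
                    get_all_hdf5_metadata(fid)
                end
            end
            return nothing  # no entry with that name
        finally
            close(r)
        end
    end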

This works, but is honestly somewhat expensive (in time and memory) since I have to read the entire .nc file into memory before processing.

GDAL itself provides what HDF5 calls a VFD (virtual file driver) that makes HDF5 go through GDAL’s own VSI to access the file(s). This is “relatively” straightforward to use from C(++) as long as one has everything at hand: call GDAL’s HDF5VFLGetFileDriver(), which registers the VFD with HDF5, then pass H5Fopen a file access property list (H5P) with its driver set to the one just registered, as done in GDAL_HDF5Open().

So my obvious question would be: would it be possible to achieve this in Julia, assuming I have the respective packages installed? I assume the answer is “yes” provided one calls straight to C/C++?

Could you show an example? I mean, some SUBDATASET that gdalinfo shows but that you can’t access from Julia.

Note that nc supports a high level of compression, so those files gain nothing from being zipped.

You can use ZipArchives.jl to get a view of an entry in an archive if that is helpful. From what I can tell, HDF5 cannot open views, so you might need to do something hacky with GC.@preserve and unsafe_wrap.
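A minimal sketch of the `GC.@preserve` plus `unsafe_wrap` idea, using only Base. The helper name `with_wrapped_bytes` is made up for illustration. Whether HDF5.jl then uses the wrapped vector without copying it is an assumption that would need checking, and the wrapped vector must never be used outside the callback.

```julia
# Expose buf[range] as a Vector{UInt8} without copying and hand it to `f`.
# The wrapper aliases the memory of `buf`, so it is only valid while `buf`
# is kept alive; GC.@preserve guarantees that for the duration of the call.
function with_wrapped_bytes(f, buf::Vector{UInt8}, range::UnitRange{Int})
    checkbounds(buf, range)
    GC.@preserve buf begin
        wrapped = unsafe_wrap(Array, pointer(buf, first(range)), length(range))
        return f(wrapped)
    end
end

# Small demonstration on an in-memory buffer instead of an mmapped archive.
buf = collect(UInt8, 1:10)
total = with_wrapped_bytes(sum, buf, 3:5)
same_memory = with_wrapped_bytes(w -> pointer(w) == pointer(buf, 3), buf, 3:5)
```

In the zip case, `buf` would be the mmapped archive and `range` the `start:stop` span of a stored (uncompressed) entry.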

Here is an example using JSON3. Importantly, this only works because zip_iscompressed(r, entry) returns false, meaning the entry is not compressed in the archive. If the entry is compressed, things get a lot more complicated.

julia> using ZipArchives, Mmap, JSON3

julia> r = ZipReader(mmap(open("example.zip")));

julia> entry = something(zip_findlast_entry(r, ".zattrs"))
2

julia> start = firstindex(parent(r)) + zip_entry_data_offset(r, entry)
132

julia> stop = start + Int(zip_compressed_size(r, entry)) - 1
245

julia> @assert !zip_iscompressed(r, entry)

julia> JSON3.read(view(parent(r), start:stop))
JSON3.Object{SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, Vector{UInt64}} with 3 entries:
  Symbol("time (s)") => 150.0
  :version           => "0.8.0"
  :uuid              => "37eee81f-88ae-4d11-b6b3-d38e1ccf0a08"

The files in question are EUMETSAT’s data files for the new FCI sensor mounted on MTG satellites. The information I’m interested in is stored in 0-dimensional (scalar) datasets that are ignored by GDAL (if you try to open them explicitly with gdal, which you can when you know the HDF5 path to the dataset, it will explicitly error out with the information about them not being supported).
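For context, this is roughly what reading those scalar datasets directly with HDF5.jl can look like. It is a sketch under assumptions: the function name `collect_scalars!` and the group and dataset names in the demo are invented, and the real FCI files will have a different layout.

```julia
using HDF5

# Recursively collect every 0-dimensional (scalar) dataset below `parent`
# into `out`, keyed by its full HDF5 path.
function collect_scalars!(out::Dict{String,Any}, parent, prefix::String="")
    for name in keys(parent)
        obj = parent[name]
        path = string(prefix, "/", name)
        if obj isa HDF5.Group
            collect_scalars!(out, obj, path)
        elseif obj isa HDF5.Dataset && isempty(size(obj))
            out[path] = read(obj)   # a scalar dataset reads back as a plain value
        end
    end
    return out
end

# Demo on a throwaway file; the group and dataset names are made up.
path = tempname() * ".h5"
h5open(path, "w") do f
    g = create_group(f, "state")
    g["scale_factor"] = 0.5          # scalar dataset
    g["samples"] = [1.0, 2.0, 3.0]   # 1-D dataset, skipped by the walker
end
scalars = h5open(f -> collect_scalars!(Dict{String,Any}(), f), path, "r")
```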

True, but how these files are generated is outside of my control :sunglasses: Still, in this case zipping gives about a 10% reduction, which is around 100MB per file. And since the entries are compressed, I’m afraid I can’t even use @nhz2’s trick 8-(

Do you have a direct link to one of such files?

GMT.jl has a gdalinfo function that returns a string with the same info as the CLI gdalinfo. Maybe you can parse that string to extract what you need.

And maybe MetopDatasets.jl can help too

Not sure if this will help, but it was a lightbulb moment for me when facing the problem of extracting one file from a very large remote Zip file. If the server for the files supports range requests (which any non-toy server does these days), you can download only the particular HDF5 file you want by reading the directory at the end of the Zip file to get the start and end in bytes of the compressed file you want. Then you use another range request to download and then uncompress that file.

I’ve recently coded such a file extractor for my own purposes here. The ZipArchive constructor reads the directory (downloading the last 65000 bytes) and the resulting instance is then passed as an argument to download particular images specified by their path in the archive. I’ve tested this on Zenodo, whose server would only admit to doing range requests if you actually tried one.
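The directory lookup described above can be sketched as follows. This is not the extractor mentioned in the post; the names `fetch_tail` and `find_central_directory` are invented for illustration, and Zip64 archives (needed beyond 4 GiB or 65535 entries) are not handled. The default tail size is 22 bytes for the end-of-central-directory record plus the largest possible archive comment.

```julia
using Downloads

# Read a little-endian unsigned integer of type T starting at byte index i.
function read_le(::Type{T}, bytes::AbstractVector{UInt8}, i::Integer) where {T<:Unsigned}
    value = zero(T)
    for k in 0:sizeof(T)-1
        value |= T(bytes[i + k]) << (8k)
    end
    return value
end

const EOCD_SIGNATURE = 0x06054b50   # "PK\x05\x06" read as a little-endian UInt32
const EOCD_MIN_SIZE = 22            # fixed part of the record, without comment

# Scan backwards through the tail of an archive for the end-of-central-directory
# record and report where the central directory lives. Returns `nothing` if no
# record is found in the bytes provided.
function find_central_directory(tail::AbstractVector{UInt8})
    for i in (length(tail) - EOCD_MIN_SIZE + 1):-1:1
        if read_le(UInt32, tail, i) == EOCD_SIGNATURE
            return (entries = Int(read_le(UInt16, tail, i + 10)),
                    size    = Int(read_le(UInt32, tail, i + 12)),
                    offset  = Int(read_le(UInt32, tail, i + 16)))
        end
    end
    return nothing
end

# Fetch only the last `nbytes` of a remote file with an HTTP range request.
# This assumes the server honours the Range header.
function fetch_tail(url::AbstractString; nbytes::Integer = 65_557)
    io = IOBuffer()
    Downloads.download(url, io; headers = ["Range" => "bytes=-$(nbytes)"])
    return take!(io)
end
```

With the directory's `offset` and `size` in hand, a second range request for `bytes=offset-(offset+size-1)` retrieves the entry table, which in turn gives the byte range of each compressed file.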

@joa-quim I’m afraid I’m not free to share the data directly, but it can be downloaded from the EUMETSAT Data Store.

I appreciate the recommendations, but as I said GDAL does not present that data in any way (currently; I have opened an issue on GitHub about it). gdalinfo in particular doesn’t show it. (METOP is a completely different file format.)

The trick you suggest is quite interesting, but I’m afraid it doesn’t apply in my case. We want to process the contents of the whole archive, so we need to get the whole .zip anyway. We’d just rather avoid unpacking it if possible. That already works when we only need the GDAL-accessible information from the .nc files inside, and it works for the rest too, but at the expense of time and memory.

OK I’ve been looking into ccall() and this would do what I want if the symbols I wanted to access were exported by the gdal library. But they are not, so it seems I’m stuck in the current situation 8-(
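For anyone wanting to check this on their own build, a small sketch using the Libdl standard library. The helper name `exports_symbol` is made up, and obtaining the GDAL library path from something like `GDAL_jll.libgdal` is an assumption about the installed packages. Note also that C++ functions are normally exported under mangled names, so a lookup by the plain C++ name can fail even when the function is present.

```julia
using Libdl

# Return true if the shared library at `lib` exports a symbol named `sym`.
function exports_symbol(lib::AbstractString, sym::Symbol)
    handle = Libdl.dlopen(lib)
    try
        # With throw_error=false, dlsym returns `nothing` for a missing symbol.
        return Libdl.dlsym(handle, sym; throw_error = false) !== nothing
    finally
        Libdl.dlclose(handle)
    end
end

# Hypothetical usage, assuming GDAL_jll is installed:
#
#     using GDAL_jll
#     exports_symbol(GDAL_jll.libgdal, :GDALAllRegister)
#     exports_symbol(GDAL_jll.libgdal, :HDF5VFLGetFileDriver)
```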