Lazily reading from a FORCE data cube

Hi!
I would like to read data lazily from a FORCE datacube. FORCE is a processing pipeline for satellite images that stores its outputs in a bunch of folders. Each folder (“tile”) covers a rectangular area and contains a set of tif files for individual time steps. Each tif covers the whole area of the tile, including lots of “nodata”, since the data is incomplete in space and time.

My dataset is all available Sentinel (10 m resolution) imagery for Germany from 2017 until present: around 15-20 TB in roughly 10k files, stored on a NAS. I now want to cut out 60,000 small snippets, each covering maybe 30x30 pixels, and get the data for all available points in time, with the option to filter for certain months only.

How would you solve this task? Apart from Julia: are there any services that provide APIs for this kind of thing? I’m quite new to satellite image processing…

My idea was to use Rasters.jl and write a new package on top of it that lets me call getindex as cube[time, x, y, bands] (or similar) to filter time and region. For spatial filtering one could store the extents of all rasters in an R-tree; a query against the tree would return all intersecting rasters, which could then be filtered for time. The getindex call could then be translated to the remaining files to load the data from disk (finally filtering out the nodata areas). This sounds complicated; are there maybe better file formats to store this amount of data more conveniently? Maybe I can prune all tifs to their data extent and stuff them into a NetCDF or so?
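Roughly what I have in mind, as an untested sketch: it only uses Rasters.jl / Extents.jl calls, but the NAS path, the example coordinates and the brute-force extent scan (standing in for a real R-tree) are made up for illustration:

```julia
using Rasters, Extents

# Collect all tif paths from the FORCE tile folders (path is made up).
tifs = String[]
for (root, _dirs, files) in walkdir("/nas/force/cube")
    append!(tifs, joinpath.(root, filter(endswith(".tif"), files)))
end

# Open everything lazily and remember each file's extent (no pixel data is read yet).
rasters = [Raster(p; lazy=true) for p in tifs]
extents = [Extents.extent(r) for r in rasters]

# One 30x30 pixel snippet around some point (coordinates made up, 10 m pixels).
x0, y0 = 4_532_000.0, 3_250_000.0
xmin, xmax = x0 - 150.0, x0 + 150.0
ymin, ymax = y0 - 150.0, y0 + 150.0
snippet = Extents.Extent(X = (xmin, xmax), Y = (ymin, ymax))

# Keep only the files whose extent intersects the snippet (linear scan; the R-tree
# query would replace this filter), then read just that window from each of them.
hits   = [r for (r, e) in zip(rasters, extents) if Extents.intersects(e, snippet)]
chunks = [r[X(xmin .. xmax), Y(ymin .. ymax)] for r in hits]
```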

All help is appreciated :slight_smile:

You can do most of this fairly easily with Rasters.jl? I’m not sure what’s missing. Maybe the cube file loading could be easier?

If you have already tried this, make a github issue for Rasters.jl with your script and we can work out what the pain points are and what needs to be added.

Filtering for time and region is easy with the DimensionalData.jl indexing that Rasters.jl uses.
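For example, something like this sketch, assuming you already have the cube as a single Raster (or RasterStack) with X, Y and Ti dimensions; the variable name and coordinates are placeholders:

```julia
using Rasters, Dates

# Spatial window via coordinate intervals, time filtered to certain months
# with the Where selector from DimensionalData:
window = cube[X(4_530_000 .. 4_530_300),
              Y(3_249_000 .. 3_249_300),
              Ti(Where(t -> month(t) in (6, 7, 8)))]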

crop in Rasters.jl is lazy and should work on huge datasets. It will still be slow as it has to load every chunk and check the values, but it shouldn’t use all your memory if your files are properly chunked. If not, make an issue.
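A rough sketch of what that looks like (cube and the extent values are placeholders again):

```julia
using Rasters, Extents

snippet = Extents.Extent(X = (4_530_000.0, 4_530_300.0), Y = (3_249_000.0, 3_249_300.0))
small = crop(cube; to = snippet)   # still lazy, nothing is read here
data  = read(small)                # materialise just this window
```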

Edit: I missed the NAS part. Yes, this may be pretty slow over a network, but remote files work fine with Rasters.jl/GDAL.jl, even from URLs. Things will be much faster if you already know the coordinates of the chunked data. The total file size isn’t relevant if the bits you want are only a 30x30 pixel snippet. If you have to find where the chunks are, I imagine you have to search everything? (There should be a way of detecting compressed chunk sizes, but I’m not sure there is.)
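For the record, a remote file is just another path/URL as far as GDAL is concerned; a sketch (the URL and filename are made up):

```julia
using Rasters

# HTTP access goes through GDAL's /vsicurl/ virtual file system; a mounted NAS
# path works the same way as a local file.
r = Raster("/vsicurl/https://example.com/force/X0058_Y0043/20180412_SEN2A_BOA.tif"; lazy = true)
r[X(4_530_000 .. 4_530_300), Y(3_249_000 .. 3_249_300)]  # should only fetch the blocks it needs
```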