Hi!
I would like to read data lazily from a FORCE datacube. FORCE is a processing pipeline for satellite imagery that stores its outputs in a set of folders. Each folder (“tile”) covers a rectangular area and contains a set of tif files for individual time steps. Each tif covers the whole area of the tile, including lots of “nodata”, since the data is incomplete in space and time.
My dataset is all available Sentinel imagery (10 m resolution) for Germany from 2017 until present: around 15-20 TB in roughly 10,000 files, stored on a NAS. I now want to cut out 60,000 small snippets, each covering maybe 30x30 pixels, and get the data for all available points in time, with the option to filter for certain months only.
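To make the task concrete: for a single snippet and a single tile file, the access I have in mind looks roughly like this (untested Rasters.jl sketch; the file name and coordinates are made up, and the exact keywords may differ between Rasters.jl versions):

```julia
using Rasters

# hypothetical FORCE tile folder / file name and made-up projected coordinates
tif = "X0057_Y0042/20180403_LEVEL2_SEN2A_BOA.tif"

# open lazily, then materialize only a ~30x30 pixel window (300 m x 300 m at 10 m resolution)
r = Raster(tif; lazy=true)
snippet = read(r[X(Between(4_501_000, 4_501_300)), Y(Between(3_200_000, 3_200_300))])
```

Doing that across all matching tiles and dates, for 60,000 snippets, is the part I’m unsure about.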
How would you solve this task? Apart from Julia: are there any services that provide APIs for this kind of task? I’m quite new to satellite image processing…
My idea was to use Rasters.jl and write a new package on top of it that allows me to use getindex, i.e. cube[time, x, y, bands] (or similar), to filter time and region. For spatial filtering one could store the extents of all rasters in an R-tree; a query to the tree would return all intersecting rasters, which could then be filtered by time. The getindex call could then be translated into windowed reads on the remaining files (finally filtering out the nodata areas). This all sounds complicated, though; are there maybe better file formats to store this amount of data more conveniently? Maybe I could prune all tifs to their data extent and stuff them into a NetCDF file or so?
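To make the index idea a bit more concrete, here is a rough, untested sketch of the direction I’m thinking in. The brute-force extent scan stands in for a real R-tree (e.g. SpatialIndexing.jl), `parse_date` is a hypothetical helper that extracts the acquisition date from a FORCE file name, and the exact Rasters.jl calls (`lazy=true`, `bounds`) may differ between versions:

```julia
using Rasters, Dates

# one record per tif: path, acquisition date, spatial extent
struct TileFile
    path::String
    date::Date
    xbounds::Tuple{Float64,Float64}
    ybounds::Tuple{Float64,Float64}
end

# one-time pass over all tifs, reading only metadata to record extents and dates;
# `parse_date` would pull the acquisition date out of the file name
function build_index(paths, parse_date)
    map(paths) do p
        r = Raster(p; lazy=true)
        TileFile(p, parse_date(p), bounds(r, X), bounds(r, Y))
    end
end

# 1D interval overlap test
overlaps(lo, hi, b) = lo <= b[2] && hi >= b[1]

# all files whose extent touches the query box, restricted to certain months
function query(index, x1, x2, y1, y2; months = 1:12)
    filter(index) do t
        overlaps(x1, x2, t.xbounds) &&
            overlaps(y1, y2, t.ybounds) &&
            month(t.date) in months
    end
end
```

Each returned file would then get a windowed read like in the first sketch above, and the per-date snippets would be stacked along a time axis behind the getindex call.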
All help is appreciated!