ANN: DimensionalData.jl and GeoData.jl

I’ve put these packages together over the last few months with the aim of standardizing and abstracting the use of geospatial data in Julia. DimensionalData.jl is a reboot of AxisArrays.jl for greater flexibility and abstraction. It also provides some of the functionality found in NamedDims.jl, such as using named dimensions in most relevant Base and Statistics methods. I kept it separate from the spatial work in case other people find it useful. It’s pretty fast: indexing single values using dimensions has no runtime overhead. Other methods vary, as rebuilding the new dimensions does have a small cost. Methods like eachslice gain type stability using dimensions, and are actually much faster than the Base implementation.
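To give a feel for the named-dimension idea, here is a minimal sketch in plain Julia. It is not DimensionalData.jl's implementation: NamedArray, dimnum and slices are hypothetical names made up for this illustration. The point is only that when dimension names live in the type, a name can be turned into an axis number by ordinary dispatch before calling the usual Base and Statistics methods.

```julia
using Statistics

# Hypothetical wrapper (not the package's type): dimension names are a type
# parameter, so looking up an axis by name needs no data stored at runtime.
struct NamedArray{names,T,N,A<:AbstractArray{T,N}}
    data::A
end
NamedArray{names}(data::AbstractArray{T,N}) where {names,T,N} =
    NamedArray{names,T,N,typeof(data)}(data)

# Translate a dimension name into its axis number.
function dimnum(::NamedArray{names}, name::Symbol) where {names}
    i = findfirst(==(name), names)
    i === nothing && throw(ArgumentError("no dimension named $name"))
    return i
end

# Named versions of two common reductions/iterators.
Statistics.mean(a::NamedArray; dims::Symbol) = mean(a.data; dims=dimnum(a, dims))
slices(a::NamedArray, name::Symbol) = eachslice(a.data; dims=dimnum(a, name))

a = NamedArray{(:lat, :time)}([1.0 2.0 3.0; 4.0 5.0 6.0])
time_mean = mean(a; dims=:time)             # 2×1 matrix, one mean per latitude
lat_rows = [sum(s) for s in slices(a, :lat)] # one sum per latitude row
```

The calling code never mentions axis numbers, which is the property the real package provides in a much more complete way.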

GeoData.jl extends DimensionalData.jl and provides three key abstractions: AbstractGeoArray, AbstractGeoStack and AbstractGeoSeries. Stacks act like a NamedTuple of AbstractGeoArrays, and series are dimensional arrays of stacks or arrays. They may contain realised in-memory arrays or just paths to files that are read lazily when required. The point is that the data manipulation code will always be the same no matter what or where the underlying data is, so you can just pass an array, stack or series to another package and it can extract the information it needs without knowing anything about GDAL or the NetCDF format.
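As a rough sketch of that idea (again in plain Julia, not GeoData.jl's actual code), a stack can hold a NamedTuple of layers where each layer is either an in-memory array or something that only loads on access. MemoryLayer, LazyLayer, Stack, materialize and layermax are all hypothetical names, and the zero-argument closure stands in for a real file read.

```julia
# Hypothetical layer types: one holds data in memory, the other only knows
# how to load it (a closure stands in for reading a file from disk).
struct MemoryLayer{A<:AbstractArray}
    data::A
end
struct LazyLayer{F}
    load::F
end
materialize(l::MemoryLayer) = l.data
materialize(l::LazyLayer) = l.load()

# A stack behaves like a NamedTuple of layers, indexed by layer name.
struct Stack{L<:NamedTuple}
    layers::L
end
Base.getindex(s::Stack, key::Symbol) = materialize(s.layers[key])
Base.keys(s::Stack) = keys(s.layers)

# Downstream code only sees arrays; it never learns where they came from.
layermax(s, key::Symbol) = maximum(s[key])

reads = Ref(0)  # counts how often the "file" is actually read
stack = Stack((
    tos  = LazyLayer(() -> (reads[] += 1; [290.0 291.5; 289.0 292.0])),
    mask = MemoryLayer([true false; true true]),
))

reads_before = reads[]          # nothing has been loaded yet
warmest = layermax(stack, :tos) # triggers the lazy load
reads_after = reads[]
```

layermax is written once and works for both kinds of layer, which is the behaviour a package receiving a stack would rely on.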

Neither package is officially released yet, but both are working pretty well if people want to try them out. They have to be released soon so I can release all the modelling packages that depend on them, but comments and reflections on the strategies I’ve used would be great before that.

For an example of what these packages can do, these lines load a NetCDF file and plot the mean sea surface temperature for Australia in the second half of 2002:

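using GeoData, NCDatasets, Statistics, Plots
using CFTime: DateTime360Day  # 360-day calendar date type

stack = NCstack(filename)
dimz = Time<|Between(DateTime360Day(2002, 7, 1), DateTime360Day(2002, 12, 30)),
       Lat<|Between(-45, 0.5), Lon<|Between(110, 160)
# Select the sea surface temperature layer, subset it, average over time and plot
stack[:tos][dimz...] |> x->mean(x; dims=Time) |> plot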

There are more examples here.

Selector wrappers like Between, At and Near select indices from the dimension values. They are a little more verbose than syntax like [x .. y] used elsewhere, but it’s very clear what they do, and they can be extended to add other selectors you might need (most of the work is in recursive methods, not @generated functions, so you can just use dispatch). Dimension names can also be added with a macro, and dims have a metadata field (usually Nothing) that can store things like dimension units and other details.
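To illustrate how dispatch-based selectors can work, here is a minimal sketch in plain Julia. It is not the package's code: the selector names mirror the ones above, but the Selector supertype, the sel2indices function and the extra Above selector are hypothetical and exist only for this example.

```julia
# Hypothetical selector types for illustration only.
abstract type Selector end
struct At{T} <: Selector; val::T; end
struct Near{T} <: Selector; val::T; end
struct Between{T} <: Selector; lo::T; hi::T; end

# Each selector maps dimension values to indices through ordinary dispatch.
function sel2indices(sel::At, values::AbstractVector)
    i = findfirst(==(sel.val), values)
    i === nothing && throw(ArgumentError("$(sel.val) not found in dimension values"))
    return i
end
sel2indices(sel::Near, values::AbstractVector) = argmin(abs.(values .- sel.val))
sel2indices(sel::Between, values::AbstractVector) =
    findall(v -> sel.lo <= v <= sel.hi, values)

# Adding a new selector is one more type and one more method.
struct Above{T} <: Selector; val::T; end
sel2indices(sel::Above, values::AbstractVector) = findall(>(sel.val), values)

lat = collect(-45.0:15.0:45.0)                    # -45, -30, -15, 0, 15, 30, 45
temps = [10.0, 14.0, 19.0, 25.0, 20.0, 15.0, 9.0] # one value per latitude

exact = temps[sel2indices(At(0.0), lat)]              # 25.0
nearest = temps[sel2indices(Near(17.0), lat)]         # 20.0 (closest latitude is 15)
band = temps[sel2indices(Between(-30.0, 0.0), lat)]   # [14.0, 19.0, 25.0]
north = temps[sel2indices(Above(20.0), lat)]          # [15.0, 9.0]
```

Because every selector is just a type with a method, a user-defined one like Above slots in without touching the existing methods.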

GeoData.jl currently has an in-memory GeoArray, plus disk-based NCDatasets, GDAL, and SMAP HDF5 backends. These are not complete implementations at this stage but are still pretty useful. They just won’t handle complex projections or niche data types, and there are also some inconsistencies in the base packages that need to be fixed or worked around.


That’s pretty awesome! Congratulations!

Is there any plan to use a scheduler such as Dagger?

If someone else wanted to implement it!

Lazy loading of TIFF/NetCDF files to keep RAM use down is about as much performance tweaking as I usually need.

What do you use Dask/Dagger for?

That would be for production-scale workflows where, for instance, one needs to do some post-processing on a large set of climate simulations. Calculations are done on small (or large) clusters. Computing indices from large climate datasets would be another example.

xarray uses dask under the hood and I’d say that it’s quite performant. I can get similar performance with Julia, but with a lot more effort in terms of discretizing the climate simulations with shapefiles (etc…). If we could use Dagger on top of Julia’s JIT, that would be a killer feature: automatic dispatch of calculations with compiled code. I must admit that’s beyond my technical skills so far.

Ok cool. We can probably just wrap a DArray from Dagger in a DimensionalArray/GeoArray and it will just work. If there are more moving parts than that you could define a new array type with dims methods. DimensionalData.jl is pretty flexible like that.
