ANN: GeoData.jl

These are all interesting comments. We should really work on this integration to make these dimension traits available when they make sense. There are various design choices that we could discuss in a GitHub issue.

Regarding the indexing with Int, I find it crucial in order to perform these statistical operations within general meshes. I am not sure yet what the underlying assumptions in the array model are; it certainly imposes some spatial structure, but we would need to investigate it more. My intuition is that there is a very large class of spatial data that could be represented with the array model, but some exotic domains like folded surfaces (e.g. a Möbius strip) and faulted geological models (e.g. spatial discontinuities) may not be represented easily.

2 Likes

Great to see an xarray-style package in Julia! If I have a (netcdf) dataset with lon -180 to 180, is there an easy way to change the lon to 0 to 360 when the file is read in?

1 Like

It’s actually a bug/shortcoming in DimensionalData.jl that this doesn’t work by just using regular indexing in Lon (if your array is 360 pixels wide in the Lon dimension):

A[Lon([181:360..., 1:180...])]

There’s a PR coming through to fix that, after which the above does what you want, and the index is also shifted to match.

Edit: except your lon index will still be wrong and out of order. But after this it will be a vector,
and you can do this to fix it:

index(A, Lon) .= 1.0:360.0

It’s a tiny change for that to work; it’s just verging on undefined behavior, as we don’t know what happens to the index order or sorting when you index with a Vector. It could (Edit: and actually does in this case) break the use of searchsorted later.
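
Putting the two steps together, something like this should answer your original question (a sketch, assuming a 360-pixel-wide Lon dimension and the DimensionalData.jl fix mentioned above):

# Roll the data 180° in the Lon dimension, then overwrite the now out-of-order lon index.
A2 = A[Lon([181:360..., 1:180...])]
index(A2, Lon) .= 1.0:360.0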

For using selectors on the actual lat/lon values to do that, we would need the Spherical mode we discussed above to treat the Lon dim as a circular index:

A[Lon(Between(0, 360))]

Would then just work. But… how do we detect that the netcdf has a spherical index? It might need to be something you would pass in manually.

I started an issue on GeoData for Spherical manifolds:

https://github.com/rafaqz/GeoData.jl/issues/60

I was thinking about the need for Int indexing too and how you will need that for generality. I guess the regular indices will still work as you expect, but indices larger than or smaller than the array axis bounds will just also work. So it might be ok. The trick will be finding the balance between regular array properties and spherical properties. Like you would still want to be able to broadcast over another non-spherical array of the same size and dimensions.

Not sure the following is expected. It seems I have to advance the upper bound by at least 1 month in order to get 12 months. The same thing happens for any year in the upper bound.

A[Ti(Between(Date(1951, 1, 1), Date(1951,12,31)))] # returns 11 months
A[Ti(Between(Date(1951, 1, 1), Date(1952,1,15)))] # returns 11 months
A1 = A[Ti(Between(Date(1951, 1, 1), Date(1952,1,16)))] # returns 12 months

 dims(A1, Ti)
dimension Time (type Ti): 
val: DateTime[1951-01-16T00:00:34.816, 1951-02-14T23:59:43.616, …, 1951-11-15T23:59:12.896, 1951-12-16T00:00:32.768]
mode: Sampled: Ordered Irregular Intervals
metadata: NCDdimMetadata{String,Any}(Pair{String,Any}("units", "days since 1900-01-01 00:00:00"))
type: Ti{Array{DateTime,1},Sampled{Ordered{DimensionalData.Forward,DimensionalData.Forward,DimensionalData.Forward},Irregular{Tuple{DateTime,DateTime}},Intervals{Start}},NCDdimMetadata{String,Any}}

That’s expected, for two reasons, I’ll explain.

First, Between(a, b) and Between(b, c) should always be disjoint sets if a < b < c, otherwise scripting with Between is janky. That means the range selected is values x >= a and x < b.
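
To make the disjointness concrete, here is a small sketch using the same monthly Ti index as your example (exactly which months come back also depends on the Intervals handling explained below):

# Adjacent Between selections never share a time step: the upper bound is exclusive.
first_half  = A[Ti(Between(Date(1951, 1, 1), Date(1951, 7, 1)))]
second_half = A[Ti(Between(Date(1951, 7, 1), Date(1952, 1, 1)))]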

Second, by default cells are treated as Intervals for file loading. It’s the standard with GDAL, but point data will be detected if the flag is set for that. Netcdf is a little more complicated so it just always returns Intervals for now, which may have some issues.

Between selects whole intervals when the sampling mode is Intervals.
This means that if the data is intervals the sum of the returned intervals is guaranteed to not be larger than the difference between a and b, which is a nice property to have.

So to select all 12 months you need to use Between(a, b) where:

a <= Date(1951, 1, 16) && b >= Date(1952, 1, 16)
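
So, with the index in your output, something like this should return all 12 months:

A12 = A[Ti(Between(Date(1951, 1, 16), Date(1952, 1, 16)))]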

You can also see the mode in your array output:

val: DateTime[...
mode: Sampled: Ordered Irregular Intervals

Your index is Irregular (as we can’t yet detect the step size from the netcdf, and it’s a vector) so the intervals are based on the next value in the index.

But your month index appears to be centered, which is very hard to work with, as a month has no fixed length out of context, so there is no way of working with them as intervals. So maybe it’s best to convert them to Points? You can do that with setdims, and then they will behave as you expect.

A = setdims(A, rebuild(dims(A, Ti); mode=rebuild(mode(A, Ti); sampling=Points())))

Although I should make setmode(A, Ti, mode), setsampling(A, Ti, sampling) etc. methods to simplify this. It’s all immutable so you can’t just change the state; it has to be rebuilt.

Edit: I’m writing a simple set syntax to make it easy to do this https://github.com/rafaqz/DimensionalData.jl/pull/157

This might also be detectable from the netcdf if they are actually marked as points in the metadata somewhere, a PR doing that would be great.

The set methods look great! The standard for time metadata in NetCDF is explained here e.g. NetCDF Data Requirements. The time units attribute should follow CF conventions, and represent the middle time point of each measurement. I might investigate a few NetCDF files to see how true that is, and whether it’s worth GeoData.jl supporting that as default (for NetCDF).

1 Like

Ok, it does represent the middle of the measurement. It’s such a horrible decision for processing dates. But I’ll flip the behavior to Center as standard for NetCDF; that’s definitely an oversight on my part, caused partly by trying to avoid how hard that is to work with!

We’ll have to work out how to get the actual month interval from that. I guess we can query that with month etc. and work in integers; you just can’t use regular math on it like you can when it’s at the start of the month.

There is definitely a mix of formats floating around; I have a bunch of files where the time is at the start of the hour. But it doesn’t seem everyone making datasets is always across all the details of the standard.

Edit: part of this problem comes from not being able to parse the interval units from NetCDF to julia datetime objects. See: https://github.com/JuliaGeo/CFTime.jl/issues/8

If we knew what the time interval of the dimension was, we could get the interval edges (start/end of month etc.) around the center point fairly easily. But currently it could be years, months, hours or seconds, so you don’t know.
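
For example, if we did know the unit was a calendar month, recovering the interval edges around a centred time stamp is easy with Dates (just a sketch of the idea, not GeoData.jl API):

using Dates

center = DateTime(1951, 1, 16, 0, 0, 34)          # centred time stamp, as in the index above
lower  = DateTime(firstdayofmonth(Date(center)))  # 1951-01-01T00:00:00
upper  = lower + Month(1)                         # 1951-02-01T00:00:00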

Edit2: for now I will also set the time dim to Points by default, as Irregular Intervals{Center} will break Between and Contains for DateTime, until there is a solution for detecting the interval span from netcdf.

Update: GeoData.jl 0.5.0 has arrived.

This is a large refactor and simplification of the internals, and a lot of new features building on the new changes.

  1. Lazy loading. DiskArrays.jl is heavily utilized. This means files load lazily as much as possible, and things like broadcasts, view, permutedims, reverse etc. are applied lazily only when actually needed, following the chunks of the underlying storage format. Besides making things generally snappier, this facilitates e.g. broadcasting from one larger-than-memory disk-based array to another.

  2. File type detection. Backends are now detected automatically for load/save. You just use GeoArray(filename), GeoStack(filename) and GeoSeries(filenames) to load files, read to move them from disk to memory, and write to write them to disk, again using the extension to pick the backend (see the sketch after this list).

  3. Many standard GIS methods have been added, see the docs.

  • aggregate/disaggregate
  • mask/mask!
  • crop
  • trim
  • extend
  • mosaic
  • classify

These all work on GeoArray and multi-layered GeoStack, and many are lazy and can operate on larger-than-memory files.

  4. RasterDataSources.jl integration. This means downloading and loading common datasets is a one-liner, like:
using GeoData, Dates
st = GeoStack(WorldClim{Climate}, (:tmin, :tmax, :prec); month=January)
  5. Better plots. Lots of stylistic tweaks. GeoStack plots are tiled, and plots subsample automatically for fast plotting, even with larger-than-memory files.

  6. GeoSeries can be made lazily from some dimension of arrays and stacks with slice, and rejoined with combine.
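
As a minimal sketch of the load/read/write workflow from point 2 (the file names here are hypothetical, and the exact write signature may differ, so check the docs):

using GeoData

A = GeoArray("mydata.tif")      # lazy, disk-backed array; backend picked from the extension
A_mem = read(A)                 # move the data from disk to memory
write("mydata_copy.nc", A_mem)  # write back to disk; backend again picked from the extension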

Anyway, try it out!

10 Likes

Awesome work! :slight_smile:

Any plans on using a Zarr.jl backend?

Well, there is this issue I made last year and never acted on :neutral_face:
https://github.com/rafaqz/GeoData.jl/issues/42

But I haven’t personally had a use case or many requests. I guess building on Zarr.jl and following how xarray adds spatial metadata is the way to go. Is that the kind of file you need to load?

1 Like

Yes, this is the kind of file that is now becoming more common in the weather and climate domain, mainly due to the ease of using "cloud storage" and "remote access" for a specific subset of the data (e.g. no need to download the whole file if you only want a specific spatio-temporal subset).

Ok interesting. They’re still not very common in ecology/biogeography.

If you can link some publicly accessible example files, that would help. Mostly the work is in implementing and testing the xarray metadata spec.

Do you use any files that aren’t in the xarray format? Just generic Zarr?

@fabiangans : Do you have an example Zarr file somewhere in Zarr.jl ?

I have limited experience with Zarr files, but we are beginning to test it at my office. I think that the most stable path right now would be to implement what is used by xarray? Because Unidata seems to have incorporated Zarr support, but using a subset of "pure" Zarr: Overview of Zarr Support in netCDF-C : Unidata Developer's Blog

Oh right. It seems that they use different metadata standards :frowning_face:

So we may have 2 formats to support. At least NCZarr will probably be close to netcdf…

I honestly don’t know when I will be able to implement this; if someone is keen to write a backend, there are 4 others to copy. I’m also doing some refactoring of DimensionalData/GeoData that should make contribution easier soon.

1 Like

Regarding a Zarr example file, the obvious choice would be to use CMIP6 data hosted on GCS by Pangeo. This is one run I randomly picked for the unit tests, but there is a whole catalog of CMIP6 runs here: https://storage.googleapis.com/cmip6/pangeo-cmip6.csv

using Zarr
zopen("gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/6hrPlev/psl/gn/v20170706/")

An alternative would be to wait until the metadata standards of NCZarr and xarray-Zarr have converged in one way or the other, and then directly read the data through a recent build of netcdf, so there is no need for another dependency on Zarr.jl.

2 Likes

An alternative would be to wait until the metadata standards of NCZarr and xarray-Zarr have converged in one way or the other, and then directly read the data through a recent build of netcdf, so there is no need for another dependency on Zarr.jl.

This is what I was hoping would happen. Possibly this will need some work in NCDatasets.jl, but less than writing a Zarr backend that can read all the metadata, and it will be more widely useful to work in a NetCDF-specific package.

Currently I think netcdf can load xarray metadata, but only if you specifically flag it.

Does that example file have xarray metadata?

Yes, it has xarray metadata; actually, all examples that I have met in the wild use this convention.

I recently tried to use Zarr via the NCZarr code in NetCDF 4.8.1 and it did work in NCDatasets.jl (and quite likely with NetCDF.jl too) without code changes (at least so far) in these packages, as the Zarr files can be accessed with the same NetCDF API. I put some notes here.

As alluded to by Fabian, the concept of named dimensions (central in NetCDF) is handled differently in xarray than in NCZarr (https://github.com/zarr-developers/zarr-specs/issues/73). Clearly, it would be very useful to have named dimensions in the Zarr standard (or at least to have xarray and NCZarr adopt the same extension: https://github.com/zarr-developers/zarr-specs/issues/41).

In any case, NetCDF 4.8.1 also supports the xarray extension when a user uses the xarray mode.
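
With NCDatasets.jl built against netcdf-c >= 4.8, opening a Zarr store then looks roughly like this (a hedged sketch: the paths are hypothetical, and the exact #mode= fragments should be checked against the netcdf-c NCZarr documentation):

using NCDatasets

ds_nczarr = NCDataset("file:///data/example.zarr#mode=nczarr,file")  # NCZarr-flavoured store
ds_xarray = NCDataset("file:///data/example.zarr#mode=xarray,file")  # pure Zarr using the xarray convention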

I am wondering if anybody is interested in the STAC spec (https://stacspec.org/). It allows you to make a catalog of data products (quite often stored in GeoTIFF or Zarr). It is used in the Pangeo community and on Microsoft’s "Planetary Computer". To me, it sounds like a fun community effort to implement this :smiley:.

4 Likes

That sounds promising! If it works in NCDatasets.jl, it should just work in GeoData.jl too.

As for cataloguing assets, absolutely.

https://github.com/EcoJulia/RasterDataSources.jl

It does this for a lot of raster files, with a focus on easy, repeatable syntax just for rasters.

That could contribute somehow; it gives access to a few million files currently.

1 Like