ANN: GeoData.jl

It’s been a while coming, but I’m happy to announce that GeoData.jl is finally registered!

GeoData is a package for loading and manipulating spatial data - predominantly raster arrays and multilayered stacks of rasters, or even whole directories of files arrays or stacks. But it will eventually handle other kinds of spatial data to allow easy interop/conversion between formats.

The key idea is that a set of GDAL arrays can be accessed with identical syntax to netcdf stacks or custom HDF5 data formats - so we can build packages on top of GeoData.jl that don’t have to specialise code for the underlying format. Next week I’ll release GrowthMaps.jl as an example of this.

GeoData.jl also builds on DimensionalData.jl (which is getting quite mature) so all spatial data can be indexed and subsetted using dimensional syntax. A lot of work has gone into selectors like Between and Contains to allow accurately subsetting data that represents intervals as well as points - which is a lot of spatial data.

Anyway heres a simple example, there are lots more in the readme and docs as well.

julia> using GeoData, NCDatasets

julia> url = "https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc";

julia> filename = download(url, "tos_O1_2001-2002.nc");

julia> A = NCDarray(filename)
NCDarray (named tos) with dimensions:
 Longitude (type Lon): Float64[1.0, 3.0, …, 357.0, 359.0] (Converted: Ordered Regular Intervals)
 Latitude (type Lat): Float64[-79.5, -78.5, …, 88.5, 89.5] (Converted: Ordered Regular Intervals)
 Time (type Ti): DateTime360Day[DateTime360Day(2001-01-16T00:00:00), DateTime360Day(2001-02-16T00:00:00), …, DateTime360Day(2002-11-16T00:00:00), DateTime360Day(2002-12-16T00:00:00)] (Sampled: Ordered Irregular Intervals)
and data: 180×170×24 Array{Union{Missing, Float32},3}
[:, :, 1]
 missing  missing     missing  …  271.437  271.445  271.459
 missing  missing     missing     271.438  271.445  271.459

Now plot every third month in the first year:

julia> using Plots

juila> A[Ti(1:3:12)] |> plot

27 Likes

That sounds fantastic - I look forward to the first Julia climate or satellite plot seen on television!
This is goign to sound bad though…
Should this package repository be under the Geo organisation

1 Like

I’m a member of JuliaGeo. It might be there one day.

2 Likes

Thank you for the contribution @Raf, amazing work.

Could you please clarify the assumptions underlying the spatial data you have in mind? In order for these indexing operations to be useful you are assuming that the data is like an “image”?

In GeoStats.jl I have converged in a format representation that is a super set of the image representation in the sense that it considers general finite element meshes, and where satellite images are one type of mesh (structured grids). It would be nice to try to implement the interface at some point to see if we can plug in your types from GeoData.jl into GeoStats.jl and have it work out of the box.

Currently dimensions are only for array axes - a single Dimension type for each axis.

But the idea is to attach these dimensions to vector/point data and dataset columns so we have explicit types tracking the order of the coordinates/columns etc. Then you can just convert between types with no information loss.

1 Like

Do you mean a superset in terms of inheritance? I like that GeoArray <: AbstractArray as it gives interop with the rest of the julia ecosystem.

The Dimension Tuple is like a very complex trait that specified the dimension order and names, but also index mode (Categorical/Sampled/NoIndex), sampling type (Points/Intervals) and a bunch of other details about the index/array axis/dimension, or anything you want to add types for. So other spatial types could be connected to GeoData.jl by having a dims field that returned the dimensions represented by the data. But it might not be stored as an array.

It would be great to work towards interop - also hopefully integration with geostats for interpolation etc. With all the trait information we know a lot about the data so we should to be able to dispatch to sensible methods by default.

Edit: A fun thing about the Selector in DimensionalData is that we can probably use them to subset any kid of spatial data - they’re like queries (theres even a Where selector…) and they could equally select from vector of points or from dataframe columns as from a multi-dimensional array - using the same syntax.

1 Like

I mean in terms of spatial configuration. For example, non-Cartesian worlds like spherical manifolds (they have no firstindex nor lastindex) in which we could be interested in interpolating values and finding neighbors that “wrap around” the sphere. cc @Datseris

I like this generality and the Dimension types you introduced. I wonder if they incur any performance penalty while processing the coordinates. I understand you can index the arrays with integers as usual, and optionally in user code use the dimension types to make the script more readable. Do you see algorithms implemented with the dimension types, or they are mostly intended for end use in applications where clarity of intention is important?

That is how I see it as well. If we could dispatch geostatistical methods on specific carefully-tuned data structures, we can gain a lot of modeling benefits. For example, the way we currently do slices in GeoStats.jl with real coordinates leverages the underlying spatial configuration of the data. We can slice an “image” and by definition we still get an image:

I = geostatsimage("Strebelle") # some image
250×250 RegularGrid{Float64,2}
  variables
    └─facies (Float64)
S = slice(I, 50.5:100.2, 41.7:81.3) # slice with real coordinates
49×39 RegularGrid{Float64,2}
  variables
    └─facies (Float64)
plot(plot(I), plot(S), link=:both)

But we can also slice a point set and in this case we just get another point set:

P = sample(I, 100) # some point set
S = slice(P, 50.5:150.7, 175.2:250.3)
15 PointSet{Float64,2}
  variables
    └─facies (Float64)
plot(plot(P), plot(S), link=:both)

This looks really really promising! However, I’m still a bit confused as to the differences between this package and NCDatasets.jl. I’d also like to know how GeoData would handle a NetCDF file with multiple spatial datasets (like SST and Air Temperature as separate variables).

I wonder if they incur any performance penalty while processing the coordinates.

DimensionalData.jl indexing is all compile-time, so there is no necessary performance overhead to using them if it’s done well (see the @btime in the readme).

As an ecologist I know very little about spherical manifolds! so I guess I’m mostly assuming cartesian worlds or projections to them. Handling rotations and affine transformations is fairly straightforward, but I’m not sure what it would take to represent spherical manifolds.

Do you see algorithms implemented with the dimension types, or they are mostly intended for end use in applications where clarity of intention is important?

Apologies I’m not sure what you mean exactly. The Dimension types are for end use by users but also do most of the heavy lifting in these packages. They hold the index, but also are used for indexing into it so behind the scenes we can just compare types to match/reorder them. They have a few set fields but by default they’re just a singleton you can use as a dimension marker however you like.

But we can also slice a point set and in this case we just get another point set:

P = sample(I, 100) # some point set
S = slice(P, 50.5:150.7, 175.2:250.3)

This could be done like this with GeoData:

P[X(Where(somefunction), Y(Between(50.5, 150.7))]
# Or if you know the order
P[Where(somefunction), Between(50.5, 150.7)]

Using a range in your example suggests returning evenly spaced points/intervals along the range. Is that your intention?
In GeoData.jl/DimensionalData.jl that would only work if every point in the range actually exists in the index, as the At selector is assumed. Otherwise you would use Between.

The semantics of slice in our case is more general and equal to “find all points inside the rectangle geometry”. It is implemented as follows under the hood:

slice(object, ranges::Vararg) =
  inside(object, Rectangle(first.(ranges), last.(ranges)))

We use the traits of the object to dispatch the appropriate implementations. These traits are under active development. Hopefully we will converge into something soon, and document them properly. A challenge we are facing in this effort is the lack of built-in support for traits in the language. So far the idea is to provide a few macros so that people can more easily implement the traits on demand for their types.

This package wraps NCDatasets.jl, ArchGDAL.jl, (and provides GRD and SMAP datasets locally).

The difference is its a standardised interfaces to multiple data sources that load and operate in a structured way - you can even write from one type to the other.

To load a multilayered Netcdf you can use

s = NCDstack(filename)
temp = s[:air_temperature]
sst = s[:sst]
plot(sst)

And sst will be a standard GeoArray that will plot properly, and you could write it to tiff, etc. But you could also do:

s = GeoStack((sst=GDALarrray(sst.tiff),
              air_temperature=GDLArray(airtemp.tiff))

And have functionally the same object that another package can use without knowing that the source files are tiff, but also not load them eagerly (which is often impossible with large datasets). Standard indexing is identical for both, but also a method like modify(CuArray, s) would copy all the arrays from disk to the same object on the GPU in both cases, and aggregate(mean, s, (Lat(10), Lon(10)) would do the same aggregation to both.

2 Likes

I’m not sure about this approach to traits if you need macros… A good property of Dimension is that they are just structs with sub-components that can be pieced together with regular constructors ad-hoc, but they can also be actual “traits” that are just returned by a method for use in dispatch when the dimensions of an object have fixed properties.

The mode field of the Dimension controls most of the dispatch in DimensionalData.jl and GeoData.jl. - The behaviors specified can be very complicated with this setup. See selectors.jl in DimensionalData. The Between selector needs to dispatch on Order (Forward/Reverse), Span (Regular/Irregular) and Sampling (Points/Intervals) with (Start/Center/End) “locii”.

But you can extend any of the behaviors with a new type and a few lines of code. Projected mode in GeoData is an exampled of this - those few lines do conversion to another projection so you can just index in lat/lon etc.

Also (it’s probably not clear by the readme) GeoData.jl standardises order of the array/index so the plot always plots the right way with any data source even though they are stored in different orientations.

Indexing can always use A[Lat(45.0), Lon[Between(20,40)] etc without the user having to know when the netcdf index is actually backwards or forwards…, or even what the projection is in GDAL!! if you specify usercrs=EPSG(4326) you can index with it independent of the underlying projection, the conversion is automatic.

1 Like

We also have structs for the traits, that is not the issue. The issue is not being able to add methods to functions we don’t own like Base.join. For example, we have a trait for spatial data that happens to be called GeoData and would like to define a behavior for Base.join(obj, obj) where {geotrait(obj) == GeoData()}. However, this is not possible without type piracy. The solution is to let the owner of the type use a macro to implement the trait. The macro will add these dispatches for Base.join for the specific user type as opposed to a collection of types satisfying the trait.

Oooooooh thanks so much! I definitely will check this out! It looks really really interesting!

Aah right. So integrating GeoData.jl with GeoStats.jl would involve defining a set of traits on GeoArray etc.

In order to avoid the type piracy problem with Base.join, it might be easier to pick a different name instead of join. DataFrames.jl decided to do this by replacing join with the more specific functions leftjoin, innerjoin, etc. Those particular names might not make sense for GeoStats, but maybe there’s a different name that would work?

It’s too bad that Base.join wasn’t named stringjoin

That’s what I was thinking… At some point using your own methods is easier… even though it increases the name space, which I know we all hate to do.

Yeah, renaming the function kind of defeats the purpose. I considered renaming functions in the framework to start with an s like sjoin for spatial join, but then we become isolated from the rest of the ecosystem that does not use sjoin in their algorithms. Ideally, we should be able reuse names and make algorithms generic to work with other data types that we didn’t predict.

Sure, I don’t really understand the for/ against well enough really. I’ve just been trying to keep all object traits in one method dims for simplicity, and that strategy wont work for other peoples methods.

But I was thinking about handling your non-cartesian worlds. I’m assuming that this is still just an array or a vector of points underneath? So the issue is specifying how the wrapping behavior work. The dimension could have a mode trait Spherical that allowed selecting ranges across ends of the underlying array. Between, Near and Contains selectors on Spherical could then act in spherical ways - the nearest point could be at the other side or the array, then end cells could “contain” values from the other end of the index you could index with Contains, and Between could select subsets accross the ends of the array. It’s probably not too much code to implement that. If I understand the requirement correctly. There is some tension between abstract representations and the raw data structures with something like this.

Indexing spherically with Int directly is not possible in GeoData.jl because I try to maintain standard array indexing behaviour. But you you could make a selector for that too, - say Wrap(i) that does the conversion for you. Using a selector and getindex instead of some other wrap method means you could eg. select a slice of the time dimension at the same time, or filter some other dimension with Where selector, etc.