[ANN] GeoStats.jl v0.24

GeoStats

TL;DR :tada:

GeoStats.jl v0.24 comes with full support to general unstructured meshes via the new Meshes.jl project, and provides powerful geospatial queries built with the Query.jl language.

Geostatistics on general meshes

For those coming from other fields, geostatistics is the branch of statistics that deals with data distributed over geospatial domains. A challenge in this field is to operate efficiently with large 3D meshes, which are discretizations of these domains into simple geometric objects such as triangles, tetrahedrons, etc.

Over the last months, I’ve been working actively on efficient representations of geostatistical problems such as geostatistical estimation (e.g. interpolation), simulation, and more recently, geostatistical learning over meshes, and this research has led to a very clean API that is tightly integrated with the Julia data science and machine learning ecosystems.

To illustrate these features, let’s consider any table implementing the Tables.jl interface:

julia> using Tables

julia> table = (a=rand(100), b=rand(100)) # a Tables.jl table
(a = [0.307732211237423, 0.803718835997471, 0.08411713448982439, 0.4830139406337839, 0.89972361406729, 0.4838048624166984, 0.6572940355048909, 0.19799338569165803, 0.7805574255598815, 0.8990875618632652  …  0.07242337288301681, 0.7676977890604322, 0.8262285634679147, 0.03258811191692268, 0.5968713113063921, 0.2685941981073232, 0.4662204960862124, 0.5170359078991946, 0.5012109114845784, 0.11183068325228596], b = [0.9965173138643892, 0.11553726662208041, 0.20159410317223725, 0.7313178905382756, 0.557382816267495, 0.2856813840254393, 0.5016940394671034, 0.7547849171693573, 0.1820419942447178, 0.49498420120152087  …  0.003864021720924482, 0.3285419806520167, 0.26864603835206635, 0.23492724473816962, 0.011024989056165335, 0.8322119339786553, 0.5203882014765839, 0.6062219376724753, 0.8817101294035377, 0.6451100662465474])

We can easily georeference this table on various kinds of geospatial Domain defined in Meshes.jl, e.g. PointSet, CartesianGrid, GeometrySet, etc.

For example, we can georeference the table on a PointSet:

julia> using GeoStats # v0.24

julia> π’Ÿ = georef(table, PointSet(rand(Point3, 100)))
100 GeoData{3,Float64}
  variables
    └─a (Float64)
    └─b (Float64)
  domain: 100 PointSet{3,Float64}

This object implements the Meshes.jl Data trait, and consequently it can also be seen as a table:

julia> first(Tables.rows(π’Ÿ), 3)
3-element Vector{NamedTuple{(:a, :b, :geometry), Tuple{Float64, Float64, Point3}}}:
 (a = 0.307732211237423, b = 0.9965173138643892, geometry = Point(0.9489465238556838, 0.6840987940920451, 0.09563305979228809))
 (a = 0.803718835997471, b = 0.11553726662208041, geometry = Point(0.055798761161550026, 0.13459614756211802, 0.06163990143241693))
 (a = 0.08411713448982439, b = 0.20159410317223725, geometry = Point(0.39791182546602677, 0.47022283912414164, 0.6568206304457964))

where an additional column called geometry has been generated on the fly with the points in the PointSet.

The main advantage of this design becomes clear when we want to georeference data on large meshes. For example, we can create a large CartesianGrid without allocating memory:

julia> @allocated CartesianGrid(100,100,100)
0

julia> dom = CartesianGrid(100,100,100)
100Γ—100Γ—100 CartesianGrid{3,Float64}
  minimum: Point(0.0, 0.0, 0.0)
  maximum: Point(100.0, 100.0, 100.0)
  spacing: (1.0, 1.0, 1.0)

and (lazily) load a large table from a database to be georeferenced on this grid:

julia> tab = (a=rand(1000000), b=rand(1000000))
(a = [0.17820079093053254, 0.7944333803391856, 0.8376111603886238, 0.8330377164106701, 0.3823010119717243, 0.8544562649114289, 0.934674557908652, 0.0993039666891864, 0.040738538738726904, 0.341148837826444  …  0.21904461787804985, 0.20420960818944778, 0.5521403980538013, 0.317198105419201, 0.6136359539215912, 0.9147469824853054, 0.7982188591455464, 0.7914957990358054, 0.7286901016022427, 0.43962802801927614], b = [0.5392095325730726, 0.257817791509767, 0.4145000606294422, 0.9371621989111043, 0.3336561075167659, 0.043759436728378454, 0.12460964715500866, 0.6513118917698182, 0.5309687456864345, 0.09521258333360771  …  0.16789605649058847, 0.3147624093439827, 0.055768891575411095, 0.6452352355034459, 0.3038137169208477, 0.005174220791681394, 0.7088761618564241, 0.6580499412221101, 0.6846371637946289, 0.7824615967594442])

julia> π’Ÿ = georef(tab, dom)
1000000 GeoData{3,Float64}
  variables
    └─a (Float64)
    └─b (Float64)
  domain: 100Γ—100Γ—100 CartesianGrid{3,Float64}

The georef operation will just create a wrapper with the table and the mesh. Similar to what we did in the PointSet case, we can treat the result of this operation as a table:

julia> first(Tables.rows(π’Ÿ), 3)
3-element Vector{Any}:
 (a = 0.17820079093053254, b = 0.5392095325730726, geometry = Hexahedron(Point(0.0, 0.0, 0.0), Point(1.0, 0.0, 0.0), Point(1.0, 1.0, 0.0), Point(0.0, 1.0, 0.0), Point(0.0, 0.0, 1.0), Point(1.0, 0.0, 1.0), Point(1.0, 1.0, 1.0), Point(0.0, 1.0, 1.0)))
 (a = 0.7944333803391856, b = 0.257817791509767, geometry = Hexahedron(Point(1.0, 0.0, 0.0), Point(2.0, 0.0, 0.0), Point(2.0, 1.0, 0.0), Point(1.0, 1.0, 0.0), Point(1.0, 0.0, 1.0), Point(2.0, 0.0, 1.0), Point(2.0, 1.0, 1.0), Point(1.0, 1.0, 1.0)))
 (a = 0.8376111603886238, b = 0.4145000606294422, geometry = Hexahedron(Point(2.0, 0.0, 0.0), Point(3.0, 0.0, 0.0), Point(3.0, 1.0, 0.0), Point(2.0, 1.0, 0.0), Point(2.0, 0.0, 1.0), Point(3.0, 0.0, 1.0), Point(3.0, 1.0, 1.0), Point(2.0, 1.0, 1.0)))

However, this time we notice Hexahedron geometries built on the fly. This lazy construction of geometries is an important feature for large-scale (e.g. global-scale) geostatistics, which can easily involve more than hundreds of millions of geometries in 3D.

The result of georef can be used to define geostatistical problems such as geostatistical learning, which can be solved with any classical (non-geospatial) learning model from MLJ.jl. For a concrete example, see our https://juliaearth.github.io/GeoStats.jl/stable/workflow.html.

Geospatial data science

Wouldn’t it be great if we could more easily leverage our Julia data science stack on geospatial data? In particular, wouldn’t it be great if we could perform geospatial queries using a powerful query language?

In this release, we provide integration with the amazing Queryverse, meaning that the result of georef is query-able and that queries can involve geometric operations such as converting the Hexahedrons of the table into their centroids to produce a new geospatial table:

using Query

julia> π’Ÿ |> @mutate(geometry = centroid(_.geometry))
1000000x3 query result
a         β”‚ b         β”‚ geometry            
──────────┼───────────┼─────────────────────
0.178201  β”‚ 0.53921   β”‚ Point(0.5, 0.5, 0.5)
0.794433  β”‚ 0.257818  β”‚ Point(1.5, 0.5, 0.5)
0.837611  β”‚ 0.4145    β”‚ Point(2.5, 0.5, 0.5)
0.833038  β”‚ 0.937162  β”‚ Point(3.5, 0.5, 0.5)
0.382301  β”‚ 0.333656  β”‚ Point(4.5, 0.5, 0.5)
0.854456  β”‚ 0.0437594 β”‚ Point(5.5, 0.5, 0.5)
0.934675  β”‚ 0.12461   β”‚ Point(6.5, 0.5, 0.5)
0.099304  β”‚ 0.651312  β”‚ Point(7.5, 0.5, 0.5)
0.0407385 β”‚ 0.530969  β”‚ Point(8.5, 0.5, 0.5)
0.341149  β”‚ 0.0952126 β”‚ Point(9.5, 0.5, 0.5)
... with 999990 more rows

We can pipe as many queries as needed to clean our data for a geostatistical problem. When we are done with the queries, we can pipe the result back to a GeoData object, which is the wrapper type returned by georef:

julia> π’Ÿ |> @mutate(geometry = centroid(_.geometry)) |> GeoData
1000000 GeoData{3,Float64}
  variables
    └─a (Float64)
    └─b (Float64)
  domain: 1000000 PointSet{3,Float64}

The geospatial domain will be automatically updated. As can be seen in this example, the pipeline converted a CartesianGrid of Hexahedron to a PointSet of centroids, everything in pure Julia. The associated values for a and b could have been adjusted accordingly in a real application.

Contributing

We invite users to try the new release and provide feedback. We have a GSoC proposal for advanced Julia programmers (no exceptions) that have experience with machine learning and geospatial data: https://julialang.org/jsoc/gsoc/GeoStats

Acknowlegements

I would like to thank @quinnj and @davidanthoff for their help with tabular interfaces, and the contributors to the Meshes.jl project for various fruitful discussions, including @serenity4 @stillyslalom @hameye and others.

13 Likes