Screening Interest in a unified Vector/Raster Package, akin to R Terra

Hello all,
I haven’t been active in Discourse but I’ve been working with Julia for scientific computing for quite a few years at this point.

I want to know what the interest is in a JuliaGeo project/effort to harmonize/shift the JuliaGeo ecosystem to allow the construction of a singular, unified package like R’s Terra.

One of my big pet peeves (and barrier to full adoption) is that the geospatial ecosystem is not as robust, simple, and interactive as the R ecosystem with Terra. Right now, I feel the ecosystem is made up of disparate, and tricky to interoperate packages. For example, my current workflows for scientific computing are an elementary operation of cropping layered rasters (of different CRS) by an extent/polygon extent then transforming these into dataframes/arrays for further analysis. In R, this is five lines of code that are very easy to read and require only Terra. Additionally, Terra easily lets people throw a wall of stats and operations on vector and raster objects on top of these simple data manipulations.

In Julia this operation would require atleast two packages and knowing how to translate between the packages. In the case of GeoDataFrames and Rasters, it’s easy, but I’ve found Shapefile to be trickier, ArchGDAL is verbose, GeoInterface lacks higher operations, and there are lots of, imo redundant packages (such as GeoJSON.jl, Shapefile.jl, etc. AFAIK these can be handled into abstracted GDAL based workflows with little to no change in code). Additionally, the documentation for these packages are quite sparse and typically show limited “translations” between related packages.

Many R geospatial analysts I know typically load in and modify geospatial data using geospatial tools. Then they convert those to other datatypes for analysis, but in this process they take compiled, modified in place objects, and convert it to slow interpreted objects. They use R because of its simple interaction with these objects and extensive statistical libraries. They then complain, however, about slow analysis. In one advanced geospatial analysis project, I had to pull teeth to convince them to do the data processing in Julia because they did not know Julia. However, their R code would have taken 200 days of compute hours. Meanwhile, Julia took 20 with simple optimizations while preserving DataFrames, type flexibility, and code clarity that they realized was just as easy to read as R. There is a space here for moving analysts to faster Julia workflows.

I totally understand that Julia works best with small, concise modules. I totally understand that some of this is an adoption/skill issue. However, my sentiments echo those of other geospatial analysts I know. Additionally, a lot of the questions in this topic are quite elementary, which, to me, shows a potential barrier to adoption that is broader than my circle.

A wrapper to existing Julia packages or a ground up suite of packages leveraging Spat objects (Maybe something like “Spats.jl” that implements the spat packages, then other small modules that operate on those objects) to match the Terra ecosystem could potentially help bring these analysts to Julia.

I want to know what the interest is in developing such a suite or wrapper to existing packages would be.

8 Likes

CC @asinghvi17 !

2 Likes

Hey! Glad to see more interest in a nice user experience. I definitely agree that a single meta-package can make it a lot easier for the first-time user. Julia packages tend to develop in a dependency-shy and modular way but there are situations where that shouldn’t be the case, and IMO this is one of them.

FYI, we are writing a book jl.geocompx.org (contributions are trickling in) that should make these workflows a bit easier to comprehend coming into the language. But the main problem remains.

Let me give a bit of background into my view of how this evolved:

Vector data

To start with a little background, the Julia (vector) geospatial ecosystem sort of evolved around fast and low-touch data access. So we have things like Shapefile.jl and GeoJSON.jl that both define their own geometry types that coincide with the specifications of those file formats.

Then, there are quite a few other pure-geometry packages like GeometryBasics.jl that define their own geometry types according to their own schemas.

As such, GeoInterface.jl evolved as a common interface to extract sub-geometries and coordinates from any geometry object, and query some metadata from it (spatial extent, CRS, geometry column for tables, etc). It has some other features but those are secondary these days, and not really implemented well across the ecosystem.

Over the last year or so, quite a bit of effort has gone into GeometryOps.jl as a package that can do “vector operations” (geometric comparisons like intersects, touches, disjoint, etc, polygon set operations, triangulation, voronoi, etc) and do them very well, in an efficient way, on any GeoInterface-compatible geometry. This is chugging along quite well I think, but still needs some time to be polished and get all the functionality it needs to equal something like terra.

GeometryOps does already have an order-of-magnitude improvement over Terra in terms of performance, and is more correct in cases where floating point accuracy is a problem. Here’s a cheeky benchmark image:

and things are probably a hint faster now as well.

Raster data

There are a few raster packages in Julia ([NC/TIFF/Zarr]Datasets, GeoArrays, etc) but the main one IMO is Rasters.jl which is built atop of DimensionalData.jl. It’s a terrific package with some capabilities which literally do not exist anywhere else, and an order of magnitude faster in most tasks (see the image below):

and Rasters.jl also works with any GeoInterface.jl compatible geometry or table (including DataFrames!) when geometries are required (in functions like rasterize or zonal).

Dependency issues

GDAL and similar binary dependencies can be a bit fragile and hard to install. If a library requires GDAL and it isn’t working on your system - tough luck! If Windows antivirus interferes with the binary - oops!

Similarly, there’re a lot of packages that add extra bloat that the library doesn’t necessarily need to perform its core functions. So we farm those out to package extensions, but then a user needs to load them. Error messages are pretty clear these days but I can see how the initial experience can be frustrating.

Maybe we need some metapackage that loads all possible optional dependencies (GDAL, Proj, NCDatasets, ZarrDatasets, etc) and reexports whatever needs to be reexported. That’s the approach that e.g. SpeciesDistributionToolkit.jl takes, or even the GeoStats.jl / Meshes.jl ecosystem - but IMO those ecosystems are like operating in a parallel universe, or sometimes a domain-specific language that’s not quite Julia. Something a bit less all-encompassing than those would be good to have, but that’s up for discussion!

Common utilities

There are packages like GeoFormatTypes.jl (used mainly for CRS types) and Extents.jl that also have to be imported manually pretty often.

Namespacing issues

Since GeoInterface geometries can be anything, it’s hard to overload e.g. Base.intersects for them unless every single geometry provider does that - quite a hard sell. So in GeometryOps, we simply define unexported intersects and contains functions that users can access by GO.intersects or GO.contains.

Conclusion

A general metapackage that users can load would be pretty cool.

A counterpoint to this is that general metapackages remove the user’s incentive to understand what’s going on (“it just works, why should I care?”) when understanding the power of these tools allows them to unlock way faster and more interesting workflows.

Perhaps more education is a part of a potential solution - but at least making it so that users can simply using three or four packages instead of the seven or eight they need to right now would be a huge step.

cc @Raf @visr @evetion @tiemvanderdeure

3 Likes

@souma I believe that you missed the entire GeoStats.jl framework. It already works with all variations of raster, vector data and more. It is the most widely used and maintained meta package out there.

You can reivent the wheel as some people are trying to do in JuliaGeo or help instead.

To reply to this, GDAL is sometimes orders of magnitude slower than the native Julia I/O packages like GeoJSON or Shapefile (which the GeoStats ecosystem also uses). Even reprojection through GDAL is extremely slow, compared to either Proj or a pure Julia solution. Packages in R have switched away from GDAL for reprojection, for example.

The idea behind GeoInterface is to abstract away any limitation on geometry types, allowing you to focus on simply writing code that will work with any geometry. Of course if you prefer some specific layout in memory, you can create a geometry type that does that and GI.convert to it.

part of my comment was to see what the current perspective on these things are explicitly to NOT reinvent the wheel :expressionless:.

Thank you for the thorough and nice response! That’s good to see these conversations are already happening and there’s an effort to make this more accessible!

For the book, is there intent/ability to place that on the main julia geo pages? Frankly, I have the time and desire that I might help contribute to it because something like that would be awesome for an introduction to people. Plus, it would encourage me to move entirely to Julia workflows rather than my bastardized RCall into Julia workflows xD.

I do agree that there is a benefit to learning exactly how these things work. My best learning has been digging into source code and seeing exactly how the common packages I use were built.

But I have also been STUNNED by wayyyyy too many scientists/agency people’s reluctance to take that time to learn. As I think on this more, maybe that is a good place for Julia to sit. Have the options and meta packages that enable entry level simple things, but then the lower level options easily available to start poking into those finer details.

Which to be fair, it is ironic I am wondering about a Terra like metapackage when I have that EXACT issue of “not understanding the meat and potatoes” with so many R packages. Having to explain ordinal regression to an advisor :expressionless:.

Also, I was genuinely unaware of the relative I/O slowness of GDAL. And frankly, anything to make reprojection faster is a win in my book. I have to do some silly hacks to reproject the huge data I use in efficient ways.

Overall, glad to see this ecosystem is actively working on similar things I was hoping for, and I’ll spend some time finally diving into this stuff at a deeper level.

1 Like

Definitely in a month or two, although it’s not ready for prime time yet. I am currently working on a couple chapters in PRs and we’ll hopefully get a bit more progress once some other contributors are freed up.

That would be awesome! It’s always difficult to see the struggles a beginner might have when you’re deep into the ecosystem, so having your perspective would be great.

:sob: …thankfully, most people who use Julia are at this point pretty willing to experiment and figure things out!

You may be interested in the approach that FastGeoProjections.jl takes, although we might try to refactor that into a simpler approach for bulk reprojection. It’s only got some projections that the author needed, but can be up to 8x faster than even Proj thanks to some clever SIMD and loop vectorization tricks in Julia.

Just a couple of consideration.

  1. Raster analysis “needs” in Julia are different than in R, largelly because loops and array operations are natively fast and hence array representation of raster data can be fruitfully used.

  2. Lets hand over the Sahapefile formats to history. R.I.P.

3 Likes

Could you expand on your view of the overlap between JuliaGeo and GeoStats? e.g. specific packages, methods, approaches, etc? Why have two different ecosystems developed?

2 Likes

Thanks for starting this thread @souma.

Personally, I find myself struggling with the vector data than with raster data, where I think Rasters.jl comes close to terra in terms of maturity and completeness (and is much better in some areas, like data with more than 3 dimensions).

I’ve been using julia for geospatial data for a year and a half and have contributed for some time, but with vector data I still often cannot figure out what packages to use, or I need to use some GI.convert, which isn’t great.

What I think makes Rasters.jl great is that you use the same Raster object regardless of the underlying file type and so can interact with the in the same way. Even the fact that it prints the same is important here, I think. The need to import ArchGDAL or NCDatasets to load datasets isn’t a big deal, I think - you just get a helpful error and load the package, and that’s it. I’ve never interacted with these packages directly as a user, Rasters.jl makes that totally unnecessary

With vector data, I use different packages to load the data dependent on what format it is, and then get different-looking objects out - operations that work on a GeoJSON.FeatureCollection don’t necessarily work on a Shapefile.Table I think a package that unifies vector and raster operations goes against the modularity of Julia. But maybe we are lacking a Rasters.jl-like package for vector data.

3 Likes

Hi @haakon, I will share the true story of what happened so that you and other people can understand why these two separate ecosystems exist.

10 years ago I started a whole initiative in favor of developing a full software stack in native Julia for geospatial work. At that time specific people in JuliaGeo didn’t share these goals and pushed GeoInterface.jl really hard everywhere in order to interface with external C libraries (mainly through ArchGDAL.jl) for political reasons.

Years have passed and now the same people are advocating for native packages, throwing away GDAL and PROJ, which they defended so much in the past. You will see them citing packages like FastGeoProjections.jl but not citing CoordRefSystems.jl. You will watch JuliaCon presentations on GeometryOps.jl that intentionally leave out Meshes.jl from the conversation as the main existing project for native geometry processing in Julia today. The situation is so embarrassing that multiple people from the community contacted me in private to share their disapproval of the ongoing initiatives in JuliaGeo.

With that in mind, I hope you can understand that the situation is far from technical. It is mostly political. It is easy to enumerate technical criteria and conclude that the GeoStats.jl ecosystem provides the smoothest user experience with far margins:

  • Native support for CRS with unitful coordinates and measures thanks for CoordRefSystems.jl, saving end-users from common mistakes (e.g., treating lat/lon coordinates as lengths in calculations).
  • Native geometry processing in 2D and 3D unstructured meshes, grids, geometry sets (polygons, points, lines, etc.) thanks to Meshes.jl. This means that you can generalize your workflows to any geospatial domain, not just raster or 2D vector data, but also 3D models of the subsurface.
  • Full support for 2D and 3D geostatistical interpolation, simulation and learning. Including all learning models from the MLJ.jl stack.
  • Specialized geospatial learning models, including geostatistical clustering models (e.g., GHC, GSC, SLIC) from the geostatistics literature.

You can learn more about our ecosystem in the GDSJL book:

https://juliaearth.github.io/geospatial-data-science-with-julia

I maintain my view that GeoInterface.jl is not the way to go. It relies on various hacks that are not natural in Julia and is far from user-friendly. No matter what the maintainers do to try to keep up with the innovations we release in the GeoStats.jl stack, GeoInterface.jl will always treat geometries as black box containers of vertices via weird wrapper types with little to no type information. It is sad to watch the slow adoption rates of Julia in the geo world because of the persistence on this bad idea.

On the positive side, we are seeing wide adoption of the GeoStats.jl ecosystem in industry and academic institutions:

  • I went to Stanford 3 weeks ago to train 20 PhDs + postdocs on advanced geostatistical modeling with GeoStats.jl We had very positive feedback.
  • I trained a couple of undergraduate students from Earth sciences in Brazil this year, and guess which geo-ecosystem they loved the most?
  • We are seeing an increasing number of citations from various institutions (e.g., UCLA, NASA)
  • We have great contributors working at important space agencies (ESA)
  • Most importantly, we are helping young geoscientists get their jobs done without a computer science degree.

Hope this text provides some answers. Please feel free to send me a direct message if something is not clear. At the end of the day, everyone is free to choose whatever they want.

EDIT:

You will see the exact same faces from JuliaGeo arguing aganist this view, mainly Raf and Anshul at this point. Let’s wait for their comments.

4 Likes

@juliohm this kind of post is unnecessary and gives a pretty bad image to Julia geospatial.

You chose to make your own organisation JuliaEarth outside of JuliaGeo, that’s fine, do what you like, the competition is healthy.

But if you check dependencies JuliaEarth stack depends heavily on JuliaGeo packages including GeoInterface.jl, and JuliaGeo does not depend on JuliaEarch packages. GeoIO.jl, for example, depends on at least ten packages that I maintain in JuliaGeo and elsewhere.

You contribute little to low level geospatial packages but at the same time insult the people who’s work you depend on. I struggle to understand this behaviour. Please, lets be constructive here and give credit where its due. We are all just trying to make geospatial tools better in Julia.

12 Likes

@Raf we would be much happier if GeoInterface.jl didn’t exist. You know better than most people here that 95% of your time is wasted implementing this interface in packages that could easily exist without it. It would be completely fine if Shapefile.jl and GeoJSON.jl (which are the actual good works here) implemented their own interface. If these packages documented their own interfaces well, we would not need GeoInterface.jl whatsoever. GeoIO.jl only depends on GeoInterface.jl because that is how you decided we should read things from disk.

You can argue that this is not scalable in the presence of multiple file formats, but we have evidence of the contrary. GeoIO.jl supports many more file formats that are not covered by GeoInterface.jl (because GeoInterface.jl is limited in scope) and never will.

If you stopped pushing this limited interface on everyone, we would be in a much better place today. It is barely implemented in other file formats, barely supported by other ecosystems.

I understand that we have different views on some of these points, and that is fine. I will continue working on the directions I believe will help the community the most and you can do the same.

The only package in JuliaEarth that depends on JuliaGeo explicitly is GeoIO.jl. Everything else is completely independent.

Yes IO is the main dependency, but JuliaGeo is nearly all about IO. And thanks for the praise of Shapefile and GeoJSON, JuliaGeo people have put a lot of work into them over the years. We always need help with documentation too.

And we can agree to differ on GeoInterface.jl. Yes its far from perfect but I think its is going ok. It’s facilitating performance like this :

But going back to the topic: I agree with op and @tiemvanderdeure what we are still kind of missing is a Rasters.jl like package for vector/geometry data that can lazily wrap data form multiple backends and standardise the interface for new users, throw errors when a package is missing, etc (although I’m not sure we need a terra like package that unites both? I guess there are various for and against)

To me minimal dependencies is important, so this means an extension system, and “lazily” is also important, as for example converting a Shapefile to another type is expensive due to the clunky specification, but rasterizing it directly with Rasters through GeoInterface.jl methods is blazing fast because we can e.g. skip polygon construction and just get rings directly.

GeoInterface.jl has laid the foundations for that (a lasy wrapper for loading geometries from multiple backends like Rasters.jl) to be easy, and GeometryOps.jl provides very competitive pure-julia algorithms. Maybe it should add file loading too? Or maybe that could be another package.

2 Likes

A couple years ago, I started a geo-related thread (which I will not necro here) asking whether or not there was interest in replacing the big package dependencies like GDAL or PROJ with pure Julia replacements (JDAL, etc.). Has this changed at all and are folks trying to eliminate GDAL etc. altogether? Or is the work more limited to creating that seamless interface but still built upon C/C++ libraries?

1 Like

Do you have examples/numbers of this? GDAL uses PROJ for project/reproject, how can it be slower?

I’ve been in the GDAL and PROJ e-mail list for years and the amount of knowledge of the people working in those projects is immernse. The idea of abandoning them is something that does not make any sense to me. Specially GDAL. People really have an idea of the work put in reading the myriad of raster & vector formats that GDAL provides?

3 Likes

The approach Rasters.jl and GeometryOps.jl are taking is to replace whatever is feasible by pure Julia code in a pragmatic way. GDAL and GEOS are incredibly extensive and building a pure julia replacement will take years more. E.g. Rasters.jl will call GDAL to read in a .tif file or to warp a raster, but not to rasterize or polygonize. Which I think is a super reasonable way to go.

4 Likes

Yes, we had at least a 6x speedup when we switched from GDAL to Proj to reproject single arrays in Rasters.jl. See Switch the new `cellarea` and `reproject` to Proj · Issue #781 · rafaqz/Rasters.jl · GitHub and its associated PR.

My understanding is that GDAL keeps copying points unnecessarily, which slows it down tremendously. With Proj, I can directly transform arrays of x and y in place, or even pass it coordinates individually - the runtime is about the same for each of these approaches. Transforming in place lets me avoid extra allocations on the Julia end.

GDAL is a super versatile library and I don’t think anyone is asking to abandon it completely. But it should always be the fallback option where possible.

3 Likes

Yeah, I don’t know if it makes sense to try for a 1-for-1 replacement, I was just curious what level of replacement folks are trying to acheive.

It’s good to know that Julia solutions can rival and/or beat the C/C++ libraries though!

2 Likes