Screening Interest in a unified Vector/Raster Package, akin to R’s terra

What is truly surprising is that we are beating C++ libraries by an order of magnitude in many cases. Rasters.zonal and Rasters.extract have been worked over quite a few times now and maybe warrant those benchmarks, but GeometryOps.jl methods had those results very soon after we wrote the algorithms, and we never intensely optimized them. It’s quite puzzling really! We may even try wrapping them as binaries for Python and R libraries to use, now that we have binary size reduction in Base Julia.
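
For a sense of what those benchmarks look like in practice, here is a hedged sketch; the file names are placeholders, not the actual benchmark data:

```julia
using Rasters, BenchmarkTools
import GeoDataFrames as GDF
import ArchGDAL  # backend for reading GDAL-supported formats

rast  = Raster("elevation.tif"; lazy=true)  # placeholder raster file
polys = GDF.read("watersheds.gpkg")         # placeholder vector file

@btime zonal(sum, $rast; of=$polys)  # per-polygon reduction
@btime extract($rast, $polys)        # raster values under each geometry
```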

But making these performance gains more robust and easier to access for regular users is becoming really important. @souma I’m sorry if this post has been derailed in places but I think the point you are making is very important.

@joa-quim I totally agree about the effort that has gone into GDAL and Proj; they will always be useful. As comprehensive backends that handle any kind of file input or projection, they are amazing. And Proj will be hard to beat, because calling it from Julia is just like calling it from C. But binary dependencies do bring a bunch of problems.

And while the OOP C++ interface to most GDAL functions works for Python and R wrappers, it’s very hard for GDAL to match native Julia code, where we can skip allocations, inline functions, and compile optimal methods for any types at runtime. Being able to @profview and Cthulhu.@descend through the whole Julia stack shouldn’t be underestimated either. Part of the performance we get comes from being able to fix problems across multiple packages at once, rather than dealing with black boxes we have to work around.
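
As a rough sketch of that workflow (the polygon is a toy example, not a real benchmark case):

```julia
using ProfileView, Cthulhu
import GeoInterface as GI
import GeometryOps as GO

ring = GI.LinearRing([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)])
poly = GI.Polygon([ring])

# Flame graph across the whole Julia stack, no black boxes:
ProfileView.@profview for _ in 1:100_000
    GO.area(poly)
end

# Interactively step into every method that gets compiled for these types:
Cthulhu.@descend GO.area(poly)
```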

2 Likes

To be honest, I can’t say I understand the comparisons done in that PR. When we ask GDAL to do a projection, it calls gdalwarp, which implies point projections and an interpolation. When you replace that with PROJ alone, it means the interpolation will be handled by Rasters. Are you saying that Rasters is doing the same interpolations as gdalwarp?

Out of curiosity, I timed just the gdalwarp call in GMT.jl when projecting a raster like the one in the PR, and it took 80 ms (~20x slower than the ~4 ms reported in the issue, which I reproduced approximately on my laptop). And by the gdalwarp call, I mean the ccall to GDALWarp(...).

1 Like

In that PR we are just using Proj to reproject points. Rasters then uses the projected points to calculate the area of each pixel with various algorithms. Just getting GDAL to reproject the points (no interpolation, just coordinates) was 6x slower than Proj, because Proj has basically no overhead while GDAL needs allocations.
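
For reference, a minimal sketch of that kind of bare point reprojection with Proj.jl (the CRS codes are arbitrary examples):

```julia
using Proj

# Build the transformation once and reuse it; per-point overhead is then tiny.
trans = Proj.Transformation("EPSG:4326", "EPSG:32633"; always_xy=true)

trans(12.0, 55.0)  # (lon, lat) -> projected (x, y); no interpolation involved
```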

1 Like

Honestly. It was useful in its time, but the future is now, old man lol

1 Like

Great thread. I’ll add my thoughts here as well, as I have been stewing on this stuff for a while:

Let me start with what I think is great, and by thanking the few who have done so much in such a short period of time:

  1. JuliaGeo’s performance is truly impressive thanks to the hard work of a few. This is a unique benefit of the ecosystem that I cherish with my big data workflows.
  2. JuliaGeo’s composability between packages. The hard work that went into GeoInterface and GeoFormatTypes has made composability/interoperability unparalleled (no supporting evidence :wink: )
  3. I personally believe that JuliaGeo’s approach of adding packages like Lego blocks to build the toolset that’s most relevant for one’s workflow is the future. Building a monolithic package can silo users and stunt creativity. The future of geospatial analysis is not replicating the past; it’s supporting past functionality with an eye towards innovation. Julia (the language and the culture) excels at inter-discipline operability, which I believe is a power worth embracing. I have a Raster and I can just use Images.jl without either knowing about the other. Keep that up and others will have trouble replicating these cross-discipline capabilities that are our superpower.
  4. DimensionalData.jl is a work of beauty and now underlies our two major raster packages, Rasters.jl and YAXArrays.jl. DimensionalData embodies the principles of literate programming, making our code easy to intuit.
  5. With the recent maturation of GeometryOps.jl we now have a solid foundation for improving vector operations.
  6. The welcoming, supportive and encouraging community is one that I am proud to be associated with.

My thoughts on where we could make further gains (some of these thoughts simply echo what others have already said):

  1. GDAL (and Proj and GEOS) have their limitations, but damn… what a huge effort by very clever individuals that changed the landscape of geospatial analysis. GDAL isn’t going anywhere, and new tooling and readers are constantly being added. We should not try to swim upstream on this one. Nearly every geospatial package in our ecosystem should default to GDAL, Proj and GEOS. This alone would improve the user experience, as any geospatial dataset can be read and manipulated, avoiding the need for users to search for bespoke packages for each unique datatype. I find myself just using GeoDataFrames.jl to read all of my vector data, as it abstracts away the file type. But this comes at a cost: we’re not able to take advantage of some of the gains we can get using native Julia readers… To me the future is for GeoDataFrames.jl to adopt a Rasters-style approach, where native Julia packages are used when they exist, otherwise defaulting to GDAL (see the toy sketch after this list). The user should not need any knowledge of which lower-level package is used. This is where JuliaGeo is failing right now and could use some attention. Maybe a wrapper around GDAL (GEOS/Proj) that points to native packages when beneficial is a way this could be done without bespoke implementations within each package. An ideological push to replace GDAL would use up more resources than we have… a strategic effort to incrementally replace the GDAL functionality where that yields maximum gains is probably our best approach here.

  2. There are new technologies that are revolutionizing how we access and work with massive archives of geospatial data that increasingly live in the cloud (Zarr/Kerchunk/Icechunk: GitHub - earth-mover/icechunk: Open-source, cloud-native transactional tensor storage engine). These technologies will help unlock knowledge that has remained trapped in the data due to our inability to command the full program of record when doing scientific analysis / ML / AI. The Julia community will need to track/shape/adopt these technologies to be able to efficiently interface with the wealth of geospatial data that’s out there. There has been great work with DiskArrays.jl / Zarr.jl / YAXArrays.jl in this domain, and very recently Kerchunk.jl… but as a community we want to make sure that we seamlessly support these technologies so that our tooling does not become dated to an era when analysis was done eagerly on personal computers with local storage. Given the newness and power of these technologies, early buy-in from JuliaGeo would provide motivation (cost < benefit) for new users to invest the time to become proficient in Julia and JuliaGeo.

  3. We should not “force” users to learn anything they don’t think they need to learn by making the JuliaGeo tooling any more complex than the bare minimum needed to work efficiently and effectively with it. Simplicity, readability and efficiency are the goals worth pursuing. Everything should be made as simple as possible, but not simpler. I’m a backwards learner (a lazy learner) who is always exploring tooling and what it can do to benefit my analysis/research goals. If it does something helpful, then I learn more… if it becomes too complex or onerous I move on. I try to minimize my sunk cost. Probably not the best approach, but I suspect that others are lazy/exploratory like me and take a similar approach. JuliaGeo should strive to embody literate programming with a Julian flavor. Ideally one could read the code without reading the documentation and understand what is being done to read and manipulate the data (another kudos to DimensionalData for accomplishing this).

  4. I 100% agree that now that we have GeometryOps and several vector format readers, the time is right to develop a vector version of Rasters.jl that can load data lazily and perform geometry operations without the package and type gymnastics currently needed… but I also realize that everyone is doing this on their weekends and holidays, so we are waiting for an enthusiastic contributor to take the lead, after which I’m sure the community will enthusiastically support the effort.

  5. [added late] JuliaGeo should strive to abstract away CRS wherever possible. Treating CRS similarly to how units are handled with Unitful.jl would go a long way towards reducing the burden on users, especially those who are geo-curious but not mapping experts.

  6. Lastly, I’ve become humbled enough to realize that no one person knows what’s best for the community, and giving room for competing packages to evolve separately, with support from the same or different communities, is the sign of a healthy ecosystem. Packages will compete, learn from each other, merge, and die… all in the name of progress.
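
To make the native-reader fallback idea from point 1 concrete, here is a hypothetical sketch; read_vector and its dispatch rules are invented for illustration, not an existing GeoDataFrames API:

```julia
import GeoDataFrames as GDF
import Shapefile, GeoJSON

# Invented example of "native Julia reader when one exists, GDAL otherwise".
function read_vector(path::AbstractString)
    ext = lowercase(splitext(path)[2])
    ext == ".shp" && return Shapefile.Table(path)  # native Julia reader
    ext in (".geojson", ".json") && return GeoJSON.read(read(path, String))
    return GDF.read(path)  # GDAL fallback handles everything else
end
```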

10 Likes

I realize that I should just do one bulk reply rather than individual replies. I’m used to chats more than forums.

This could be. I agree that the modularity is part of the ethos of Julia. As I think on this more, I think most of my concerns are about the ability to seamlessly work with and translate between different spatial objects. Because, like, if terra were split into SpatVect, SpatRast, and SpatTranslate (translating between vector and raster objects), I would still call it a great suite of packages to learn and use. I’ll do some more digging on my own to see what current solutions there are to this issue in JuliaGeo. All this to say, I think it’s completely okay to have the individual modules, but to have wrappers for core geo types. I mean, R people used sf and sp for over a decade, and something similar happened in Python. It’s just part of ecosystem development.

I’ll be honest, yesterday I was sort of shocked there was such strong interest in native Julia I/O. In my head, I had binned GDAL as “a C++ library that’s been optimized for decades and would be the standard for geocomputing, sort of like BLAS.” Now as I think on this with hindsight, and without understanding the whole history: GDAL was, in a way, designed to expose functionality to interpreted languages. That is something that pure C++ or Julia does not need to account for. As a result, modified-in-place loop operations would be faster.

I could see the advantage of Julia I/O though. I’ve tried diving into terra’s source code, and I learned it is shockingly type-restrictive and uses some WEIRD data structures to address that.

Reading these benchmarks and the material shared here, I’m stunned by the performance. Admittedly it’s not too shocking in hindsight, considering I myself have taken R code and cut its runtime by 90% in specific use cases, even when using compiled libraries. This is a SIGNIFICANT factor that I value a lot, and it’s good to see.

That’s one weird thing I’ve found with Julia: its composability is amazing, but knowing how to get at it can be tricky. That’s why stuff like GI and GFT can be amazing but simultaneously feel limited. A part of me wants to just dive into tons of tests and examples and spend time writing documentation and guides on how to compose these types.
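
For example, here is a hedged sketch of the kind of composition GI and GFT enable; the exact reproject keywords may differ, and Proj is loaded only to activate GeometryOps’ reprojection extension:

```julia
import GeoInterface as GI
import GeometryOps as GO
import GeoFormatTypes as GFT
import Proj  # loads GeometryOps' Proj-backed reprojection extension

poly = GI.Polygon([GI.LinearRing([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)])])

GO.area(poly)  # works on any geometry implementing GeoInterface

# CRS travels as a GeoFormatTypes value rather than a bare string:
projected = GO.reproject(poly; source_crs=GFT.EPSG(4326), target_crs=GFT.EPSG(3857))
```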

It does seem as if there is a very limited pool of contributors, which is to be expected for most OSS projects. I agree that strategy will be beneficial: essentially, bite off the important pieces first, then migrate to the more optimal stuff in time.

I’d wager that most scientific analysts I know are backwards learners too. They will use their tried and true method until it cannot do what they want, and then they either look elsewhere or build their own tool. Literate programming will be essential, imo, for adoption. To be fair, I am very biased, because I see Julia simply as “better R, so let’s PLEASE move to Julia.” Not 100% true yet, but it is 100% achievable, though not without adoption.

Honestly, reading about the ecosystem differences and the competing organizations, it might be a good investment to just learn them all, find the benefits of each, and stay adaptable in this process. And for science, maybe keep a core set of tools you’re comfortable with, but outside of that, explore and contribute where you think you can.

I’m very glad I asked this Q; it seems like it’s moving discussions forward, plus it’s giving me inspiration.

4 Likes

Thanks for asking, I appreciate the post. You’re absolutely right that it is a tricky situation. It has grown organically from separate file drivers, different geometry types/formats (and trying to have a generic interface for them), and small utility packages added when needed for our individual use cases.

I’ve always wondered whether the Julia ecosystem, with its focus on multiple dispatch/composability, would grow/need/require singular packages, or whether we could combine multiple individual packages without too much fuss. There are singular packages/ecosystems out there now, but I also still see many small composable packages (especially with Pkg extensions being in the LTS now). I think JuliaGeo has opted for the latter (maybe not even by choice, but simply limited by developer time), but we could/should spend more time on the seamless experience.

I don’t think GDAL/GEOS/PROJ are going away (and I don’t want them to) anytime soon; together they are a huge (ongoing!) effort, knowledge base and library. I don’t think we should want a JDAL (monolith), but ideally we would have smaller composable pieces in Julia that can replace specific bottlenecks (could be performance, could be un-Julian coding). Theoretically, with the recent progress on compiling Julia, we could even make things that other languages/GDAL could depend on. In the same context, @Raf, @asinghvi17 and I are involved with Spatial Data Science across Languages (SDSL), where we find common spatial ground (default file formats, metadata, concepts) with the R and Python (and recently Rust) communities.

Yeah. On the one hand it would make for ease of use. Besides, the distinction between raster and vector is mostly based on memory layouts/file formats, and shouldn’t be that important in practice (Meshes.jl has nicely unified this). On the other hand, I still like small composable packages and I see that many of my analyses require specific data formats, with lazy chunking/reads.

What could work is a JuliaGeo.jl meta package that just loads all the packages needed to support most terra-like operations. That would load GeoDataFrames, Rasters, GeometryOps, and by extension GeoFormatTypes and Extents. Not sure about GeoMakie, although that might double the dependencies. edit: First draft here: GitHub - JuliaGeo/JuliaGeo.jl: Meta package to load all composable packages in the ecosystem.
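
A minimal sketch of what such a meta package could look like, using Reexport.jl (the actual draft linked above may be organised differently):

```julia
module JuliaGeo

using Reexport

# Re-export the composable pieces so `using JuliaGeo` brings in everything.
@reexport using GeoDataFrames
@reexport using Rasters
@reexport using GeometryOps
@reexport using GeoFormatTypes
@reexport using Extents

end # module
```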

That can be done easily, although as its package author I don’t want to claim it will suddenly live up to Rasters/Terra levels.

Based on the ping of @asinghvi17, I’ve just merged a PR that re-exports GeoFormatTypes and Extents in GeoDataFrames.jl, so that should make things a bit more streamlined already.

6 Likes

On the one hand it would make for ease of use. Besides, the distinction between raster and vector is mostly based on memory layouts/file formats

The main caveat I would add to this point is that raster data lets us use the regular AbstractArray interface and the whole Julia ecosystem of tools that can manipulate arrays. Vector/geometry data will always be more limited in ecosystem scope. Python and R don’t really have an AbstractArray equivalent anyway (although there are efforts in that direction), so they don’t lose much by combining operations and abstracting across data types.

This is a core reason to keep raster and “vector” data conceptually separate. Most of the time I do geospatial work in practice, I’m treating a Raster as a regular AbstractArray and using Base and non-geo ecosystem methods on it.
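
A small illustration of that point (the array contents are made up):

```julia
using Rasters, Statistics

# A Raster is an AbstractArray, so plain Base/stdlib methods just work.
rast = Raster(rand(100, 100), (X(1:100), Y(1:100)))

mean(rast)                         # Statistics, no geo awareness needed
clipped = clamp.(rast, 0.2, 0.8)   # broadcasting preserves dims and metadata
rast[X(1:10), Y(1:5)]              # dimension-aware indexing when you want it
```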

3 Likes