What is the state of Rasterization and Rasterstats in Julia

I have followed the progress of the geo community in Julia every now and then. To me there seems to be slow but steady progress, both for raster and vector files.
However, I for instance haven't found out things like whether compression in GeoTIFFs works, or whether other vector file formats like GeoJSON or GeoPackage are already working.
My main interest in this post lies in the question of how the connection between rasters and vectors works.
I am thinking about rasterization <–> vectorization, and especially about rasterstats. Especially about rasterstats, because I had the impression that it is slower in Python than it needs to be. :wink:

Rasters.jl has some basic rasterization tools, and I’m pretty close to merging this GitHub - rafaqz/Rasters.jl at methods_cleanup which will add much better and pretty fast standardized rasterization and masking for all GeoInterface.jl vector objects, as well as other vector formats and dataframes. I’m kind of aiming for rasterize and mask from R’s terra, but generic to all multidimensional objects and hopefully a lot faster.

You can also use a RasterStack (e.g. a NetCDF file or a list of tifs) directly as a Tables.jl table in most stats packages. It should be fast, but needs `read` from disk before use. A point extraction/rasterize round trip is also pretty fast if you do that.
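A minimal sketch of what that could look like (hedged: the file names are hypothetical and the exact RasterStack constructor may differ between versions):

```julia
using Rasters, DataFrames  # sketch only; assumes Rasters.jl and DataFrames.jl

# Hypothetical layer files; a single NetCDF path would work similarly
st = read(RasterStack((rain="rain.tif", temp="temp.tif")))  # read into memory first

# RasterStack implements the Tables.jl interface:
# one row per pixel, one column per layer (plus the dimension columns)
df = DataFrame(st)
```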

GeoTIFF compression should work with ArchGDAL.jl, and you can pass compression arguments to write in Rasters.jl. It probably needs some testing; if you want to try it, that would be great.

And for JSON this should work fine?


That sounds not bad. So can I assume that GeoPackage is not supported so far and that rasterstats is not implemented yet?
Are the currently implemented packages using C interfaces? I am somewhat curious whether everything also runs on ARM. (Not that I think it is currently super important, but perhaps that might change one day?)

It will run on ARM if and where Julia runs on ARM, and if GDAL does wherever it is called. GeoPackage probably works via ArchGDAL: it can be loaded with GeoDataFrames.jl, presumably with ArchGDAL.jl underneath.

For Rasters.jl I don’t often implement things that A. I don’t need personally, or B. aren’t in my issue queue (so I know someone actually cares if it exists). I guess others operate in a similar way. So it’s a really good idea to get these ideas into GitHub issues where you clearly state the use case and what is missing. Also, Rasters.jl is mostly implemented in pure Julia, with some calls to GDAL for things like gdalwarp and for loading the huge list of file and CRS formats that GDAL can handle, which is not worth writing again. Other packages are similar, often with calls to GDAL or something else for loading data. GDAL.jl, Proj.jl and similar packages are the raw C wrapper libraries.

I don’t know exactly what you mean by rasterstats, but maybe look at GeoStats.jl?

With rasterstats I meant zonal statistics. The Python package I know for it is called rasterstats (User Manual — rasterstats 0.10.3 documentation).

As for GeoStats: I remember the presentation at the last JuliaCon a bit. It seems different. Although it might look similar, I don’t really understand the documentation: Statistics · GeoStats.jl

As for GeoDataFrames: I was trying really hard to find out why I could not get it from the registry, until I had a look into the registry:

`]add GeoDataFrame` is the command as written on GitHub. But the package is called GeoDataFrames, so one “s” was missing.

From the Python documentation you linked, I think all you want is to filter the pixels of the image inside a polygon and then compute summaries on them such as mean, std, etc. You should be able to do that easily with v = view(img, polygon) and then mean(v.myvariable), for example. I don’t remember if we already added the view method above, but it should be easily done as well. Can you give it a try and report back? I’m far from my computer in a remote place but can try to debug it.

You need to (1) load the image using the georef function as described in the GeoStats.jl docs, (2) load the polygon geometry using GeoTables.jl and (3) try to view the image within the geometry.
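As a sketch, the three steps above might look like this (hedged: the post notes the view method may not exist yet, and the file name and column names here are made up):

```julia
using GeoStats, GeoTables  # sketch only; APIs may differ

img = georef((height=rand(100, 100),))          # (1) wrap an image as geospatial data
zone = GeoTables.load("zones.shp").geometry[1]  # (2) load a polygon (hypothetical file)
v = view(img, zone)                             # (3) the image restricted to the polygon
mean(v.height)                                  # then summarise the masked pixels
```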

Let me know if any of these steps aren’t working as described. They should be there and if they are not it is because we are missing some basic wrapper function for end users.

Oh right, zonal stats. I’m not sure we need a package for that in Julia? A Rasters.jl raster acts just like a regular array. So you can mask it with a polygon using mask and then just use base Julia sum, mean etc. and any other reducing function. In Julia, min and max over a collection are minimum and maximum.

However, in Rasters.jl a little handling is still required if you have missing values that are not missing - you need to use mean(skipmissing(replace_missing(raster)))

replace_missing without a second argument replaces whatever the missingval/NODATA value is with Julia’s missing, and skipmissing is the standard way to skip missing values in Julia.

This can of course be simplified a little, but we are not sure of the exact behavior it should have. Feel free to contribute opinions in this thread:

Hi, I tried around a bit.

I want to begin with another topic: because it’s the holidays but doesn’t feel like it, I wanted to try a tiny project with data from somewhere different on the globe, and ran into some strange issues. How do I deal with vectors (here shapefiles) whose column names contain UTF-8 characters? UTF-8 characters as values are no problem, but UTF-8 characters as column names are not recognizable anymore.

About the view: I could not figure out how to deal with GeoStats.Base.georef. It only works with vectors, right?

On the other hand masking as @Raf described worked fine:

mean_height = map(x -> mean(skipmissing(replace_missing(mask(dem_model; to=x)))), 
	vect.geom)

I just had the fear that the speed might not be what I expected (not compared yet). I thought perhaps masking the raster beforehand to the geometry of the joined vectors could help, or multi-threading.
But I could not join the geometry; when I tried, I got a concatenation of all the WKTs. I did not try multi-threading today. Do you have some other tips?

Thanks a lot!

You probably need to make an issue over at Shapefile.jl on github for the UTF8 problem.

The main problem I see at the moment with this code is that it may not work for larger-than-memory datasets. We need to make sure that the whole pipeline happens lazily from disk chunks to the reducing function. (But doing this can actually make things slower for smaller datasets.)

We can probably define Base.skipmissing(::Raster) to also do replace_missing lazily, which should be nice syntax. We could also make the missingval values read as missing lazily, as in SentinelArrays.jl - so that would happen automatically. I’m not sure about the best strategy yet - there are syntax and performance tradeoffs to think about.

(Edit: I just tried implementing skipmissing directly for Raster missingval and it’s much faster - 5x faster than Base julia skipmissing on a regular array with missing values. Something like this will be included in Rasters.jl in the next months).

If you ever get time to benchmark against other languages I would be really interested in the results. Rasters.jl, like most julia packages, aspires to be faster than anything else. So anywhere it’s not is worth a bug report. One day we should have a benchmark against R raster/terra/star and python rasterio/xarray etc.

I doubt anything is really slow besides accidentally iterating over or indexing into a disk array index by index, which still can happen in some contexts, or maybe compile time on a 100 layer netcdf. But many things probably need work. Using read before working with small rasters is usually a good idea, to get rid of all the chunking from disk machinery that tends to slow everything down, but is unavoidable when files get too large.

Hi,

a) Issue on Shapefile.jl opened.
b) Speed compared.
I figured out that doing the zonal stats in the terminal is much faster than in Neptune.

Still it is much slower than the Python version, and I hoped I would come up with something faster.

On my laptop the Python version (Taiwan_GIS/test_zonal_stat.py at master · Meresmata/Taiwan_GIS · GitHub) took about 3.9 sec. In the Julia version (Taiwan_GIS/test_zonal_stat.jl at master · Meresmata/Taiwan_GIS · GitHub), opening the GeoJSON with GeoDataFrames alone needed 3.9 sec, opening the raster was extremely fast, and the zonal statistics took 43 sec.

My Laptop:

  • System:

    • Kernel: 5.11.0-43-generic x86_64
    • Desktop: KDE Plasma 5.18.5
    • Distro: Ubuntu 20.04.3 LTS (Focal Fossa)
  • Machine:

    • TUXEDO product: TUXEDO Pulse 14 Gen1
  • CPU:

    • Topology: 8-Core model: AMD Ryzen 7 4800H with Radeon Graphics
    • cache: 4096 KiB
      Speed: 1397 MHz min/max: 1400/2900 MHz
  • Julia: 1.6

Great!

Do you know about the @time macro, and @btime from BenchmarkTools.jl? A bit easier than doing all that manually. If you want to get more into this, there is also ProfileView.jl, which will give you an overview of where the time is spent, with @profview.
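A toy example of the timing macros mentioned above (the geo parts are irrelevant here; any function works the same way):

```julia
using BenchmarkTools

xs = rand(10_000)
@time sum(xs)    # one-shot timing; the first call includes compilation
@btime sum($xs)  # repeats the call many times; $ interpolates xs so the
                 # benchmark doesn't measure global-variable access
```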

There is an order of magnitude improvement on skipmissing on the way in Rasters.jl, converting to missing and back turns out to be pretty inefficient. I’ll play with your example and see how fast we can get it.

Turns out there are 2 very small changes that get 2 orders of magnitude improvement.

The first is to read the data before use:

dem_model = read(Raster(dem_path))

The raster is pretty big, but not larger-than-memory big, so this is the preferred method, especially if you use the same raster multiple times.

The second, and main, improvement is to crop each village to the polygon before mask/mean, which is 250x faster :slight_smile: (probably mostly due to the 21 GiB of allocations used by the full-size mask)

julia> @time map(x -> mean(skipmissing(replace_missing(mask(dem_model; to=x)))), villages_wufeng.geom);
 17.857990 seconds (85.87 k allocations: 21.110 GiB, 6.43% gc time, 0.29% compilation time)

julia> @time map(x -> mean(skipmissing(replace_missing(mask(crop(dem_model; to=x); to=x)))), villages_wufeng.geom);
  0.079689 seconds (139.64 k allocations: 26.411 MiB, 72.04% compilation time)

But this only works on my methods_cleanup branch - I’ve only just implemented crop to geometry.

So expect this to be much faster quite soon, but you will also need to change your algorithm a little to get the improvements.

If you use BenchmarkTools, you’ll see it’s even faster than that, more like 3 orders of magnitude.

julia> using BenchmarkTools

julia> @btime map(x -> mean(skipmissing(replace_missing(mask(crop(dem_model; to=x); to=x)))), villages_wufeng.geom);
  16.913 ms (80475 allocations: 23.21 MiB)

Hi,

I tried out a bit, but I did not notice a difference from the read operation on the raster.

I have a question: when masking in Python I normally get additional options, like all_touched or inverse masking. How would I specify all_touched in the Rasters package?

How would you compute in parallel? In Python I put the main code into a function and do some kind of starmap in parallel, so that each function call processes the raster and the vector of a specific region, i.e. each call calculates a subspace.

In the example the shape was [3601, 3601, 1]; how would I mask a raster with shape [3601, 3601, 60] to get 60 different means for each mask?

As with crop to polygon, things like all_touched are only available in the branch, which should be merged in a few weeks. The keyword will be boundary=:touches, with :inside and :center as the other options, where only pixels inside the line are included, or the center point of the pixel is used. If you have better names for these things let me know!
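If I read that right, usage might look like this once merged (hedged sketch of an unmerged API; `raster` and `poly` are assumed to exist):

```julia
using Rasters  # sketch of the keyword described above; not merged at the time of writing

mask(raster; to=poly, boundary=:touches)  # include every pixel the polygon touches
mask(raster; to=poly, boundary=:inside)   # include only pixels entirely inside
mask(raster; to=poly, boundary=:center)   # include a pixel if its center is inside
```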

Parallel on a raster is just like parallel on any other Julia array. I would use @threads. But again, use read first; I doubt it will work from disk.
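A hedged sketch of that pattern: in the real case each element of `zones` would be the masked pixel values of one polygon, e.g. skipmissing(replace_missing(mask(crop(dem; to=poly); to=poly))); plain vectors stand in here so the threading structure is visible:

```julia
using Statistics

# Stand-in for "masked pixel values per polygon"
zones = [rand(1000) for _ in 1:8]

means = Vector{Float64}(undef, length(zones))
Threads.@threads for i in eachindex(zones)  # each zone is independent, so this is safe
    means[i] = mean(zones[i])
end
```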

As for the mean, mask will broadcast across all dimensions, so if you mask with a raster with X, Y dims it will mask all those 60 layers at once. Then you can take the mean across dimension:

mean(raster; dims=(1, 2))

Because Rasters.jl builds on DimensionalData.jl, you can also do:

mean(raster; dims=(X, Y))
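As a plain-array sketch of that reduction (using only the Statistics stdlib; the ones array stands in for a 3601×3601×60 raster read into memory):

```julia
using Statistics

A = ones(4, 4, 3)         # stand-in for an X × Y × Band raster
m = mean(A; dims=(1, 2))  # 1×1×3 array: one mean per band
vec(m)                    # → [1.0, 1.0, 1.0], the per-band means
```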

But the most important thing to remember here is that a Raster is just an Array, with some extra things for spatial data, especially after read(raster). So what you need to learn are basic Julia array operations. Combining those with raster operations will eventually be more powerful than what you can do in Python or R. As long as other packages dispatch on AbstractArray, you can just use a Raster like an Array everywhere in the Julia ecosystem. And it will plot as a raster at the end. That’s the idea anyway.

I found the package ThreadTools and used its tmap. I don’t know whether it’s a wise choice compared to Threads.@threads, because due to the heat management on my machine, even with @btime, the multi-threading speed changed dramatically between two runs, up to +/- 50%.

Additionally, I thought that when the execution time gets so small, it is better to compute on a larger raster/vector. Therefore I tried mosaicing with:

dem_models = map(x->read(Raster(x)), filter(x->occursin("n24", x), readdir("./DEM"; join=true)))
dem_mosaic = mosaic(first, map(Raster, dem_models))

The problem is that it does not finish even after half an hour, which worries me for such small rasters.

Half an hour is certainly a bug… I will try to reproduce.

These methods are all very new and not at all battle-tested. Thanks for persevering with this, it’s really useful for getting things more solid.

Laptop threading tends to not do much in my experience, because of the heat throttling you mention. Writing a smarter algorithm is almost always better than jumping into threading, except when you know there is no more performance to squeeze.

I think you just missed a splat?

dem_mosaic = mosaic(first, dem_models...)

Otherwise you can use a Tuple. But a Vector of rasters isn’t an accepted input. I’ll limit the allowed inputs so that the hang you experienced can’t happen.

Edit: but also, these files are pretty large already - like 10 million pixels. Joining them will probably slow things down due to memory/cache limits. I would process them separately.

And finally, to join rasters that are next to each other you can just use cat(raster1, raster2; dims=Y). But you have to get them in the right order for it to work - e.g. cat(raster2, raster1; dims=Y) might be needed. cat is faster than mosaic because mosaic lets you combine images with different resolution.
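The same cat semantics on plain arrays, to show the ordering requirement:

```julia
a = [1 2; 3 4]
b = [5 6; 7 8]

cat(a, b; dims=2)  # → [1 2 5 6; 3 4 7 8]  (a's columns first, then b's)
cat(b, a; dims=2)  # → [5 6 1 2; 7 8 3 4]  (different order, different result)
```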

Ahh! Thanks!
I didn’t figure out that the Tuple could cause the problem. I thought everything that holds several pieces of data at once, be it Tuple/List/Array/Vector, might work similarly in this place.
What does the ellipsis operator do in your example above?

Why do you advise against mosaicing? How does the zonal stat scale with the number of pixels and the number of polygons? I thought it would scale constantly with the number of pixels and linearly with the number of polygons.

But if I have several rasters per vector file and don’t mosaic, I would have to:

  • filter out polygons that don’t overlap with the raster
  • do the zonal statistic
  • combine the data from several runs (one per raster)
  • combine data from the polygons that lie on the border of the raster files

How do I filter out polygons (of a GeoDataFrame) outside a Raster? How do I figure out which ones lie on the border?
Can I mask a Raster by the extent of another Raster (for instance like a projection to the metadata of a different file - is this the resample function)?

The main topic was the progress of the geo software stack. Did someone ever try out CNNs on Rasters with Flux? For example a U-Net on satellite data?

Yes, maybe a Vector or any iterable should also work. The ellipsis is called a “splat”. It splats an iterable into separate function arguments, and does the reverse in a function definition.
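A tiny illustration of both directions of the splat:

```julia
f(a, b, c) = a + b + c
args = (1, 2, 3)       # stand-in for e.g. a tuple of rasters

f(args...)             # splat at a call site: same as f(1, 2, 3), returns 6

g(xs...) = length(xs)  # in a definition, ... collects the arguments into a tuple
g(1, 2, 3)             # returns 3
```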

I thought you just wanted to mosaic for performance, which doesn’t make sense. But to simplify the problem it does.

Filtering polygons: you need to look at GeoDataFrames.jl and GeoInterface.jl. I don’t really use them much, so that’s not a question for me. Polygon manipulation is not as organised as e.g. sf in R.

To mask to another raster you can probably do something like mask(raster; to=resample(mask_raster; to=raster)), where mask_raster is the raster you are masking with.

You should be able to use rasters in Flux, even on a GPU. I used them on GPUs with DynamicGrids.jl all the time - it’s a big part of the design philosophy to make things like that work. That’s why everything is immutable.

Does this imply I could do the masking/zonal statistics on a GPU? Or does the algorithm not perform, or need to be changed?