Efficiently check if points are contained in polygons

Rasters.jl has inpolygon that you can pass a vector if points, which should be a lot faster.

1 Like

I find the code a bit difficult to read, but aren’t the sizes of the arrays and dataframes all the same? Iteratively resizing vectors over and over seems wasteful.

Also, getfeature and getgeom look to me like they are redoing the same work multiple times, but it’s not clear how expensive they are.

Forcing your input to be a vector with abstract eltype is a big red flag, make sure points is concrete. Though, I’m not sure that’s what you are doing:

Notice the = instead of :: after points.

In general, the code is a bit “messy”. You’re mixing in different concerns into a single function: reading a file, and parsing it into various objects, iterating over a vector of points, building a Dataframe. Isn’t it better to separate these tasks into different functions? I suspect that this will make it easier to identify bottlenecks, too.

1 Like

Thank you for your reply! I will change the array data type. Thank you for pointing out these C packages, do you know where to find tutorials for them? For now I will focus on learning Julia, but it would be nice to have a look at these packages.

Cool! I will try this! Thank you so much

I am afraid there is no tutorial. The LOBS library is extremely simple to use, and there is documentation in the form of comments. The idea of the library is to create a hierachical structure consisting of quickly-searchable bounding boxes of objects. Finding for instance an intersection of a line segment with a triangle, one would find all boxes containing triangles that overlap the box containing the segment. Then for each of the found boxes, test the triangle within the box for the intersection. Much cheaper than testing all triangles.

Meshes.jl also has an “in” operator for Polygons

I went to all the trouble of coding up BoundingBox code and found afterwards that it wasn’t needed, even for NGons with thousands of N

https://juliageometry.github.io/Meshes.jl/

GMT.jl can do this pretty quickly. When one reads a shape file (or other OGR formats) we get a vector of GMTdataset that has, among others, a boundingbox field for each polygon. So we can do the quick scan through to see if points are, or not, inside each of the polygon’s BB and if yes than use the gmtselect module that gets us the point-in-polygon answer.

using GMT

D = gmtread("wwf_ecos/wwf_terr_ecos.shp");
    This file has islands (holes in polygons).
    Use `gmtread(..., no_islands=true)` to ignore them.

and we can see that there are 14841 polygons in that shp file

length(D)
14841

Now, this little function reports if a point falls inside any of those polygons

function inecos(D, lon, lat)
	iswithin(bbox, lon, lat) = (lon >= bbox[1] && lon <= bbox[2] && lat >= bbox[3] && lat <= bbox[4])

	for k = 1:length(D)
		!iswithin(D[k].bbox, lon, lat) && continue
		r = gmtselect([lon lat], polygon=D[k])
		if (!isempty(r))
			println("Point falls inside polygon $(k)")
			break
		end
	end
	nothing
end

Taking a point that falls inside the last polygon we see that it takes ~2 milli sec in my computer

@time inecos(D, 178.8, 51.6)
Point falls inside polygon 14841
  0.002249 seconds (316 allocations: 25.219 KiB)

So you can process ~1000 points in 2 seconds. 10x faster than your R solution.

And what polygon did it fall into?

D[14841].attrib
Dict{String, String} with 21 entries:
  "REALM"      => "NA"
  "ECO_ID"     => "51102.00000000000"
  "G200_NUM"   => "0.00000000000"
  "ECO_NAME"   => "Aleutian Islands tundra"
  "Shape_Area" => "0.03981285655"
  "G200_BIOME" => "0.00000000000"
  "G200_STAT"  => "0.00000000000"
  "PER_area"   => "0.00000000000"
  "GBL_STAT"   => "3.00000000000"
  "AREA"       => "307.56219168200"
  "BIOME"      => "11.00000000000"
  "OBJECTID"   => "14923"
  "ECO_NUM"    => "2.00000000000"
  "area_km2"   => "12141"
  "PERIMETER"  => "2.05118966103"
  "PER_area_2" => "0.00000000000"
  "PER_area_1" => "0.00000000000"
  "eco_code"   => "NA1102"
  "G200_REGIO" => ""
...

BUT, I had to fix an issue with the gmtselect module and the fix is only in the GMT.jl master version (until I make a new release ofc).

Btw, the key to searching millions of polygons is to have a tree, so that not all boxes/polygons need to be tested.

Yeap, that’s index info in the .shx file (I think) but I’m using it (actually, I have no idea on how to use it :slight_smile: )

Thank you for pointing out this package, I will try it out!

Cool! Thanks for your reply! I will try your solution in the weekend. I am curious to compare the results from GMT.jl, Rasters.jl and Meshes.jl

1 Like

Notice that Meshes.jl is written in pure Julia, and unlike the alternatives that depend on external libraries (e.g. GDAL) it is not heavily optimized yet. A benchmark would be great to help us prioritize optimizations.

The good news is that you can easily read the source code and contribute with pull requests. We love PRs with speedups.

2 Likes

To clarify, Rasters.jl polygon ops are also pure julia via PolygonInbounds.jl and other internal code. It’s pretty fast for vectors of points, but not so much for single points. Its very fast for rasterize because the points of a raster are already sorted.

(Rasters.jl only uses GDAL for loading some file types, because GDAL has so many and is reliable, and for warping and reprojecting, because projection formats are hard)

3 Likes

Thank you all for the suggestions and feedback! As suggested, I tried the different packages: Rasters.jl, Meshes.jl, ArchGDAL.jl and GMT.jl. For me the fastest was to use GMT.jl. However, as a disclamer I would like to point out that I am reshaping the results inside the functions so that the function returns a dataframe of true/false and that, I guess, my code could be more efficient. If you have more suggestions, please let me know!

Here is the code:

using DataFrames
using Shapefile
using GeoInterface
using Tables
using GBIF
using Meshes
using GeoTables
using Rasters
using GMT
using BenchmarkTools
using ArchGDAL

# Download some occurrence data
obs = GBIF.occurrences("scientificName" => "Burramys parvus","hasCoordinate" => "true", "limit" => 100)
points = map(o -> (o.longitude, o.latitude), obs)

######################## Rasters ############################
# Read shapefile
geoms = Shapefile.shapes(Shapefile.Table("wwf_terr_ecos.shp"))

# Function using Rasters.jl
function overlay_rasters(points :: Array{Tuple{Float64, Float64},1}, geoms)
    # Initiate dataframe to hold results
    # How to efficiently create a dataframe of n rows and m cols filled with false?
    overlay_df = DataFrame([Vector{Bool}(undef, length(points)) for i in 1:length(geoms)], :auto)
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end

    for i in 1:length(geoms)
        # Overlay point with polygons
        overlay = inpolygon(points, geoms[i])
        
        # Transform result to table, then populate dataframe
        # A bitmatrix = 1 0 , means the point is in the polygon or not? What would be a good way to gather the results in the dataframe?
        overlay_tb = Tables.table(overlay)
        res = overlay_tb.Column1 + overlay_tb.Column2
        overlay_df[!, i] = res

    end
    return overlay_df
end

df_rasters = overlay_rasters(points, geoms)

################## Meshes #####################
# Read shapefile
geo_table = GeoTables.load("wwf_terr_ecos.shp")

function overlay_meshes(points :: Array{Tuple{Float64, Float64},1}, geo_table)
    # Initiate dataframe to hold results
    overlay_df = DataFrame([Vector{Bool}(undef, length(points)) for i in 1:length(geo_table.geometry)], :auto)
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end

    # Overlay points with polygons
    for j in 1:length(points)
        for i in 1:length(geo_table.geometry)
            if points[j] in geo_table.geometry[i]
                print("point ", j, " in ecoregion", i)
                overlay_df[j, i] = true
            break
            end 
        end
    end
    return overlay_df
end

df_meshes = overlay_meshes(points, geo_table)

############################# GMT ##############################
# Read shapefile
# Why the gmt.jl object has a different length than the objects read with rasters.jl, meshes.jl and ArchGDAL.jl?
geo_gmt = gmtread("wwf_terr_ecos.shp")
lon = map(o -> (o.longitude), obs)
lat = map(o -> (o.latitude), obs)

# Function using GMT.jl
function overlay_gmt(geo_gmt, lon, lat)
    # Initiate dataframe to hol results
    overlay_df = DataFrame([Vector{Bool}(undef, length(lon)) for i in 1:length(geo_gmt)], :auto)
    # Replace true by false
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end
    
	iswithin(bbox, lon, lat) = (lon >= bbox[1] && lon <= bbox[2] && lat >= bbox[3] && lat <= bbox[4])
    for j in 1:length(points)
        for i = 1:length(geo_gmt)
            !iswithin(geo_gmt[i].bbox, lon[j], lat[j]) && continue
            r = gmtselect([lon lat], polygon = geo_gmt[i])
            if (!isempty(r))
                # Populate dataframe with results
                overlay_df[j, i] = true 
                #println("Point falls inside polygon $(i)")
                break
            end
        end
    end
    return overlay_df
end

df_gmt = overlay_gmt(geo_gmt, lon, lat)

############################### ArchGDAL ######################################
# Read shapefile with ArchGDAL
eco_sf = ArchGDAL.getlayer(ArchGDAL.read("wwf_terr_ecos.shp"), 0)
lon = map(o -> (o.longitude), obs)
lat = map(o -> (o.latitude), obs)

function overlay_archgdal(eco_sf, lon, lat)
    # Transform GBIF occurrences in ArchGDAL point geometries
    points = ArchGDAL.createpoint.(lon, lat)

    # Initiate dataframe to hold results
    overlay_df = DataFrame([Vector{Bool}(undef, length(points)) for i in 1:length(eco_sf)], :auto)
    # Replace true by false
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end

    for j in 1:length(points)
        for i in 0:ArchGDAL.nfeature(eco_sf)-1
            ArchGDAL.getfeature(eco_sf, i) do feature
                ArchGDAL.getgeom(feature) do geom
                    # Check if point is contained in the ecoregion polygon
                    overlay_df[j, i+1] = ArchGDAL.contains(geom, points[j]) 
                end
            end
        end
    end
    return overlay_df
end

df_archgdal = overlay_archgdal(eco_sf, lon, lat)

# Benchmark using Benchmark.jl
@btime overlay_archgdal(eco_sf, lon, lat) 
@btime overlay_gmt(geo_gmt, lon, lat)
@btime overlay_meshes(points, geo_table)
@btime overlay_rasters(points, geoms)

I would consider switching the order of loops so that you don’t have to do getfeature and getgeom on each feature length(points) times.

Regarding the Meshes.jl code, avoid using geo_table.geometry because that access is not type stable. If you care about performance you can use:

geotable = GeoTables.load("foo.shp")

# convert Tuple to Meshes.Point
ps = Point.(points)

# get geometries from table using a type stable method
gs = domain(geotable)

# incidence matrix
[p ∈ g for p in ps, g in gs]

inpolygon in Rasters.jl is really a multi-point method, slow for single points. Its just for convenience that it even works for one point. It’s fast for many points by sorting the points and doing them all together.

But you’re looping and calling it for each point. You can instead just pass in the whole geometry as-is, and get back a vector of Bool.

2 Likes

For Meshes.jl, I found it super slow… I have a matrix of points, and hull() function needs a ‘set’ of points. So Creating Meshes.PointSet from a matrix takes all the time (about 15 minutes in my case).

Yes, we only have a method that checks each point at a time. There are other algorithms as others pointed out for multiple points at once. Pull requests are welcome.

I ended up making a Julia function from wiki’s algorithm pseudocode… works fast enough and no memory allocations, but its too off the ideology of Meshes.jl. would be not simple to integrate. I have a case where I’m making a convex hull for just a few points at once, but 10 - 50 million times. So need everything to be fast.