Efficiently check if points are contained in polygons

ahmoreira · January 11, 2022, 5:50pm

Hi all, I am new to Julia and I am trying to check if points (plant occurrence points) fall within polygons from a shapefile. The code I wrote works, but I would like to make it more efficient. Currently, the code takes 52 seconds for 100 points. In R, using the function sf::st_join, it takes 2 seconds. Increasing efficiency is important because I will have to analyze millions of points.

The shapefile can be found here: https://www.worldwildlife.org/publications/terrestrial-ecoregions-of-the-world

Here is the function that I wrote:

using DataFrames
using ArchGDAL

# points
occurrence_points = ArchGDAL.createpoint.([-1.55, -2.57], [39.52, 42.5])

# Function to check if points overlay with ecoregions polygons
function overlay_ecoregions(points = Array{Tuple{Real, Real},1})
    # Read shapefile with ArchGDAL
    dataset_sf = ArchGDAL.read("wwf_terr_ecos.shp")
    eco_sf = ArchGDAL.getlayer(dataset_sf, 0)

    # Initiate dataframe and vector for the results
    overlay_df = DataFrame([Vector{Bool}(undef, 0) for i in 1:ArchGDAL.nfeature(eco_sf)])
    overlay = Vector{Bool}()

    for j in points
        for i in 0:ArchGDAL.nfeature(eco_sf)-1
            ArchGDAL.getfeature(eco_sf, i) do feature
                ArchGDAL.getgeom(feature) do geom
                    # Check if point is contained in the ecoregion polygon
                    append!(overlay, ArchGDAL.contains(geom, j))
                end
            end
        end
        # Add vector to dataframe
        push!(overlay_df, overlay)
        #print(findall( x -> x == j, points), "\n")

        # Reset vector to hold new values
        overlay = Vector{Bool}()
    end
    return overlay_df
end

overlay_df = overlay_ecoregions(occurrence_points)

Thank you a lot for the help!

PetrKryslUCSD · January 11, 2022, 6:00pm

Just a note: Real is an abstract data type. A type parameter that introduces a concrete element type might be better.
Otherwise, it looks like the computation is taking place in ArchGDAL. It might be worthwhile to check the efficiency of this particular operation in the package.

I have done a lot of thinking about how to search for points (mesh generation with the advancing front method, and meshless discretization methods). What I have learned is that radical, cheap pruning of the search space is all-important (various tree methods might be helpful). For instance: there is no point checking a polygon for the containment of the location when the bounding box of the polygon does not contain the location in the first place.

Edit: These are not Julia packages, they are C-language libraries, but they may be nevertheless useful.
LOBS and PiRaT in Petr Krysl's Software Page
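
The bounding-box pruning described above can be sketched in plain Julia. This is only a minimal illustration, not code from any of the packages mentioned in this thread: polygons are assumed to be simple rings given as vectors of (x, y) tuples, and the names Pt, bbox, inbox, inpoly and locate are hypothetical. The containment test is a basic even-odd ray cast that ignores holes and points lying exactly on an edge.

```julia
# A polygon is a vector of (x, y) vertices; the ring is closed implicitly.
const Pt = Tuple{Float64,Float64}

# Axis-aligned bounding box as (xmin, xmax, ymin, ymax).
function bbox(poly::Vector{Pt})
    xs = first.(poly)
    ys = last.(poly)
    return (minimum(xs), maximum(xs), minimum(ys), maximum(ys))
end

# Cheap test: is p inside the box b?
inbox(b, p::Pt) = b[1] <= p[1] <= b[2] && b[3] <= p[2] <= b[4]

# Even-odd ray casting (no special handling of points on an edge).
function inpoly(poly::Vector{Pt}, p::Pt)
    inside = false
    j = length(poly)
    for i in eachindex(poly)
        xi, yi = poly[i]
        xj, yj = poly[j]
        if (yi > p[2]) != (yj > p[2]) &&
           p[1] < (xj - xi) * (p[2] - yi) / (yj - yi) + xi
            inside = !inside
        end
        j = i
    end
    return inside
end

# Index of the first polygon containing p, or nothing.
function locate(polys::Vector{Vector{Pt}}, boxes, p::Pt)
    for k in eachindex(polys)
        inbox(boxes[k], p) || continue    # cheap rejection
        inpoly(polys[k], p) && return k   # expensive test only for candidates
    end
    return nothing
end

polys = [[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
         [(2.0, 2.0), (4.0, 2.0), (3.0, 4.0)]]
boxes = bbox.(polys)   # computed once, reused for every point
```

Computing the boxes once up front means that, for most polygons, a query costs four comparisons instead of a full pass over the vertices.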

Raf · January 11, 2022, 10:41pm

Rasters.jl has inpolygon that you can pass a vector of points, which should be a lot faster.

DNF · January 11, 2022, 11:41pm

I find the code a bit difficult to read, but aren’t the sizes of the arrays and dataframes all the same? Iteratively resizing vectors over and over seems wasteful.

Also, getfeature and getgeom look to me like they are redoing the same work multiple times, but it’s not clear how expensive they are.

Forcing your input to be a vector with abstract eltype is a big red flag; make sure points is concrete. Though I’m not sure that’s what you are doing:

function overlay_ecoregions(points = Array{Tuple{Real, Real},1})

Notice the = instead of :: after points.

In general, the code is a bit “messy”. You’re mixing in different concerns into a single function: reading a file, and parsing it into various objects, iterating over a vector of points, building a Dataframe. Isn’t it better to separate these tasks into different functions? I suspect that this will make it easier to identify bottlenecks, too.
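
The difference between = and :: in a signature can be checked with a small sketch. The function names with_default and npoints are hypothetical and the points are plain tuples, not ArchGDAL geometries; the sketch only shows that = sets a default value while :: constrains the argument, and that a type parameter keeps the element type concrete.

```julia
# `=` makes the type itself the default value; it does not constrain the argument.
with_default(points = Vector{Tuple{Real,Real}}) = points

# `::` constrains the argument; the parameter T keeps the element type concrete.
npoints(points::Vector{Tuple{T,T}}) where {T<:Real} = length(points)

pts = [(-1.55, 39.52), (-2.57, 42.5)]

default_value  = with_default()                    # the type, not any data
concrete_elt   = isconcretetype(eltype(pts))       # eltype is Tuple{Float64, Float64}
matches_absvec = pts isa Vector{Tuple{Real,Real}}  # Vector is invariant in its eltype
n = npoints(pts)
```

Here matches_absvec is false, so a signature written as ::Vector{Tuple{Real,Real}} would reject an ordinary vector of Float64 tuples, whereas the parametric version accepts it.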

ahmoreira · January 12, 2022, 7:20pm

Thank you for your reply! I will change the array data type. Thank you for pointing out these C packages, do you know where to find tutorials for them? For now I will focus on learning Julia, but it would be nice to have a look at these packages.

ahmoreira · January 12, 2022, 7:21pm

Cool! I will try this! Thank you so much

PetrKryslUCSD · January 13, 2022, 2:24am

I am afraid there is no tutorial. The LOBS library is extremely simple to use, and there is documentation in the form of comments. The idea of the library is to create a hierarchical structure consisting of quickly searchable bounding boxes of objects. To find, for instance, an intersection of a line segment with a triangle, one would first find all boxes containing triangles that overlap the box containing the segment. Then, for each of the found boxes, one tests the triangle within the box for the intersection. This is much cheaper than testing all triangles.
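
A hierarchy of bounding boxes of this kind might look roughly like the sketch below. It is not the LOBS implementation or its API; it is a small median-split box tree written for illustration, and Box, Node, build and candidates! are hypothetical names. The query returns the indices of all object boxes that contain a point, and those candidates would then go to the exact geometric test.

```julia
const Box = NTuple{4,Float64}   # (xmin, xmax, ymin, ymax)

struct Node
    box::Box                      # box enclosing everything below this node
    left::Union{Node,Nothing}
    right::Union{Node,Nothing}
    items::Vector{Int}            # object indices; non-empty only in leaves
end

merge_boxes(a::Box, b::Box) =
    (min(a[1], b[1]), max(a[2], b[2]), min(a[3], b[3]), max(a[4], b[4]))
contains_pt(b::Box, x, y) = b[1] <= x <= b[2] && b[3] <= y <= b[4]

function build(boxes::Vector{Box}, idx::Vector{Int} = collect(eachindex(boxes));
               leafsize::Int = 2)
    box = reduce(merge_boxes, boxes[idx])
    length(idx) <= leafsize && return Node(box, nothing, nothing, idx)
    # Split at the median of the box centres along the wider axis.
    axis = (box[2] - box[1]) >= (box[4] - box[3]) ? 1 : 3
    order = sort(idx; by = i -> boxes[i][axis] + boxes[i][axis + 1])
    mid = length(order) ÷ 2
    return Node(box,
                build(boxes, order[1:mid]; leafsize = leafsize),
                build(boxes, order[mid+1:end]; leafsize = leafsize),
                Int[])
end

function candidates!(out::Vector{Int}, node::Node, boxes::Vector{Box}, x, y)
    contains_pt(node.box, x, y) || return out   # prune the whole subtree
    for i in node.items
        contains_pt(boxes[i], x, y) && push!(out, i)
    end
    node.left === nothing || candidates!(out, node.left, boxes, x, y)
    node.right === nothing || candidates!(out, node.right, boxes, x, y)
    return out
end

boxes = Box[(0.0, 1.0, 0.0, 1.0), (0.5, 2.0, 0.5, 2.0),
            (5.0, 6.0, 5.0, 6.0), (9.0, 10.0, 0.0, 1.0)]
tree = build(boxes)
hits = candidates!(Int[], tree, boxes, 0.75, 0.75)
```

A point that lies outside a node's box never descends into that subtree, which is where the savings over a flat scan of all boxes come from.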

lawless-m · January 13, 2022, 9:44am

Meshes.jl also has an “in” operator for Polygons

I went to all the trouble of coding up BoundingBox code and found afterwards that it wasn’t needed, even for NGons with thousands of N

https://juliageometry.github.io/Meshes.jl/

joa-quim · January 13, 2022, 1:49pm

GMT.jl can do this pretty quickly. When one reads a shape file (or other OGR formats) we get a vector of GMTdataset that has, among others, a boundingbox field for each polygon. So we can do a quick scan to see whether a point is inside each polygon’s BB and, if yes, then use the gmtselect module, which gives us the point-in-polygon answer.

using GMT

D = gmtread("wwf_ecos/wwf_terr_ecos.shp");
    This file has islands (holes in polygons).
    Use `gmtread(..., no_islands=true)` to ignore them.

and we can see that there are 14841 polygons in that shp file

length(D)
14841

Now, this little function reports if a point falls inside any of those polygons

function inecos(D, lon, lat)
	iswithin(bbox, lon, lat) = (lon >= bbox[1] && lon <= bbox[2] && lat >= bbox[3] && lat <= bbox[4])

	for k = 1:length(D)
		!iswithin(D[k].bbox, lon, lat) && continue
		r = gmtselect([lon lat], polygon=D[k])
		if (!isempty(r))
			println("Point falls inside polygon $(k)")
			break
		end
	end
	nothing
end

Taking a point that falls inside the last polygon, we see that it takes ~2 ms on my computer

@time inecos(D, 178.8, 51.6)
Point falls inside polygon 14841
  0.002249 seconds (316 allocations: 25.219 KiB)

So you can process ~1000 points in 2 seconds. 10x faster than your R solution.

And what polygon did it fall into?

D[14841].attrib
Dict{String, String} with 21 entries:
  "REALM"      => "NA"
  "ECO_ID"     => "51102.00000000000"
  "G200_NUM"   => "0.00000000000"
  "ECO_NAME"   => "Aleutian Islands tundra"
  "Shape_Area" => "0.03981285655"
  "G200_BIOME" => "0.00000000000"
  "G200_STAT"  => "0.00000000000"
  "PER_area"   => "0.00000000000"
  "GBL_STAT"   => "3.00000000000"
  "AREA"       => "307.56219168200"
  "BIOME"      => "11.00000000000"
  "OBJECTID"   => "14923"
  "ECO_NUM"    => "2.00000000000"
  "area_km2"   => "12141"
  "PERIMETER"  => "2.05118966103"
  "PER_area_2" => "0.00000000000"
  "PER_area_1" => "0.00000000000"
  "eco_code"   => "NA1102"
  "G200_REGIO" => ""
...

BUT, I had to fix an issue with the gmtselect module and the fix is only in the GMT.jl master version (until I make a new release ofc).

PetrKryslUCSD · January 13, 2022, 3:10pm

Btw, the key to searching millions of polygons is to have a tree, so that not all boxes/polygons need to be tested.

joa-quim · January 13, 2022, 3:27pm

Yeap, that’s index info in the .shx file (I think) but I’m not using it (actually, I have no idea how to use it)

ahmoreira · January 14, 2022, 8:29am

Thank you for pointing out this package, I will try it out!

ahmoreira · January 14, 2022, 8:34am

Cool! Thanks for your reply! I will try your solution over the weekend. I am curious to compare the results from GMT.jl, Rasters.jl and Meshes.jl.

juliohm · January 14, 2022, 9:01am

Notice that Meshes.jl is written in pure Julia, and unlike the alternatives that depend on external libraries (e.g. GDAL) it is not heavily optimized yet. A benchmark would be great to help us prioritize optimizations.

The good news is that you can easily read the source code and contribute with pull requests. We love PRs with speedups.

Raf · January 14, 2022, 9:55am

To clarify, Rasters.jl polygon ops are also pure Julia via PolygonInbounds.jl and other internal code. It’s pretty fast for vectors of points, but not so much for single points. It’s very fast for rasterize because the points of a raster are already sorted.

(Rasters.jl only uses GDAL for loading some file types, because GDAL has so many and is reliable, and for warping and reprojecting, because projection formats are hard)

ahmoreira · January 25, 2022, 6:57pm

Thank you all for the suggestions and feedback! As suggested, I tried the different packages: Rasters.jl, Meshes.jl, ArchGDAL.jl and GMT.jl. For me the fastest was GMT.jl. However, as a disclaimer, I would like to point out that I am reshaping the results inside the functions so that each function returns a dataframe of true/false, and I guess my code could be more efficient. If you have more suggestions, please let me know!

Here is the code:

using DataFrames
using Shapefile
using GeoInterface
using Tables
using GBIF
using Meshes
using GeoTables
using Rasters
using GMT
using BenchmarkTools
using ArchGDAL

# Download some occurrence data
obs = GBIF.occurrences("scientificName" => "Burramys parvus","hasCoordinate" => "true", "limit" => 100)
points = map(o -> (o.longitude, o.latitude), obs)

######################## Rasters ############################
# Read shapefile
geoms = Shapefile.shapes(Shapefile.Table("wwf_terr_ecos.shp"))

# Function using Rasters.jl
function overlay_rasters(points :: Array{Tuple{Float64, Float64},1}, geoms)
    # Initiate dataframe to hold results
    # How to efficiently create a dataframe of n rows and m cols filled with false?
    overlay_df = DataFrame([Vector{Bool}(undef, length(points)) for i in 1:length(geoms)], :auto)
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end

    for i in 1:length(geoms)
        # Overlay point with polygons
        overlay = inpolygon(points, geoms[i])
        
        # Transform result to table, then populate dataframe
        # A bitmatrix = 1 0 , means the point is in the polygon or not? What would be a good way to gather the results in the dataframe?
        overlay_tb = Tables.table(overlay)
        res = overlay_tb.Column1 + overlay_tb.Column2
        overlay_df[!, i] = res

    end
    return overlay_df
end

df_rasters = overlay_rasters(points, geoms)

################## Meshes #####################
# Read shapefile
geo_table = GeoTables.load("wwf_terr_ecos.shp")

function overlay_meshes(points :: Array{Tuple{Float64, Float64},1}, geo_table)
    # Initiate dataframe to hold results
    overlay_df = DataFrame([Vector{Bool}(undef, length(points)) for i in 1:length(geo_table.geometry)], :auto)
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end

    # Overlay points with polygons
    for j in 1:length(points)
        for i in 1:length(geo_table.geometry)
            if points[j] in geo_table.geometry[i]
                println("point ", j, " in ecoregion ", i)
                overlay_df[j, i] = true
                break
            end
        end
    end
    return overlay_df
end

df_meshes = overlay_meshes(points, geo_table)

############################# GMT ##############################
# Read shapefile
# Why does the GMT.jl object have a different length than the objects read with Rasters.jl, Meshes.jl and ArchGDAL.jl?
geo_gmt = gmtread("wwf_terr_ecos.shp")
lon = map(o -> (o.longitude), obs)
lat = map(o -> (o.latitude), obs)

# Function using GMT.jl
function overlay_gmt(geo_gmt, lon, lat)
    # Initiate dataframe to hold results
    overlay_df = DataFrame([Vector{Bool}(undef, length(lon)) for i in 1:length(geo_gmt)], :auto)
    # Replace true by false
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end
    
    iswithin(bbox, lon, lat) = (lon >= bbox[1] && lon <= bbox[2] && lat >= bbox[3] && lat <= bbox[4])
    for j in 1:length(lon)
        for i = 1:length(geo_gmt)
            !iswithin(geo_gmt[i].bbox, lon[j], lat[j]) && continue
            # Test only point j against polygon i
            r = gmtselect([lon[j] lat[j]], polygon = geo_gmt[i])
            if (!isempty(r))
                # Populate dataframe with results
                overlay_df[j, i] = true 
                #println("Point falls inside polygon $(i)")
                break
            end
        end
    end
    return overlay_df
end

df_gmt = overlay_gmt(geo_gmt, lon, lat)

############################### ArchGDAL ######################################
# Read shapefile with ArchGDAL
eco_sf = ArchGDAL.getlayer(ArchGDAL.read("wwf_terr_ecos.shp"), 0)
lon = map(o -> (o.longitude), obs)
lat = map(o -> (o.latitude), obs)

function overlay_archgdal(eco_sf, lon, lat)
    # Transform GBIF occurrences in ArchGDAL point geometries
    points = ArchGDAL.createpoint.(lon, lat)

    # Initiate dataframe to hold results
    overlay_df = DataFrame([Vector{Bool}(undef, length(points)) for i in 1:length(eco_sf)], :auto)
    # Replace true by false
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end

    for j in 1:length(points)
        for i in 0:ArchGDAL.nfeature(eco_sf)-1
            ArchGDAL.getfeature(eco_sf, i) do feature
                ArchGDAL.getgeom(feature) do geom
                    # Check if point is contained in the ecoregion polygon
                    overlay_df[j, i+1] = ArchGDAL.contains(geom, points[j]) 
                end
            end
        end
    end
    return overlay_df
end

df_archgdal = overlay_archgdal(eco_sf, lon, lat)

# Benchmark using BenchmarkTools.jl (globals are interpolated with $)
@btime overlay_archgdal($eco_sf, $lon, $lat)
@btime overlay_gmt($geo_gmt, $lon, $lat)
@btime overlay_meshes($points, $geo_table)
@btime overlay_rasters($points, $geoms)
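
On the question in the code comments about creating an all-false table: a possible shortcut, sketched with Base Julia only, is to allocate the Boolean matrix directly with falses instead of filling undef vectors and resetting them in a loop. The sizes and variable names below are illustrative, and the DataFrames.jl call mentioned in the final comment is not executed here.

```julia
npoints, npolys = 3, 4

# Preallocated all-false matrix: rows are points, columns are polygons.
hits = falses(npoints, npolys)

# Record that point 2 falls in polygon 3.
hits[2, 3] = true

# Per-column vectors, if a column-oriented container is preferred.
cols = [fill(false, npoints) for _ in 1:npolys]

# With DataFrames.jl loaded, `DataFrame(hits, :auto)` should wrap the matrix
# in a table with auto-generated column names (not run in this sketch).
```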

Jeff_Emanuel · January 25, 2022, 7:24pm

I would consider switching the order of loops so that you don’t have to do getfeature and getgeom on each feature length(points) times.
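
The effect of the loop order can be illustrated with a toy sketch in which load_interval is a hypothetical stand-in for the per-feature getfeature and getgeom calls, and a 1-D interval plays the role of a polygon. It only counts how often the expensive read happens; it does not use ArchGDAL.

```julia
# Hypothetical stand-in for an expensive per-feature read; counts its calls.
const LOADS = Ref(0)
function load_interval(features, i)
    LOADS[] += 1
    return features[i]
end

# Toy containment test: a 1-D "polygon" is an interval (lo, hi).
inside(interval, x) = interval[1] <= x <= interval[2]

# Points in the outer loop: every feature is read once per point.
function points_outer(features, xs)
    hits = falses(length(xs), length(features))
    for (j, x) in enumerate(xs), i in eachindex(features)
        hits[j, i] = inside(load_interval(features, i), x)
    end
    return hits
end

# Features in the outer loop: every feature is read exactly once.
function features_outer(features, xs)
    hits = falses(length(xs), length(features))
    for i in eachindex(features)
        g = load_interval(features, i)
        for (j, x) in enumerate(xs)
            hits[j, i] = inside(g, x)
        end
    end
    return hits
end

features = [(0.0, 1.0), (2.0, 3.0), (5.0, 8.0)]
xs = [0.5, 2.5, 6.0, 9.0]

LOADS[] = 0; a = points_outer(features, xs);   loads_a = LOADS[]
LOADS[] = 0; b = features_outer(features, xs); loads_b = LOADS[]
```

Both versions fill the same matrix, but the first performs length(xs) * length(features) reads and the second only length(features).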

juliohm · January 25, 2022, 7:50pm

Regarding the Meshes.jl code, avoid using geo_table.geometry because that access is not type stable. If you care about performance you can use:

geotable = GeoTables.load("foo.shp")

# convert Tuple to Meshes.Point
ps = Point.(points)

# get geometries from table using a type stable method
gs = domain(geotable)

# incidence matrix
[p ∈ g for p in ps, g in gs]
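
Once such an incidence matrix exists, the polygon index of each point can be read off with Base functions. The sketch below uses a small hand-written Bool matrix as a stand-in for the comprehension above (rows are points, columns are polygons); the variable names are illustrative.

```julia
# Stand-in for the incidence matrix: rows are points, columns are polygons.
incidence = Bool[0 1 0;
                 0 0 0;
                 1 0 0;
                 0 0 1]

# For each point, the index of the first polygon containing it (or nothing).
first_hit = [findfirst(row) for row in eachrow(incidence)]

# Points that fall in no polygon at all.
unmatched = findall(isnothing, first_hit)
```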

Raf · January 25, 2022, 7:58pm

inpolygon in Rasters.jl is really a multi-point method, slow for single points. It’s just for convenience that it even works for one point. It’s fast for many points by sorting the points and doing them all together.

But you’re looping and calling it for each point. You can instead just pass in the whole geometry as-is, and get back a vector of Bool.

ig-or · April 27, 2023, 5:02am

For Meshes.jl, I found it super slow… I have a matrix of points, and the hull() function needs a ‘set’ of points, so creating a Meshes.PointSet from a matrix takes all the time (about 15 minutes in my case).
