Efficiently check if points are contained in polygons

Hi all, I am new to Julia and I am trying to check if points (plant occurrence points) fall within polygons from a shapefile. The code I wrote works, but I would like to make it more efficient. Currently, the code takes 52 seconds for 100 points. In R, using the function sf::st_join, it takes 2 seconds. Increasing efficiency is important because I will have to analyze millions of points.

The shapefile can be found here: https://www.worldwildlife.org/publications/terrestrial-ecoregions-of-the-world

Here is the function that I wrote:

using DataFrames
using ArchGDAL

# points
occurrence_points = ArchGDAL.createpoint.([-1.55, -2.57], [39.52, 42.5])

# Function to check if points overlay with ecoregions polygons
function overlay_ecoregions(points = Array{Tuple{Real, Real},1})
    # Read shapefile with ArchGDAL
    dataset_sf = ArchGDAL.read("wwf_terr_ecos.shp")
    eco_sf = ArchGDAL.getlayer(dataset_sf, 0)

    # Initiate dataframe and vector for the results
    overlay_df = DataFrame([Vector{Bool}(undef, 0) for i in 1:ArchGDAL.nfeature(eco_sf)])
    overlay = Vector{Bool}()

    for j in points
        for i in 0:ArchGDAL.nfeature(eco_sf)-1
            ArchGDAL.getfeature(eco_sf, i) do feature
                ArchGDAL.getgeom(feature) do geom
                    # Check if point is contained in the ecoregion polygon
                    append!(overlay, ArchGDAL.contains(geom, j))
                end
            end
        end
        # Add vector to dataframe
        push!(overlay_df, overlay)
        #print(findall( x -> x == j, points), "\n")

        # Reset vector to hold new values
        overlay = Vector{Bool}()
    end
    return overlay_df
end

overlay_df = overlay_ecoregions(occurrence_points)

Thank you a lot for the help!

1 Like

Just a note: Real is an abstract data type. Template introducing a concrete data type might be better.
Otherwise, it looks like the computation is taking place in ArchGDAL. Perhaps it might be worthwhile to check the efficiency of this particular operation in the package.

I have done a lot of thinking about how to search for points (mesh generation with the advancing front method, and meshless discretization methods): What I have learned is the all-important radical and cheap pruning of the search space (various tree methods might be helpful). For instance: there is no point checking a polygon for the containment of the location, when the bounding box of the polygon does not contain the location in the first place.

Edit: These are not Julia packages, they are C-language libraries, but they may be nevertheless useful.
LOBS and PiRaT in Petr Krysl's Software Page

4 Likes

Rasters.jl has inpolygon that you can pass a vector if points, which should be a lot faster.

1 Like

I find the code a bit difficult to read, but aren’t the sizes of the arrays and dataframes all the same? Iteratively resizing vectors over and over seems wasteful.

Also, getfeature and getgeom look to me like they are redoing the same work multiple times, but it’s not clear how expensive they are.

Forcing your input to be a vector with abstract eltype is a big red flag, make sure points is concrete. Though, I’m not sure that’s what you are doing:

Notice the = instead of :: after points.

In general, the code is a bit “messy”. You’re mixing in different concerns into a single function: reading a file, and parsing it into various objects, iterating over a vector of points, building a Dataframe. Isn’t it better to separate these tasks into different functions? I suspect that this will make it easier to identify bottlenecks, too.

1 Like

Thank you for your reply! I will change the array data type. Thank you for pointing out these C packages, do you know where to find tutorials for them? For now I will focus on learning Julia, but it would be nice to have a look at these packages.

Cool! I will try this! Thank you so much

I am afraid there is no tutorial. The LOBS library is extremely simple to use, and there is documentation in the form of comments. The idea of the library is to create a hierachical structure consisting of quickly-searchable bounding boxes of objects. Finding for instance an intersection of a line segment with a triangle, one would find all boxes containing triangles that overlap the box containing the segment. Then for each of the found boxes, test the triangle within the box for the intersection. Much cheaper than testing all triangles.

Meshes.jl also has an “in” operator for Polygons

I went to all the trouble of coding up BoundingBox code and found afterwards that it wasn’t needed, even for NGons with thousands of N

https://juliageometry.github.io/Meshes.jl/

GMT.jl can do this pretty quickly. When one reads a shape file (or other OGR formats) we get a vector of GMTdataset that has, among others, a boundingbox field for each polygon. So we can do the quick scan through to see if points are, or not, inside each of the polygon’s BB and if yes than use the gmtselect module that gets us the point-in-polygon answer.

using GMT

D = gmtread("wwf_ecos/wwf_terr_ecos.shp");
    This file has islands (holes in polygons).
    Use `gmtread(..., no_islands=true)` to ignore them.

and we can see that there are 14841 polygons in that shp file

length(D)
14841

Now, this little function reports if a point falls inside any of those polygons

function inecos(D, lon, lat)
	iswithin(bbox, lon, lat) = (lon >= bbox[1] && lon <= bbox[2] && lat >= bbox[3] && lat <= bbox[4])

	for k = 1:length(D)
		!iswithin(D[k].bbox, lon, lat) && continue
		r = gmtselect([lon lat], polygon=D[k])
		if (!isempty(r))
			println("Point falls inside polygon $(k)")
			break
		end
	end
	nothing
end

Taking a point that falls inside the last polygon we see that it takes ~2 milli sec in my computer

@time inecos(D, 178.8, 51.6)
Point falls inside polygon 14841
  0.002249 seconds (316 allocations: 25.219 KiB)

So you can process ~1000 points in 2 seconds. 10x faster than your R solution.

And what polygon did it fall into?

D[14841].attrib
Dict{String, String} with 21 entries:
  "REALM"      => "NA"
  "ECO_ID"     => "51102.00000000000"
  "G200_NUM"   => "0.00000000000"
  "ECO_NAME"   => "Aleutian Islands tundra"
  "Shape_Area" => "0.03981285655"
  "G200_BIOME" => "0.00000000000"
  "G200_STAT"  => "0.00000000000"
  "PER_area"   => "0.00000000000"
  "GBL_STAT"   => "3.00000000000"
  "AREA"       => "307.56219168200"
  "BIOME"      => "11.00000000000"
  "OBJECTID"   => "14923"
  "ECO_NUM"    => "2.00000000000"
  "area_km2"   => "12141"
  "PERIMETER"  => "2.05118966103"
  "PER_area_2" => "0.00000000000"
  "PER_area_1" => "0.00000000000"
  "eco_code"   => "NA1102"
  "G200_REGIO" => ""
...

BUT, I had to fix an issue with the gmtselect module and the fix is only in the GMT.jl master version (until I make a new release ofc).

Btw, the key to searching millions of polygons is to have a tree, so that not all boxes/polygons need to be tested.

Yeap, that’s index info in the .shx file (I think) but I’m using it (actually, I have no idea on how to use it :slight_smile: )

Thank you for pointing out this package, I will try it out!

Cool! Thanks for your reply! I will try your solution in the weekend. I am curious to compare the results from GMT.jl, Rasters.jl and Meshes.jl

1 Like

Notice that Meshes.jl is written in pure Julia, and unlike the alternatives that depend on external libraries (e.g. GDAL) it is not heavily optimized yet. A benchmark would be great to help us prioritize optimizations.

The good news is that you can easily read the source code and contribute with pull requests. We love PRs with speedups.

2 Likes

To clarify, Rasters.jl polygon ops are also pure julia via PolygonInbounds.jl and other internal code. It’s pretty fast for vectors of points, but not so much for single points. Its very fast for rasterize because the points of a raster are already sorted.

(Rasters.jl only uses GDAL for loading some file types, because GDAL has so many and is reliable, and for warping and reprojecting, because projection formats are hard)

3 Likes

Thank you all for the suggestions and feedback! As suggested, I tried the different packages: Rasters.jl, Meshes.jl, ArchGDAL.jl and GMT.jl. For me the fastest was to use GMT.jl. However, as a disclamer I would like to point out that I am reshaping the results inside the functions so that the function returns a dataframe of true/false and that, I guess, my code could be more efficient. If you have more suggestions, please let me know!

Here is the code:

using DataFrames
using Shapefile
using GeoInterface
using Tables
using GBIF
using Meshes
using GeoTables
using Rasters
using GMT
using BenchmarkTools
using ArchGDAL

# Download some occurrence data
obs = GBIF.occurrences("scientificName" => "Burramys parvus","hasCoordinate" => "true", "limit" => 100)
points = map(o -> (o.longitude, o.latitude), obs)

######################## Rasters ############################
# Read shapefile
geoms = Shapefile.shapes(Shapefile.Table("wwf_terr_ecos.shp"))

# Function using Rasters.jl
function overlay_rasters(points :: Array{Tuple{Float64, Float64},1}, geoms)
    # Initiate dataframe to hold results
    # How to efficiently create a dataframe of n rows and m cols filled with false?
    overlay_df = DataFrame([Vector{Bool}(undef, length(points)) for i in 1:length(geoms)], :auto)
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end

    for i in 1:length(geoms)
        # Overlay point with polygons
        overlay = inpolygon(points, geoms[i])
        
        # Transform result to table, then populate dataframe
        # A bitmatrix = 1 0 , means the point is in the polygon or not? What would be a good way to gather the results in the dataframe?
        overlay_tb = Tables.table(overlay)
        res = overlay_tb.Column1 + overlay_tb.Column2
        overlay_df[!, i] = res

    end
    return overlay_df
end

df_rasters = overlay_rasters(points, geoms)

################## Meshes #####################
# Read shapefile
geo_table = GeoTables.load("wwf_terr_ecos.shp")

function overlay_meshes(points :: Array{Tuple{Float64, Float64},1}, geo_table)
    # Initiate dataframe to hold results
    overlay_df = DataFrame([Vector{Bool}(undef, length(points)) for i in 1:length(geo_table.geometry)], :auto)
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end

    # Overlay points with polygons
    for j in 1:length(points)
        for i in 1:length(geo_table.geometry)
            if points[j] in geo_table.geometry[i]
                print("point ", j, " in ecoregion", i)
                overlay_df[j, i] = true
            break
            end 
        end
    end
    return overlay_df
end

df_meshes = overlay_meshes(points, geo_table)

############################# GMT ##############################
# Read shapefile
# Why the gmt.jl object has a different length than the objects read with rasters.jl, meshes.jl and ArchGDAL.jl?
geo_gmt = gmtread("wwf_terr_ecos.shp")
lon = map(o -> (o.longitude), obs)
lat = map(o -> (o.latitude), obs)

# Function using GMT.jl
function overlay_gmt(geo_gmt, lon, lat)
    # Initiate dataframe to hol results
    overlay_df = DataFrame([Vector{Bool}(undef, length(lon)) for i in 1:length(geo_gmt)], :auto)
    # Replace true by false
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end
    
	iswithin(bbox, lon, lat) = (lon >= bbox[1] && lon <= bbox[2] && lat >= bbox[3] && lat <= bbox[4])
    for j in 1:length(points)
        for i = 1:length(geo_gmt)
            !iswithin(geo_gmt[i].bbox, lon[j], lat[j]) && continue
            r = gmtselect([lon lat], polygon = geo_gmt[i])
            if (!isempty(r))
                # Populate dataframe with results
                overlay_df[j, i] = true 
                #println("Point falls inside polygon $(i)")
                break
            end
        end
    end
    return overlay_df
end

df_gmt = overlay_gmt(geo_gmt, lon, lat)

############################### ArchGDAL ######################################
# Read shapefile with ArchGDAL
eco_sf = ArchGDAL.getlayer(ArchGDAL.read("wwf_terr_ecos.shp"), 0)
lon = map(o -> (o.longitude), obs)
lat = map(o -> (o.latitude), obs)

function overlay_archgdal(eco_sf, lon, lat)
    # Transform GBIF occurrences in ArchGDAL point geometries
    points = ArchGDAL.createpoint.(lon, lat)

    # Initiate dataframe to hold results
    overlay_df = DataFrame([Vector{Bool}(undef, length(points)) for i in 1:length(eco_sf)], :auto)
    # Replace true by false
    for j in 1:nrow(overlay_df)
        for i in 1:ncol(overlay_df)
            if overlay_df[j,i] == true
                overlay_df[j,i] = false
            end
        end
    end

    for j in 1:length(points)
        for i in 0:ArchGDAL.nfeature(eco_sf)-1
            ArchGDAL.getfeature(eco_sf, i) do feature
                ArchGDAL.getgeom(feature) do geom
                    # Check if point is contained in the ecoregion polygon
                    overlay_df[j, i+1] = ArchGDAL.contains(geom, points[j]) 
                end
            end
        end
    end
    return overlay_df
end

df_archgdal = overlay_archgdal(eco_sf, lon, lat)

# Benchmark using Benchmark.jl
@btime overlay_archgdal(eco_sf, lon, lat) 
@btime overlay_gmt(geo_gmt, lon, lat)
@btime overlay_meshes(points, geo_table)
@btime overlay_rasters(points, geoms)

I would consider switching the order of loops so that you don’t have to do getfeature and getgeom on each feature length(points) times.

Regarding the Meshes.jl code, avoid using geo_table.geometry because that access is not type stable. If you care about performance you can use:

geotable = GeoTables.load("foo.shp")

# convert Tuple to Meshes.Point
ps = Point.(points)

# get geometries from table using a type stable method
gs = domain(geotable)

# incidence matrix
[p ∈ g for p in ps, g in gs]

inpolygon in Rasters.jl is really a multi-point method, slow for single points. Its just for convenience that it even works for one point. It’s fast for many points by sorting the points and doing them all together.

But you’re looping and calling it for each point. You can instead just pass in the whole geometry as-is, and get back a vector of Bool.

1 Like