Spatial join with dataframes


Increasingly I find myself needing to join dataframes on Geographic Information System (GIS) objects and it seems no capability exists to do so in the Julia eco-system. Nor does it seem there is much interest (c.f. Perform spatial join with GeoDataFrames) . Using python I could do this

import geopandas as gpd
import pandas as pd

gdfa = gpd.read_file('./a.csv') # polygons
gdfb = gpd.read_file('./b.csv') # points
gdfc = gdfb.sjoin(gdfa, how="left", predicate="within")

Without any specific Julia tooling for this I have tried the following without success

using DataFrames
using ArchGDAL
using CSV

dsa ="./a.csv")
layera = ArchGDAL.getlayer(dsa, 0)
dfa = layera |> DataFrame
rename!(dfa, ""=> :geometry)

dsb ="./b.csv")
layerb = ArchGDAL.getlayer(dsb, 0)
dfb = layerb |> DataFrame
rename!(dfb, ""=> :geometry)

import Base.(==)
import Base.isequal
import Base.isless

function ==(a::ArchGDAL.IGeometry{ArchGDAL.wkbPoint},b::ArchGDAL.IGeometry{ArchGDAL.wkbPolygon})

function ==(a::ArchGDAL.IGeometry{ArchGDAL.wkbPolygon},b::ArchGDAL.IGeometry{ArchGDAL.wkbPoint})

innerjoin(dfa,dfb, on=:geometry, makeunique=true)

Thoughts and advice welcome!

The data used in the examples above are here:

I would assume that [ANN] FlexiJoins.jl: fresh take on joining datasets by @aplavin should be able to do it.

1 Like

I fear it’s not yet doable within Julia as I know of no package that allows for completely generic predicates. I couldn’t get FlexiJoins predicates to work with custom methods, and I think both DataFrames as FlexiJoins eventually hash the values to do comparisons, which won’t work on geometries.

For now the workaround is to produce the crossjoin (product of the two tables) and filter them:

using GeoDataFrames
using DataFrames

dfa ="a.csv")
dfb ="b.csv")

X = crossjoin(dfa, dfb, makeunique=true)
subset!(X, [:geometry, :geometry_1] => (a, b) -> intersects.(a, b))
9×6 DataFrame
 Row │ geometry              WKT                                a       geometry_1          WKT_1 ⋯
     │ IGeometr…             String                             String  IGeometry…          Strin ⋯
   1 │ Geometry: wkbPolygon  POLYGON ((1890386.16948909 55012…  1       Geometry: wkbPoint  POINT ⋯
   2 │ Geometry: wkbPolygon  POLYGON ((1796386.75606753 55606…  2       Geometry: wkbPoint  POINT
   3 │ Geometry: wkbPolygon  POLYGON ((1811431.72747276 56325…  3       Geometry: wkbPoint  POINT
   4 │ Geometry: wkbPolygon  POLYGON ((1776387.29179533 55767…  4       Geometry: wkbPoint  POINT
   5 │ Geometry: wkbPolygon  POLYGON ((1776318.37508329 55768…  5       Geometry: wkbPoint  POINT ⋯
   6 │ Geometry: wkbPolygon  POLYGON ((1818262.41701443 55449…  6       Geometry: wkbPoint  POINT
   7 │ Geometry: wkbPolygon  POLYGON ((1818162.05551015 55440…  7       Geometry: wkbPoint  POINT
   8 │ Geometry: wkbPolygon  POLYGON ((1818431.18365221 55443…  8       Geometry: wkbPoint  POINT
   9 │ Geometry: wkbPolygon  POLYGON ((1818162.8216453 554476…  9       Geometry: wkbPoint  POINT ⋯
1 Like

FlexiJoins support optimized spatial joins of points - as in “find pairs of points within the specified distance” or “find the closest match in B for each point from A”.
But no optimization for polygon joins is available there, only the naive O(n^2) looping approach. This is potentially in scope for FlexiJoins, so feel free to propose an implementation that uses some optimized polygon lookup library. It shouldn’t be difficult, just isn’t in my own plans.

FlexiJoins support lots of join predicates that don’t do hashing. Optimized polygon joins just aren’t implemented because I never needed them.

1 Like

I don’t know if this will help

1 Like

How would one execute the naive version with the above example? Can it make use of intersect/within methods of other packages?

That’s only fair, and I wouldn’t expect you to implement spatial geometry operations. There are many types of geometries and spatial predicates, some of which are implemented in other packages. What could be useful is an interface definition for making optimized lookups for joins, so packages can implement their own algorithms for their specific types.

For example for simple lookups in vectors (no joins), there is GitHub - andyferris/AcceleratedArrays.jl: Arrays with acceleration indices, which speeds up things like findall (discussion about it here DataFrames.jl/issues/2381. Similarly, I’ve made GitHub - evetion/GeoAcceleratedArrays.jl: AcceleratedArrays with spatial indexing in the past, which does the same with a spatial index.

PolygonInbounds.jl is a good example of a specific subset of geospatial predicates, written natively in Julia Overall there are more (DE-9IM - Wikipedia) and these are covered by packages like LibGEOS and ArchGDAL (itself linking to GEOS via GDAL) for the package geometry types. GitHub - JuliaGeo/GeoInterface.jl: A Julia Protocol for Geospatial Data provides an interface for such operations.

Note that GEOS has a thing like prepared geometries, which make repeated spatial operations much faster on them.

FlexiJoins can do joins with arbitrary predicates by just looping over all pairs:

julia> using FlexiJoins

julia> my_predicate(s, r) = occursin(r, s)

julia> ss = ["abc", "def", "xyz"]
julia> rs = [r"a|b", r".{3}"]

julia> innerjoin((;ss, rs), by_pred(identity, my_predicate, identity), mode=FlexiJoins.Mode.NestedLoop(), loop_over_side=1)
4-element StructArray(view(::Vector{String}, [1, 1, 2, 3]), view(::Vector{Regex}, [1, 2, 2, 2])) with eltype NamedTuple{(:ss, :rs), Tuple{String, Regex}}:
 (ss = "abc", rs = r"a|b")
 (ss = "abc", rs = r".{3}")
 (ss = "def", rs = r".{3}")
 (ss = "xyz", rs = r".{3}")

It’s just a pretty rare [to me] operation, so maybe requires more boilerplate than strictly needed - eg, loop_over_side can be chosen arbitrarily for nested loop joins anyway.

That’s exactly how FlexiJoins is designed. As a similar example, I don’t implement spatial point searches myself, but just utilize NearestNeighbors.jl for distance queries. Basic support for that is just the first few lines in src/nearestneighbors.jl · master · Alexander Plavin / FlexiJoins.jl · GitLab.

This interface isn’t advertised for now, and maybe not the most optimal one - but it works.

Yeah, I’m aware of AcceleratedArrays, but it wasn’t really clear on how to utilize them in FlexiJoins. Maybe it’s possible though!

I’m not sure I understand the remarks in response to my question.
Surely also because I misquoted the reference to the discussion “Efficiently check if points are contained in polygons”.
What I asked to have clarified is if the use of one of the packages mentioned in the post (GMT.jl, rasters.jl, etc.) would not respond to Pavlin’s request

That is if using FlexiJoins with (for example) GMT doesn’t solve your problem.

Here is an attempt to use FlexiJoins by first finding a bounding box for each geometry. This reduces the quadratic complexity of the query (depending on how annoying the geometries are):

# prep data & environment
using DataFrames, GeoDataFrames, ArchGDAL, FlexiJoins, IntervalSets

dfa ="a.csv")
dfb ="b.csv")

getbb(g) = begin
    gg = ArchGDAL.boundingbox(g)
    gg1 = ArchGDAL.getgeom(gg,0)
    x1, y2 = ArchGDAL.getpoint(gg1,0)
    x2, y1 = ArchGDAL.getpoint(gg1,2)
    (x1..x2, y1..y2)

getxy(p) = begin
    x,y = ArchGDAL.getpoint(p,0)
    (x, y)
# process data
dfa2 = transform(dfa, 
  :geometry => ByRow(getbb) => [:xinterval, :yinterval])
dfb2 = transform(dfb, 
  :geometry => ByRow(getxy) => [:x, :y])

    (innerjoin((dfb2,dfa2), by_pred(:x, ∈, :xinterval)),dfa2), 
    by_pred(:y, ∈, :yinterval)
  ), [:geometry, :geometry_1] => (a, b) -> intersects.(a, b))

The results are similar to the ones from a crossjoin. The bounding boxes and intervals may be of further use possibly.


@Dan nice solution! Should definitely outperform the naive O(n^2) one for large datasets, even with the overhead of two sequential joins.
Btw, materializing x and y coordinates into separate columns isn’t required: you can pass r -> first(r.geometry) instead of :x in by_pred.

The best interface would be innerjoin((L, R), by_pred(:geometry, ∈, :geometry)), and this is potentially possible with FlexiJoins - if someone feels like introducing an optimized implementation for that.

1 Like

Thanks everyone for engaging in this thread, it is exemplary of a great community. I am especially grateful to @evetion’s workaround and to have be made aware of @aplavin’s FlexiJoin package which I expect to make use of in many other contexts. I also appreciate @Dan’s clever application of FlexiJoins’s interval predicates to solve the problem.

With regard to the particulars that motivated the post, due to the size of the actual problem (3 million polygons and 1 million points), on my computer(32GB mem) the workaround fails out of hand at the crossjoin with ERROR: OutOfMemoryError(). And the Flexijoin-with-bounding-boxed-polygons steadily increases memory consumption until killed by the OS memory manager. An optimized FlexiJoin implementation with interface like innerjoin((L, R), by_pred(:geometry, ∈, :geometry)), looks a promising way forward.

Joining 1 million numbers with 10 million intervals works just fine on my laptop, and doesn’t noticeably increase memory consumption:

julia> xs = rand(10^6);
julia> ys = rand(10^7);
julia> ints = Interval.(ys, ys .+ 1e-7);

julia> @time innerjoin((xs, ints), by_pred(identity, ∈, identity)) |> length
 25.102649 seconds (84 allocations: 27.195 MiB)

@Dan’s trick only uses FlexiJoins for this kind of interval joins, so should perform similarly.
If you can provide a simple reproducer without fancy types (no ArchGDAL, GeoDataFrames, DataFrames - only arrays, numbers, and intervals), I’ll try to look at it to see what goes wrong.

For example, is it possible that just too many point-bbox pairs match, and the resulting array doesn’t fit into memory?

For example, is it possible that just too many point-bbox pairs match, and the resulting array doesn’t fit into memory?

Yes, I think that’s the situation. The result of the geopandas join has size c.a. 60,000 greater than the 3 million starting. That is, about a 1% increase, which I think situation basically corresponds to

xs = rand(10^6);
ys = rand(10^6);
nts = Interval.(ys, ys .+ 1e-2);
@time innerjoin((xs, ints), by_pred(identity, ∈, identity)) |> length

and which, it seems, doesn’t fit in memory.

This results in about 10^6 * 10^6 * 10^-2 = 10^10 or 10 billion matches. No surprise, they don’t readily fit into memory!

1 Like

When using the BBox to filter the points, the FlexiJoins are nested. So if there are too many records in the first join, we can:

  1. switch the order of X and Y join - in case one axis is a better filter than the other and should go first.

  2. chunk the points into batches, and perform the double filter on chunks - after two joins there may be fewer points left which would fit memory. When we get to chunks of size 1, this is like a big for loop on the points, and if result still doesn’t fit in memory…

  3. Instead of materializing any more big DataFrames, just indices to point and polygon frames can be stored. This should drop the memory requirements to about two Ints per candidate intersection.

1 Like

That’s how FlexiJoins work already: it computes indices, and the output is a view of the input.
DataFrames are special though - they don’t follow the collection interface, and require special handling here. This integration is less tested and maybe makes some unnecessary copies?..

Great stuff! I didn’t know about these keyword arguments, so the MWE would now be:

using GeoDataFrames
using FlexiJoins

dfa ="a.csv")
dfb ="b.csv")

# Either a custom predicate that takes two rows of the DataFrame
my_predicate(s, r) = GeoDataFrames.within(r.geometry, s.geometry)  # reversed
innerjoin((dfa, dfb), by_pred(identity, my_predicate, identity), mode=FlexiJoins.Mode.NestedLoop(), loop_over_side=1)

# Or a version that works directly on the fields
innerjoin((dfa, dfb), by_pred(:geometry, GeoDataFrames.contains, :geometry), mode=FlexiJoins.Mode.NestedLoop(), loop_over_side=1)

Is the loop_over_side=1 documented somewhere? Without it you get an error that there’s no method swap_sides(f).

Ah, I indeed didn’t fully understand your question. I took it as a suggestion for spatial intersections/geometry, and I suggested more. I think these can’t do this spatial join themselves, but they can be a provider for the predicate function in a naive, non-optimized (n^2) way. The only fast way is doing spatial indexing, which is essentially an optimized version of the boundingbox approach by Dan later in this topic.

That would be great! I guess it could be another Mode? Instead of the predicate getting `element * element in the Nested Mode, it would require a mode that does element * vector. The vector being the AcceleratedArray column in a table.

There is a literature on spatial join techniques. I’m not sure how far along Julia’s packages have gotten.