Spatial join with dataframes

jade_mackay · January 20, 2023, 1:07am

Hello,

Increasingly I find myself needing to join dataframes on Geographic Information System (GIS) objects and it seems no capability exists to do so in the Julia eco-system. Nor does it seem there is much interest (c.f. Perform spatial join with GeoDataFrames) . Using python I could do this

import geopandas as gpd
import pandas as pd

gdfa = gpd.read_file('./a.csv') # polygons
gdfb = gpd.read_file('./b.csv') # points
gdfc = gdfb.sjoin(gdfa, how="left", predicate="within")

Without any specific Julia tooling for this I have tried the following without success

using DataFrames
using ArchGDAL
using CSV

dsa = ArchGDAL.read("./a.csv")
layera = ArchGDAL.getlayer(dsa, 0)
dfa = layera |> DataFrame
rename!(dfa, ""=> :geometry)

dsb = ArchGDAL.read("./b.csv")
layerb = ArchGDAL.getlayer(dsb, 0)
dfb = layerb |> DataFrame
rename!(dfb, ""=> :geometry)

import Base.(==)
import Base.isequal
import Base.isless

function ==(a::ArchGDAL.IGeometry{ArchGDAL.wkbPoint},b::ArchGDAL.IGeometry{ArchGDAL.wkbPolygon})
    ArchGDAL.within(a,b)
end

function ==(a::ArchGDAL.IGeometry{ArchGDAL.wkbPolygon},b::ArchGDAL.IGeometry{ArchGDAL.wkbPoint})
    ArchGDAL.within(b,a)
end

innerjoin(dfa,dfb, on=:geometry, makeunique=true)

Thoughts and advice welcome!

The data used in the examples above are here:

gist.github.com

https://gist.github.com/jademackay/d780816950be775ed7f252bc3010657f

a.csv

WKT,a
"POLYGON ((1890386.16948909 5501282.71815877,1890380.89379144 5501274.90606802,1890378.13924445 5501276.76628158,1890383.41494209 5501284.57837232,1890386.16948909 5501282.71815877))",1
"POLYGON ((1796386.75606753 5560662.12555495,1796394.05307637 5560659.9364523,1796392.63575434 5560655.21204553,1796385.3387455 5560657.40114819,1796386.75606753 5560662.12555495))",2
"POLYGON ((1811431.72747276 5632559.04409465,1811437.18327245 5632566.07190443,1811447.58036167 5632558.00047991,1811442.34321831 5632551.25432914,1811431.72747276 5632559.04409465))",3
"POLYGON ((1776387.29179533 5576785.71175639,1776394.41016537 5576781.76570344,1776390.89812474 5576775.4302576,1776383.7797547 5576779.37631056,1776387.29179533 5576785.71175639))",4
"POLYGON ((1776318.37508329 5576879.26887317,1776321.88071232 5576890.86441537,1776332.20006063 5576887.74461239,1776329.8129423 5576879.84875945,1776332.41306799 5576879.06267494,1776331.10260642 5576874.72807127,1776332.34060996 5576874.35379113,1776331.18938583 5576870.54589591,1776325.93980525 5576872.13297841,1776326.39635005 5576873.64308813,1776320.69962423 5576875.36535408,1776321.58625443 5576878.29805399,1776318.37508329 5576879.26887317))",5
"POLYGON ((1818262.41701443 5544984.55939418,1818267.65014384 5544975.25605301,1818260.72094908 5544971.35838095,1818255.48781967 5544980.66172212,1818262.41701443 5544984.55939418))",6
"POLYGON ((1818162.05551015 5544096.61576715,1818172.35228283 5544078.54734234,1818149.24451137 5544065.50009723,1818139.11918351 5544083.43290681,1818139.01967741 5544083.6091405,1818162.05551015 5544096.61576715))",7
"POLYGON ((1818431.18365221 5544343.38550716,1818435.60531537 5544346.13394509,1818437.9022725 5544342.43862034,1818433.48060934 5544339.69018241,1818431.18365221 5544343.38550716))",8
"POLYGON ((1818162.8216453 5544769.71528392,1818165.70857342 5544778.21110094,1818177.45109838 5544774.22092256,1818174.56417026 5544765.72510553,1818162.8216453 5544769.71528392))",9

b.csv

WKT,b
POINT (1890382.15436677 5501279.74222017),1
POINT (1796389.69591094 5560658.66880024),2
POINT (1811439.67277428 5632558.61973037),3
POINT (1776389.09496004 5576780.571007),4
POINT (1776325.97568834 5576881.08664645),5
POINT (1818261.56898176 5544977.95888756),6
POINT (1818155.67066157 5544081.06338669),7
POINT (1818434.54296235 5544342.91206375),8
POINT (1818170.13637184 5544771.96810324),9

bkamins · January 20, 2023, 3:55am

I would assume that [ANN] FlexiJoins.jl: fresh take on joining datasets by @aplavin should be able to do it.

evetion · January 20, 2023, 9:46am

I fear it’s not yet doable within Julia as I know of no package that allows for completely generic predicates. I couldn’t get FlexiJoins predicates to work with custom methods, and I think both DataFrames as FlexiJoins eventually hash the values to do comparisons, which won’t work on geometries.

For now the workaround is to produce the crossjoin (product of the two tables) and filter them:

using GeoDataFrames
using DataFrames

dfa = GeoDataFrames.read("a.csv")
dfb = GeoDataFrames.read("b.csv")

X = crossjoin(dfa, dfb, makeunique=true)
subset!(X, [:geometry, :geometry_1] => (a, b) -> intersects.(a, b))
9×6 DataFrame
 Row │ geometry              WKT                                a       geometry_1          WKT_1 ⋯
     │ IGeometr…             String                             String  IGeometry…          Strin ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Geometry: wkbPolygon  POLYGON ((1890386.16948909 55012…  1       Geometry: wkbPoint  POINT ⋯
   2 │ Geometry: wkbPolygon  POLYGON ((1796386.75606753 55606…  2       Geometry: wkbPoint  POINT
   3 │ Geometry: wkbPolygon  POLYGON ((1811431.72747276 56325…  3       Geometry: wkbPoint  POINT
   4 │ Geometry: wkbPolygon  POLYGON ((1776387.29179533 55767…  4       Geometry: wkbPoint  POINT
   5 │ Geometry: wkbPolygon  POLYGON ((1776318.37508329 55768…  5       Geometry: wkbPoint  POINT ⋯
   6 │ Geometry: wkbPolygon  POLYGON ((1818262.41701443 55449…  6       Geometry: wkbPoint  POINT
   7 │ Geometry: wkbPolygon  POLYGON ((1818162.05551015 55440…  7       Geometry: wkbPoint  POINT
   8 │ Geometry: wkbPolygon  POLYGON ((1818431.18365221 55443…  8       Geometry: wkbPoint  POINT
   9 │ Geometry: wkbPolygon  POLYGON ((1818162.8216453 554476…  9       Geometry: wkbPoint  POINT ⋯

aplavin · January 20, 2023, 10:24am

FlexiJoins support optimized spatial joins of points - as in “find pairs of points within the specified distance” or “find the closest match in B for each point from A”.
But no optimization for polygon joins is available there, only the naive O(n^2) looping approach. This is potentially in scope for FlexiJoins, so feel free to propose an implementation that uses some optimized polygon lookup library. It shouldn’t be difficult, just isn’t in my own plans.

FlexiJoins support lots of join predicates that don’t do hashing. Optimized polygon joins just aren’t implemented because I never needed them.

rocco_sprmnt21 · January 20, 2023, 10:55am

I don’t know if this will help

evetion · January 20, 2023, 11:41am

How would one execute the naive version with the above example? Can it make use of intersect/within methods of other packages?

That’s only fair, and I wouldn’t expect you to implement spatial geometry operations. There are many types of geometries and spatial predicates, some of which are implemented in other packages. What could be useful is an interface definition for making optimized lookups for joins, so packages can implement their own algorithms for their specific types.

For example for simple lookups in vectors (no joins), there is GitHub - andyferris/AcceleratedArrays.jl: Arrays with acceleration indices, which speeds up things like findall (discussion about it here DataFrames.jl/issues/2381. Similarly, I’ve made GitHub - evetion/GeoAcceleratedArrays.jl: AcceleratedArrays with spatial indexing in the past, which does the same with a spatial index.

evetion · January 20, 2023, 11:47am

PolygonInbounds.jl is a good example of a specific subset of geospatial predicates, written natively in Julia Overall there are more (DE-9IM - Wikipedia) and these are covered by packages like LibGEOS and ArchGDAL (itself linking to GEOS via GDAL) for the package geometry types. GitHub - JuliaGeo/GeoInterface.jl: A Julia Protocol for Geospatial Data provides an interface for such operations.

Note that GEOS has a thing like prepared geometries, which make repeated spatial operations much faster on them.

aplavin · January 20, 2023, 3:15pm

FlexiJoins can do joins with arbitrary predicates by just looping over all pairs:

julia> using FlexiJoins

julia> my_predicate(s, r) = occursin(r, s)

julia> ss = ["abc", "def", "xyz"]
julia> rs = [r"a|b", r".{3}"]

julia> innerjoin((;ss, rs), by_pred(identity, my_predicate, identity), mode=FlexiJoins.Mode.NestedLoop(), loop_over_side=1)
4-element StructArray(view(::Vector{String}, [1, 1, 2, 3]), view(::Vector{Regex}, [1, 2, 2, 2])) with eltype NamedTuple{(:ss, :rs), Tuple{String, Regex}}:
 (ss = "abc", rs = r"a|b")
 (ss = "abc", rs = r".{3}")
 (ss = "def", rs = r".{3}")
 (ss = "xyz", rs = r".{3}")

It’s just a pretty rare [to me] operation, so maybe requires more boilerplate than strictly needed - eg, loop_over_side can be chosen arbitrarily for nested loop joins anyway.

That’s exactly how FlexiJoins is designed. As a similar example, I don’t implement spatial point searches myself, but just utilize NearestNeighbors.jl for distance queries. Basic support for that is just the first few lines in src/nearestneighbors.jl · master · Alexander Plavin / FlexiJoins.jl · GitLab.

This interface isn’t advertised for now, and maybe not the most optimal one - but it works.

Yeah, I’m aware of AcceleratedArrays, but it wasn’t really clear on how to utilize them in FlexiJoins. Maybe it’s possible though!

rocco_sprmnt21 · January 20, 2023, 5:01pm

I’m not sure I understand the remarks in response to my question.
Surely also because I misquoted the reference to the discussion “Efficiently check if points are contained in polygons”.
What I asked to have clarified is if the use of one of the packages mentioned in the post (GMT.jl, rasters.jl, etc.) would not respond to Pavlin’s request

That is if using FlexiJoins with (for example) GMT doesn’t solve your problem.

Dan · January 20, 2023, 5:29pm

Here is an attempt to use FlexiJoins by first finding a bounding box for each geometry. This reduces the quadratic complexity of the query (depending on how annoying the geometries are):

# prep data & environment
using DataFrames, GeoDataFrames, ArchGDAL, FlexiJoins, IntervalSets

dfa = GeoDataFrames.read("a.csv")
dfb = GeoDataFrames.read("b.csv")

getbb(g) = begin
    gg = ArchGDAL.boundingbox(g)
    gg1 = ArchGDAL.getgeom(gg,0)
    x1, y2 = ArchGDAL.getpoint(gg1,0)
    x2, y1 = ArchGDAL.getpoint(gg1,2)
    (x1..x2, y1..y2)
end

getxy(p) = begin
    x,y = ArchGDAL.getpoint(p,0)
    (x, y)
end

# process data
dfa2 = transform(dfa, 
  :geometry => ByRow(getbb) => [:xinterval, :yinterval])
dfb2 = transform(dfb, 
  :geometry => ByRow(getxy) => [:x, :y])

subset!(
  innerjoin(
    (innerjoin((dfb2,dfa2), by_pred(:x, ∈, :xinterval)),dfa2), 
    by_pred(:y, ∈, :yinterval)
  ), [:geometry, :geometry_1] => (a, b) -> intersects.(a, b))

The results are similar to the ones from a crossjoin. The bounding boxes and intervals may be of further use possibly.

aplavin · January 20, 2023, 11:53pm

@Dan nice solution! Should definitely outperform the naive O(n^2) one for large datasets, even with the overhead of two sequential joins.
Btw, materializing x and y coordinates into separate columns isn’t required: you can pass r -> first(r.geometry) instead of :x in by_pred.

The best interface would be innerjoin((L, R), by_pred(:geometry, ∈, :geometry)), and this is potentially possible with FlexiJoins - if someone feels like introducing an optimized implementation for that.

jade_mackay · January 21, 2023, 9:26pm

Thanks everyone for engaging in this thread, it is exemplary of a great community. I am especially grateful to @evetion’s workaround and to have be made aware of @aplavin’s FlexiJoin package which I expect to make use of in many other contexts. I also appreciate @Dan’s clever application of FlexiJoins’s interval predicates to solve the problem.

With regard to the particulars that motivated the post, due to the size of the actual problem (3 million polygons and 1 million points), on my computer(32GB mem) the workaround fails out of hand at the crossjoin with ERROR: OutOfMemoryError(). And the Flexijoin-with-bounding-boxed-polygons steadily increases memory consumption until killed by the OS memory manager. An optimized FlexiJoin implementation with interface like innerjoin((L, R), by_pred(:geometry, ∈, :geometry)), looks a promising way forward.

aplavin · January 21, 2023, 10:00pm

Joining 1 million numbers with 10 million intervals works just fine on my laptop, and doesn’t noticeably increase memory consumption:

julia> xs = rand(10^6);
julia> ys = rand(10^7);
julia> ints = Interval.(ys, ys .+ 1e-7);

julia> @time innerjoin((xs, ints), by_pred(identity, ∈, identity)) |> length
 25.102649 seconds (84 allocations: 27.195 MiB)
1001054

@Dan’s trick only uses FlexiJoins for this kind of interval joins, so should perform similarly.
If you can provide a simple reproducer without fancy types (no ArchGDAL, GeoDataFrames, DataFrames - only arrays, numbers, and intervals), I’ll try to look at it to see what goes wrong.

For example, is it possible that just too many point-bbox pairs match, and the resulting array doesn’t fit into memory?

jade_mackay · January 23, 2023, 6:52am

For example, is it possible that just too many point-bbox pairs match, and the resulting array doesn’t fit into memory?

Yes, I think that’s the situation. The result of the geopandas join has size c.a. 60,000 greater than the 3 million starting. That is, about a 1% increase, which I think situation basically corresponds to

xs = rand(10^6);
ys = rand(10^6);
nts = Interval.(ys, ys .+ 1e-2);
@time innerjoin((xs, ints), by_pred(identity, ∈, identity)) |> length

and which, it seems, doesn’t fit in memory.

aplavin · January 23, 2023, 11:53am

This results in about 10^6 * 10^6 * 10^-2 = 10^10 or 10 billion matches. No surprise, they don’t readily fit into memory!

Dan · January 23, 2023, 12:07pm

When using the BBox to filter the points, the FlexiJoins are nested. So if there are too many records in the first join, we can:

switch the order of X and Y join - in case one axis is a better filter than the other and should go first.
chunk the points into batches, and perform the double filter on chunks - after two joins there may be fewer points left which would fit memory. When we get to chunks of size 1, this is like a big for loop on the points, and if result still doesn’t fit in memory…
Instead of materializing any more big DataFrames, just indices to point and polygon frames can be stored. This should drop the memory requirements to about two Ints per candidate intersection.

aplavin · January 23, 2023, 12:40pm

That’s how FlexiJoins work already: it computes indices, and the output is a view of the input.
DataFrames are special though - they don’t follow the collection interface, and require special handling here. This integration is less tested and maybe makes some unnecessary copies?..

evetion · January 25, 2023, 8:16pm

Great stuff! I didn’t know about these keyword arguments, so the MWE would now be:

using GeoDataFrames
using FlexiJoins

dfa = GeoDataFrames.read("a.csv")
dfb = GeoDataFrames.read("b.csv")

# Either a custom predicate that takes two rows of the DataFrame
my_predicate(s, r) = GeoDataFrames.within(r.geometry, s.geometry)  # reversed
innerjoin((dfa, dfb), by_pred(identity, my_predicate, identity), mode=FlexiJoins.Mode.NestedLoop(), loop_over_side=1)

# Or a version that works directly on the fields
innerjoin((dfa, dfb), by_pred(:geometry, GeoDataFrames.contains, :geometry), mode=FlexiJoins.Mode.NestedLoop(), loop_over_side=1)

Is the loop_over_side=1 documented somewhere? Without it you get an error that there’s no method swap_sides(f).

Ah, I indeed didn’t fully understand your question. I took it as a suggestion for spatial intersections/geometry, and I suggested more. I think these can’t do this spatial join themselves, but they can be a provider for the predicate function in a naive, non-optimized (n^2) way. The only fast way is doing spatial indexing, which is essentially an optimized version of the boundingbox approach by Dan later in this topic.

evetion · January 25, 2023, 8:22pm

That would be great! I guess it could be another Mode? Instead of the predicate getting `element * element in the Nested Mode, it would require a mode that does element * vector. The vector being the AcceleratedArray column in a table.

jar1 · January 25, 2023, 8:31pm

There is a literature on spatial join techniques. I’m not sure how far along Julia’s packages have gotten.

Topic		Replies	Views
Perform spatial join with GeoDataFrames New to Julia dataframes , geo	3	879	January 20, 2023
[ANN] FlexiJoins.jl: fresh take on joining datasets Package Announcements	32	3182	April 3, 2025
Join dataframe and geotable New to Julia geostat	34	726	April 10, 2024
Conditional left join 2 dataframes when none of the columns are common General Usage dataframes , flexijoins	26	1979	May 26, 2022
[ANN] DataFrameIntervals.jl — joins on intervals of time Package Announcements dataframes	40	2283	October 31, 2022

Spatial join with dataframes

Related topics