Spatial join with dataframes

I don’t know if this will help


How would one execute the naive version with the above example? Can it make use of intersect/within methods of other packages?

That’s only fair, and I wouldn’t expect you to implement spatial geometry operations. There are many types of geometries and spatial predicates, some of which are implemented in other packages. What could be useful is an interface definition for making optimized lookups for joins, so packages can implement their own algorithms for their specific types.

For example, for simple lookups in vectors (no joins), there is GitHub - andyferris/AcceleratedArrays.jl: Arrays with acceleration indices, which speeds up things like findall (discussed in DataFrames.jl issue #2381). Similarly, I’ve made GitHub - evetion/GeoAcceleratedArrays.jl: AcceleratedArrays with spatial indexing in the past, which does the same with a spatial index.
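For a quick feel of what AcceleratedArrays does (a minimal sketch following its README; `HashIndex` is one of the index types that package provides):

```julia
using AcceleratedArrays

# Attach a hash index to an ordinary vector. The data is unchanged,
# but element lookups can use the index instead of a linear scan.
a = accelerate([3, 1, 2, 1], HashIndex)

# findall dispatches to the index-accelerated method
idx = findall(isequal(1), a)
```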

PolygonInbounds.jl is a good example of a specific subset of geospatial predicates, written natively in Julia. Overall there are more (DE-9IM - Wikipedia), and these are covered by packages like LibGEOS and ArchGDAL (itself linking to GEOS via GDAL) for the package geometry types. GitHub - JuliaGeo/GeoInterface.jl: A Julia Protocol for Geospatial Data provides an interface for such operations.

Note that GEOS has prepared geometries, which make repeated spatial operations on them much faster.
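A minimal sketch of the idea, assuming LibGEOS.jl exposes `prepareGeom` and a `contains` method accepting the prepared geometry (names should be checked against the package):

```julia
using LibGEOS

# Assumed API: prepareGeom wraps a geometry in a GEOS prepared geometry
poly = readgeom("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))")
prep = prepareGeom(poly)  # pay the preparation cost once

# repeated predicate calls against the same polygon are now much cheaper
pts = [Point(10rand(), 10rand()) for _ in 1:10_000]
hits = count(p -> contains(prep, p), pts)
```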

FlexiJoins can do joins with arbitrary predicates by just looping over all pairs:

julia> using FlexiJoins

julia> my_predicate(s, r) = occursin(r, s)

julia> ss = ["abc", "def", "xyz"]
julia> rs = [r"a|b", r".{3}"]

julia> innerjoin((;ss, rs), by_pred(identity, my_predicate, identity), mode=FlexiJoins.Mode.NestedLoop(), loop_over_side=1)
4-element StructArray(view(::Vector{String}, [1, 1, 2, 3]), view(::Vector{Regex}, [1, 2, 2, 2])) with eltype NamedTuple{(:ss, :rs), Tuple{String, Regex}}:
 (ss = "abc", rs = r"a|b")
 (ss = "abc", rs = r".{3}")
 (ss = "def", rs = r".{3}")
 (ss = "xyz", rs = r".{3}")

It’s just a pretty rare operation [for me], so it may require more boilerplate than strictly needed - e.g., loop_over_side could be chosen arbitrarily for nested loop joins anyway.

That’s exactly how FlexiJoins is designed. As a similar example, I don’t implement spatial point searches myself, but just utilize NearestNeighbors.jl for distance queries. Basic support for that is just the first few lines in src/nearestneighbors.jl · master · Alexander Plavin / FlexiJoins.jl · GitLab.
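For illustration, a distance join along those lines (a sketch assuming FlexiJoins’ `by_distance` condition together with a Distances.jl metric):

```julia
using FlexiJoins, StaticArrays, Distances

as = [SVector(0.0, 0.0), SVector(1.0, 1.0)]
bs = [SVector(0.05, 0.0), SVector(5.0, 5.0)]

# pairs within Euclidean distance 0.1; backed by a NearestNeighbors.jl tree
J = innerjoin((as, bs), by_distance(identity, Euclidean(), <=(0.1)))
```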

This interface isn’t advertised for now, and is maybe not the most optimal one - but it works.

Yeah, I’m aware of AcceleratedArrays, but it wasn’t really clear on how to utilize them in FlexiJoins. Maybe it’s possible though!

I’m not sure I understand the remarks in response to my question.
Surely in part because I misquoted the reference to the discussion “Efficiently check if points are contained in polygons”.
What I asked to have clarified is whether using one of the packages mentioned in that post (GMT.jl, Rasters.jl, etc.) would not answer @aplavin’s request.

That is, whether using FlexiJoins with (for example) GMT doesn’t solve your problem.

Here is an attempt to use FlexiJoins by first finding a bounding box for each geometry. This reduces the quadratic complexity of the query (depending on how annoying the geometries are):

# prep data & environment
using DataFrames, GeoDataFrames, ArchGDAL, FlexiJoins, IntervalSets

dfa = GeoDataFrames.read("a.csv")
dfb = GeoDataFrames.read("b.csv")

getbb(g) = begin
    gg = ArchGDAL.boundingbox(g)
    gg1 = ArchGDAL.getgeom(gg,0)
    x1, y2 = ArchGDAL.getpoint(gg1,0)
    x2, y1 = ArchGDAL.getpoint(gg1,2)
    (x1..x2, y1..y2)
end

getxy(p) = begin
    x,y = ArchGDAL.getpoint(p,0)
    (x, y)
end
# process data
dfa2 = transform(dfa, 
  :geometry => ByRow(getbb) => [:xinterval, :yinterval])
dfb2 = transform(dfb, 
  :geometry => ByRow(getxy) => [:x, :y])

subset!(
  innerjoin(
    (innerjoin((dfb2,dfa2), by_pred(:x, ∈, :xinterval)),dfa2), 
    by_pred(:y, ∈, :yinterval)
  ), [:geometry, :geometry_1] => (a, b) -> intersects.(a, b))

The results are similar to the ones from a crossjoin. The bounding boxes and intervals may possibly be of further use.


@Dan nice solution! Should definitely outperform the naive O(n^2) one for large datasets, even with the overhead of two sequential joins.
Btw, materializing x and y coordinates into separate columns isn’t required: you can pass r -> first(r.geometry) instead of :x in by_pred.
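A small self-contained illustration of that with named-tuple rows (the `geom`/`xint` field names are made up for the example):

```julia
using FlexiJoins, IntervalSets

pts = [(geom = (0.2, 0.9),), (geom = (0.7, 0.1),)]
bbs = [(xint = 0.0..0.5,), (xint = 0.6..1.0,)]

# extract the x coordinate on the fly instead of materializing an :x column
J = innerjoin((pts, bbs), by_pred(r -> first(r.geom), ∈, r -> r.xint))
```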

The best interface would be innerjoin((L, R), by_pred(:geometry, ∈, :geometry)), and this is potentially possible with FlexiJoins - if someone feels like introducing an optimized implementation for that.


Thanks everyone for engaging in this thread, it is exemplary of a great community. I am especially grateful for @evetion’s workaround and to have been made aware of @aplavin’s FlexiJoins package, which I expect to make use of in many other contexts. I also appreciate @Dan’s clever application of FlexiJoins’s interval predicates to solve the problem.

With regard to the particulars that motivated the post: due to the size of the actual problem (3 million polygons and 1 million points), on my computer (32 GB mem) the workaround fails out of hand at the crossjoin with ERROR: OutOfMemoryError(). And the FlexiJoins-with-bounding-boxed-polygons approach steadily increases memory consumption until killed by the OS memory manager. An optimized FlexiJoins implementation with an interface like innerjoin((L, R), by_pred(:geometry, ∈, :geometry)) looks like a promising way forward.

Joining 1 million numbers with 10 million intervals works just fine on my laptop, and doesn’t noticeably increase memory consumption:

julia> xs = rand(10^6);
julia> ys = rand(10^7);
julia> ints = Interval.(ys, ys .+ 1e-7);

julia> @time innerjoin((xs, ints), by_pred(identity, ∈, identity)) |> length
 25.102649 seconds (84 allocations: 27.195 MiB)
1001054

@Dan’s trick only uses FlexiJoins for this kind of interval joins, so should perform similarly.
If you can provide a simple reproducer without fancy types (no ArchGDAL, GeoDataFrames, DataFrames - only arrays, numbers, and intervals), I’ll try to look at it to see what goes wrong.

For example, is it possible that just too many point-bbox pairs match, and the resulting array doesn’t fit into memory?

For example, is it possible that just too many point-bbox pairs match, and the resulting array doesn’t fit into memory?

Yes, I think that’s the situation. The result of the geopandas join is about 60,000 rows larger than the 3 million I started with. That is, about a 1% increase, which I think basically corresponds to

xs = rand(10^6);
ys = rand(10^6);
ints = Interval.(ys, ys .+ 1e-2);
@time innerjoin((xs, ints), by_pred(identity, ∈, identity)) |> length

and which, it seems, doesn’t fit in memory.

This results in about 10^6 * 10^6 * 10^-2 = 10^10, i.e. 10 billion matches. No surprise they don’t readily fit into memory!
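The back-of-the-envelope estimate, spelled out (pure arithmetic):

```julia
# n points uniform in [0,1] and m intervals of width w:
# each (point, interval) pair matches with probability ≈ w
n, m, w = 10^6, 10^6, 1e-2
expected_matches = n * m * w  # 1.0e10 pairs - far too many to materialize
```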


When using the BBox to filter the points, the FlexiJoins calls are nested. So if there are too many records after the first join, we can:

  1. switch the order of the X and Y joins - in case one axis is a better filter than the other and should go first.

  2. chunk the points into batches, and perform the double filter on each chunk - after the two joins there may be few enough points left to fit in memory. When we get to chunks of size 1, this is like a big for loop over the points, and if the result still doesn’t fit in memory…

  3. Instead of materializing any more big DataFrames, just indices into the point and polygon frames can be stored. This should drop the memory requirement to about two Ints per candidate intersection.
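Idea 2 might look roughly like this on plain vectors (a sketch; `chunked_match_count` and the chunk size are made up for illustration):

```julia
using FlexiJoins, IntervalSets

# Process the points in chunks so that no intermediate join result
# ever holds all candidate pairs at once.
function chunked_match_count(xs, ints; chunksize = 10^4)
    total = 0
    for lo in 1:chunksize:length(xs)
        part = @view xs[lo:min(lo + chunksize - 1, end)]
        J = innerjoin((part, ints), by_pred(identity, ∈, identity))
        total += length(J)  # or refine with the exact predicate and keep only indices
    end
    return total
end
```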


That’s how FlexiJoins works already: it computes indices, and the output is a view of the input.
DataFrames are special though - they don’t follow the collection interface and require special handling here. This integration is less tested and may make some unnecessary copies.

Great stuff! I didn’t know about these keyword arguments, so the MWE would now be:

using GeoDataFrames
using FlexiJoins

dfa = GeoDataFrames.read("a.csv")
dfb = GeoDataFrames.read("b.csv")

# Either a custom predicate that takes two rows of the DataFrame
my_predicate(s, r) = GeoDataFrames.within(r.geometry, s.geometry)  # reversed
innerjoin((dfa, dfb), by_pred(identity, my_predicate, identity), mode=FlexiJoins.Mode.NestedLoop(), loop_over_side=1)

# Or a version that works directly on the fields
innerjoin((dfa, dfb), by_pred(:geometry, GeoDataFrames.contains, :geometry), mode=FlexiJoins.Mode.NestedLoop(), loop_over_side=1)

Is the loop_over_side=1 documented somewhere? Without it you get an error that there’s no method swap_sides(f).

Ah, I indeed didn’t fully understand your question. I took it as a suggestion for spatial intersection/geometry packages, and I suggested more. I think these can’t do the spatial join themselves, but they can provide the predicate function in a naive, non-optimized O(n^2) way. The only fast way is spatial indexing, which is essentially an optimized version of the bounding-box approach by Dan later in this topic.

That would be great! I guess it could be another Mode? Instead of the predicate getting element × element as in the NestedLoop mode, it would require a mode that does element × vector - the vector being the AcceleratedArray column in a table.

There is a literature on spatial join techniques. I’m not sure how far along Julia’s packages have gotten.

That’s a very nice example! Might I suggest to use GeoInterface for most of these operations? In that way, this would also work on Shapefile geometries, etc.

GeoInterface.x(point)            # x coordinate; a shortcut for:
GeoInterface.getcoord(point, 1)  # 1 for x, or omit the index for a generator

GeoInterface.extent(geom)  # gives an Extents.Extent object, which is like a NamedTuple:
# Extent(X = (1.89038215436677e6, 1.89038215436677e6), Y = (5.50127974222017e6, 5.50127974222017e6))
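With those, the earlier getbb/getxy helpers could be written against GeoInterface rather than ArchGDAL (a sketch, assuming the geometries implement `extent` and the coordinate accessors):

```julia
using GeoInterface, IntervalSets

# bounding box as a pair of intervals, for any GeoInterface-compatible geometry
getbb(g) = begin
    ext = GeoInterface.extent(g)  # an Extents.Extent with X and Y bounds
    (ext.X[1]..ext.X[2], ext.Y[1]..ext.Y[2])
end

getxy(p) = (GeoInterface.x(p), GeoInterface.y(p))
```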

Well, you’re reading about the cutting edge right here. :wink: So not very far I’m afraid, but given that we have great Tables-compatible packages, geometry packages and spatial index packages, it seems we should be pretty close - we just need to find a way for them to work together.


Yes, sure. My idea was that the linked scripts could potentially be used as a predicate from FlexiJoins.
In this sense I was addressing @aplavin, who asked for efficient spatial search algorithms to be employed within FlexiJoins.
In particular, on my quick reading, this part seemed the most interesting, and is perhaps better than O(n^2):

“GMT.jl can do this pretty quickly. When one reads a shape file (or other OGR formats) we get a vector of GMTdataset that has, among others, a boundingbox field for each polygon. So we can do the quick scan through to see if points are, or not, inside each of the polygon’s BB and if yes than use the gmtselect module that gets us the point-in-polygon answer.”

No, and I’m not sure this argument is really needed - so maybe it’s just a temporary workaround, and in the future it won’t be required.

The ideal way would be to utilize columns that are AcceleratedArrays automatically, for the operations they support. At first, it could require manual mode selection, as you say.

Currently, reusing data preparation (like sorting/tree building/hashing) across multiple joins with the same dataset is nontrivial and undocumented - though possible, and I use this functionality from time to time. AcceleratedArrays could make this cleaner and more straightforward (also more explicit?).