Spatial join with dataframes

That’s a very nice example! Might I suggest to use GeoInterface for most of these operations? In that way, this would also work on Shapefile geometries, etc.

GeoInterface.x(point)  # to get the x coordinate, which is a shortcut for:
GeoInterface.getcoord(point, 1)  # for x, or without the 1 for a generator

GeoInterface.extent(geom)  # Gives you an Extents.Extent object, which is a NamedTuple
Extent(X = (1.89038215436677e6, 1.89038215436677e6), Y = (5.50127974222017e6, 5.50127974222017e6))

Well you’re reading about the cutting edge right here. :wink: So not very far I’m afraid, but given that we have great Table like packages, geometry packages and spatial index packages, it seems we should be pretty close, we just need to find a way for them to work together.

2 Likes

Yes sure. In my idea the linked scripts were potentially usable as a predicate from flexijoins.
In this sense I was addressing @aplavin , who asked for the use of efficient spatial search algorithms to be employed within flexijoins
In particular, in my quick reading, this part seemed more interesting to me, which is perhaps better than O(n^2).

β€œGMT.jl can do this pretty quickly. When one reads a shape file (or other OGR formats) we get a vector of GMTdataset that has, among others, a boundingbox field for each polygon. So we can do the quick scan through to see if points are, or not, inside each of the polygon’s BB and if yes than use the gmtselect module that gets us the point-in-polygon answer.”

No, and I’m not sure if this argument is really needed - so maybe it’s just a temporary workaround, in the future it won’t be required.

The ideal way would be to utilize columns that are AcceleratedArrays automatically, for the operations they support. At first, it could require manual mode selection, as you say.

Currently, reusing data preparation (like sort/tree building/hashing) in multiple joins with the same dataset is nontrivial and undocumented - through possible, and I use this functionality from time to time. AcceleratedArrays can make this cleaner and more straightforward (also more explicit?).

GeometryOps.jl offers this in combination with FlexiJoins today!

5 Likes

You the man, love the 1 year later followup (seriously)

Hello @asinghvi17!

I can confirm that this satisfies what was probably my motivating example with

julia> @time joined_df = FlexiJoins.innerjoin(
           (df_sites, df_links),
           by_pred(:geometry, GO.within, :geometry)
       )
  0.815245 seconds (3.20 M allocations: 244.915 MiB, 8.85% gc time)
4472Γ—4 DataFrame
  Row β”‚ geometry            site_ref  geometry_1            link_id 
      β”‚ IGeometr…           String    IGeometry…            String  
──────┼─────────────────────────────────────────────────────────────
    1 β”‚ Geometry: wkbPoint  02900020  Geometry: wkbPolygon  10062
...

and with flying colours!

Where inputs are

julia> df_sites
2042Γ—2 DataFrame
  Row β”‚ geometry            site_ref 
      β”‚ IGeometr…           String   
──────┼──────────────────────────────
    1 β”‚ Geometry: wkbPoint  01000007
...

and

julia> df_links
438406Γ—2 DataFrame
    Row β”‚ geometry              link_id 
        β”‚ IGeometr…             String  
────────┼───────────────────────────────
      1 β”‚ Geometry: wkbPolygon  1
...

constituting 438406 buffered linestrings (thanks GeoInterface).

Imports:

import GeoInterface as GI, GeometryOps as GO, GeoFormatTypes as GFT
using FlexiJoins, DataFrames, GeoDataFrames
using ArchGDAL

Thanks @asinghvi17 !
And thanks also @evetion, @aplavin, and others!

1 Like

This can also now also be achieved with geojoin in GeoStats.jl.

e.g.

# df is a dataframe containing X and Y coordinates on the British National Grid (EPSG27700)
# georef converts these coords to a geometry
# geojoin joins two geometries based on the given predicate 
geotable1 = georef(df, ("X", "Y"); crs=EPSG{27700})
geotable2 = GeoIO.load("MyShapefile.shp")
newtable  = geojoin(geotable1, geotable2, kind=:left, pred=((g1, g2) -> intersects(g1, g2)))
3 Likes