Join dataframe and geotable

A very basic question, I’m afraid.
I have a shape file of local authority boundaries downloaded from the Office for National Statistics and GeoIO read this as expected and Makie plots it.

One of the data fields in the shape file is the name of the local authority.
I also have a DataFrame which has a column listing the name of the local authority together with a bunch of data associated with that authority. The dataframe only consists of a subset of the local authorities in the shp file.

I want to join these two datasets on the LA name.

The geojoin function will join based on geometry but not (aparently) on the table data columns. Dataframe style leftjoin didn’t work either.

using GeoStats, GeoIO, CSV, DataFrames
import CairoMakie as mke
fname = "CTYUA_DEC_2023_UK_BFE.shp"
LAs=GeoIO.load(fname)|> Rename("CTYUA23NM" => "newLocalAuthority")
viz(LAs.geometry)
combo=CSV.read("combo.csv", DataFrame)
result = leftjoin(LAs, combo, on=:newLocalAuthority)

As a last resort, I tried Copilot but, as expected, it didn’t work.

result = join(LAs, combo, on=:newLocalAuthority)

How can I join on the table data?

Notice that geojoin operates on two GeoTables and has the same on keyword option provided by DataFrames.jl. You can set the geometric predicate to pred = == to join geometries when they are equal and specify the column with name with on = ["NAME"]:

help> geojoin
geojoin(geotable₁, geotable₂, var₁ => agg₁, ..., varₙ => aggₙ; kind=:left, pred=intersects, on=nothing)


  Joins geotable₁ with geotable₂ using a certain kind of join and predicate function pred that takes two geometries and returns a boolean ((g1, g2) -> g1 ⊆ g2).

  Optionally, add a variable value match to join, in addition to geometric match, by passing a single name or list of variable names (strings or symbols) to the on keyword
  argument. The variable value match use the isequal function.

  Whenever two or more matches are encountered, aggregate varᵢ with aggregation function aggᵢ. If no aggregation function is provided for a variable, then the aggregation
  function will be selected according to the scientific types: mean for continuous and first otherwise.

  Kinds
  =====

    •  :left - Returns all rows of geotable₁ filling entries with missing when there is no match in geotable₂.

    •  :inner - Returns the subset of rows of geotable₁ that has a match in geotable₂.

  Examples
  ≡≡≡≡≡≡≡≡

  geojoin(gtb1, gtb2)
  geojoin(gtb1, gtb2, 1 => mean)
  geojoin(gtb1, gtb2, :a => mean, :b => std)
  geojoin(gtb1, gtb2, "a" => mean, pred=issubset)
  geojoin(gtb1, gtb2, on=:a)
  geojoin(gtb1, gtb2, kind=:inner, on=["a", "b"])

If you don’t have two GeoTables, then you have two options:

  1. georef the table to a domain, to get two geotables
  2. extract the values(geotable) to operate on two tables

If you can share some data, we can try to help. But please try to read the GDSJL book before asking the question: Geospatial Data Science with Julia

I also experience frequent interoperability issues between the competing packages within the JuliaGeo ecosystem. I’ll leave that statement there for now because I don’t want to derail your thread, but I’ll probably write a new post about this as soon as I can put together my thoughts.

To get back on topic: GeoDataFrames is an alternative to GeoTables that is completely compatible with standard DataFrames, so the ordinary join methods always work:

using GeoDataFrames
LAs = GeoDataFrames.read(fname)
GeoDataFrames.rename!(LAs, "CTYUA23NM" => "newLocalAuthority")
result = leftjoin(LAs, combo, on=:newLocalAuthority)

But note that the geometry columns use different types in GeoDataFrames and GeoTables, so things like viz() won’t work out of the box. See comment above.

As far as I remember, GeoDataFrames.jl doesn’t provide geojoin, it is a normal join between the attributes that are not geometric. That is why it can mix with raw DataFrame.

Maintainers can jump in here if that has changed over the years.

I think you’re correct about that, but a standard non-geo table join seems to be what OP wants here.

In that case the approach I called (2) above is what the OP wants.

I’ve read the book and I find it very approachable as a complete novice.

The shape file I’m using is here.

The DataFrame is read from a CSV file that looks like this:

CA	newLocalAuthority	Total_Awarded	v4_0
Manchester Combined Authority	Oldham	68778222	242,072
Manchester Combined Authority	Stockport	73434137	295,243
Manchester Combined Authority	Wigan	94681945	329,759
Birminghan Combined Authority	Birmingham	1061913344	1,142,494
Birminghan Combined Authority	Coventry	196517288	343,320
Birminghan Combined Authority	Wolverhampton	129180179	264,036
Birminghan Combined Authority	Sandwell	196948660	341,729
Birminghan Combined Authority	Dudley	121778262	323,581
Birminghan Combined Authority	Walsall	118240689	284,306
Birminghan Combined Authority	Solihull	68466418	216,666
Liverpool Combined Authority	Liverpool	503145752	484,488
Liverpool Combined Authority	Wirral	89813139	320,600
Liverpool Combined Authority	Sefton	92179019	279,692
Liverpool Combined Authority	Halton	48831486	128,577

(This is just a subset, but it’s not big.)

The shape file contains the name of each LA, too, and so I want to join on this column. My plan is to generate some chloropleth (is that the right name for these things) maps of the data in the DF using the geometry in the shape file. In your book you have a map of Brazil like I want to produce.

Thanks,
Tim

1 Like

Thanks for reading the book, it is much easier to discuss the topics when we are talking the same language. The approach (2) may work for you:

using GeoStats
using GeoIO
using DataFrames
using CSV

geotable = GeoIO.load("file.shp")
table1 = values(geotable) |> DataFrame
table2 = CSV.File("file.csv") |> DataFrame

# join of normal tables from DataFrames.jl
table = leftjoin(table1, table2)

# georef the result back to the original domain
gt = georef(table, geotable.geometry)

I’ve tried this and superficially it works. However, I am uncertain about how the georef joins the table back with the geometry.

My dataframe only contains data for a few of the major metropolitan areas in England. The table produced by the leftjoin is still accurate, matching as I would expect.

However, after recombining the data and the geometry and mapping using one of the DF columns to define colour, I get a map that is of very different areas than I’m expecting. Has the georef randomised the match between the table and the geometry?

table1 = values(LAs) |> DataFrame
combo.col .= Int.(trunc.(combo.Total_Awarded./1000000)) # I'll think about a better way to define colour later!
table = leftjoin(table1, combo, on=:newLocalAuthority, matchmissing=:equal) # DF only contains a subset of data in the shp file.
gt = georef(table, LAs.geometry) # No explicit order or match key here?
viz(gt.geometry, color = gt.col)

This produces:


However, the coloured areas are not Manchester, Liverpool, Birmingham, London, Newcastle. Instead, they are a random set of other local authority areas.

Can you please double check if leftjoin from DataFrames.jl preserves the order of rows in the first table?

It does not.

Before leftjoin:

After leftjoin:

I guess I can easily add an index to the data table before joining and then re-sort afterwards.

Yes, or sort the tables before and after the join by all columns.

So, manually restoring the sort order yields the desired result:

I’d have thought this would be a fairly common workflow. It would surely be convenient to be able easily to join a DataFrame to a geotable using one of the data fields.

Thank you for your help!

Tim

1 Like

We will consider adding a normal join as well, wrapping an external package. I know of FlexiJoins.jl and many others to perform joins between normal tables. Given that a GeoTable is also a table, they should work fine.

1 Like

You can control this via the order keyword argument

 order : if :undefined (the default) the order of rows in the
       result is undefined and may change in future releases. If
       :left then the order of rows from the left data frame is
       retained. If :right then the order of rows from the right
       data frame is retained (non-matching rows are put at the
       end).
2 Likes

Thank you! Yes, this saves the superfluous steps!

Actually, this doesn’t work in the general case because

non-matching rows are put at the end

It is necessary for the row order to match exactly for georef to put the data back together with the geometry. I’m sticking with adding an index column to the left DF and then sorting again following the join.

1 Like

Opened an issue: Add `join` for non-geospatial join · Issue #412 · JuliaEarth/GeoStats.jl · GitHub

1 Like

Let me show an alternative with GMT.jl that does not involve joins.

I know about the parallel topic going under Add a colorbar to a viz in GeoStats but given it’s more specific title I don’t want to hijack it.

For this example I downloaded a similar shape file an renamed it .shp.zip to easy up the reading. Other variables are coming from that other thread.

using GMT

# Load the shape file
LAs = gmtread("CTYUA_DEC_2023_UK_BFE.shp.zip");

# Codes for Londonf Boroughs
LADcode=["E09000033", "E09000009", "E09000021", "E09000019", "E09000024", "E09000012", "E09000007", "E09000032", "E09000010", "E09000028", "E09000001", "E09000022", "E09000017", "E09000030", "E09000020", "E09000011", "E09000014", "E09000027", "E09000018", "E09000015", "E09000023", "E09000013", "E09000025", "E09000031", "E09000008", "E09000005", "E09000026", "E09000003", "E09000006", "E09000002", "E09000016", "E09000029", "E09000004"];

# Some other data for the choropleth
Total_Awarded=[1551248258, 110166558, 38033540, 865003501, 46621989, 395602425, 915273792, 197296888, 86772623, 573171051, 321575633, 665770100, 39514648, 374049415, 307431197, 812345843, 181261693, 176262413, 131028168, 48236040, 173721339, 257659057, 246970981, 95473179, 86525712, 304128061, 77681457, 82934564, 88862200, 63604751, 46974062, 42463929, 41183878];

subpop=[205087, 366127, 167845, 216767, 215324, 259956, 210390, 328367, 329601, 306374, 8618, 317498, 304792, 312273, 143940, 289254, 264130, 195232, 287940, 260987, 299810, 183295, 350626, 278050, 390506, 338918, 309836, 388639, 329830, 218534, 262022, 209517, 246543];

# Select only the polygons of the London Boroughs. (Note that in this shapefile the key name is "CTYUA23CD" and not "LADcode")
Dlond = filter(LAs, CTYUA23CD = Tuple(LADcode));

rat = Total_Awarded ./ subpop ./ (sum(Total_Awarded) / sum(subpop));

# Get the extrema of the log10
extrema(log10.(rat))
(-0.9181242361240128, 1.5409978534680409)

# Make a logarithimic colormap with fixed values (not data dependent as requested in that other thread)
C = makecpt(range=(-1,1.6), log=true, cmap="hawaii");

# Show it
viz(Dlond, level=rat, cmap=C, lw=0.1, colorbar=(cmap=C, log=true, frame=:auto))

Thank you for this. I’ve experimented with GMT on my home machine.

Unfortunately, my employer turned down my request for GMT, so I cannot use it for work.

Sorry, don’t understand. If you can use Julia at work and install packages than you can use GMT.jl that installs GMT itself (as an artifact)