Hey Geo-computing experts. I’m working on a project involving migration of people between locations. We started with German data, where the coordinates were already taken care of for me, and now I’ve got data from the US Census on US counties. It includes the latitude and longitude of each county’s centroid. I’m interested in the contiguous 48 states, so I don’t need Alaska-to-Florida or Hawaii-to-Michigan type calculations… How can I transform this into a coordinate system where I can compute the Euclidean distance between (x,y) pairs and get a proper distance (say, in km) accurate to within a couple percent?

It’s clear to me that geodesy is extremely powerful and can do all kinds of things, but this is about the simplest thing I can think of, and I just don’t have the background to know the easy path… I’m fine with accuracy to a couple percent; I’m not laying out bridge construction surveys that need to be accurate to 1 cm in 10 km or whatever.

If you mean “what is a Cartesian/Euclidean 2D map where the distance between each pair of points is the same as on a sphere (or a part of it)”, then that does sound simple indeed, but no such map exists. That’s why all maps (or “projections”) of Earth look distorted somewhere.

If the region of the Earth is small, one can definitely find a projection where the maximal discrepancy between the (x,y) distance and the true distance on the sphere is smaller than X%, where X depends on how large the region is and which projection was chosen. E.g. the Robinson projection has low distortion at the equator. If you rotate the projection so that the US is at the “center”, it will have low distortion there, and you can estimate the maximal error by taking the latitude/longitude range of the US and comparing the “true” distances to the computed ones.

But if you don’t have to, I would avoid projecting anything altogether. As @rafael.guerra mentioned, there are formulas to compute the distance between points on a sphere if you already have latitude and longitude. The result is still approximate, of course, because Earth is not a perfect sphere.
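For reference, the great-circle formula is short enough to inline. A self-contained sketch (the function name and the 6371 km mean radius are my choices, not from any particular package):

```julia
# Great-circle distance via the haversine formula, assuming a spherical
# Earth with mean radius 6371 km. Since Earth's flattening is only about
# 0.3%, this is well within "a couple percent" accuracy.
function haversine_km(lat1, lon1, lat2, lon2; R = 6371.0)
    φ1, φ2 = deg2rad(lat1), deg2rad(lat2)
    Δφ = deg2rad(lat2 - lat1)
    Δλ = deg2rad(lon2 - lon1)
    a = sin(Δφ / 2)^2 + cos(φ1) * cos(φ2) * sin(Δλ / 2)^2
    return 2R * asin(min(1.0, sqrt(a)))   # clamp guards against roundoff
end

# Example: Los Angeles to New York, roughly 3940 km.
haversine_km(34.05, -118.24, 40.71, -74.01)
```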

Yes, I’m aware we can’t project the whole earth to a flat map, but I had hoped to project the whole of the continental US to a flat map without too much distortion.

There are two reasons for this. The first is to calculate distances… and there the haversine formula would be good, so maybe that’s fine. The second is to fit a polynomial to the (x,y) coordinates and define a kind of “desirability index” as a function of two coordinates. Now, I could use lat/long, but it seemed likely that the surface would have an unequal “resolving power” for desirability differences between nearby counties depending on the region the counties are in. Maybe my better solution is to expand the surface in terms of lat/long using a different orthogonal polynomial. I’m not looking to cover the entire surface of the earth, so spherical harmonics seemed like overkill. My idea was to use ApproxFun’s Chebyshev polynomials on a window of (x,y) range.
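For what I have in mind, evaluating a tensor-product Chebyshev surface is simple enough to sketch by hand. This is a hand-rolled illustration, not ApproxFun’s API, and the window limits roughly bracketing the lower 48 are placeholders:

```julia
# Evaluate P(x, y) = Σᵢⱼ C[i, j] Tᵢ₋₁(u) Tⱼ₋₁(v), where (u, v) rescale
# (x, y) from a bounding window to [-1, 1]. C is a hypothetical
# coefficient matrix, e.g. proposed by an MCMC sampler.
chebvals(t, n) = [cos((k - 1) * acos(clamp(t, -1.0, 1.0))) for k in 1:n]

function chebsurface(C::AbstractMatrix, x, y;
                     xlims = (-125.0, -66.0),   # rough lon range of CONUS
                     ylims = (24.0, 50.0))      # rough lat range of CONUS
    u = 2(x - xlims[1]) / (xlims[2] - xlims[1]) - 1
    v = 2(y - ylims[1]) / (ylims[2] - ylims[1]) - 1
    return chebvals(u, size(C, 1))' * C * chebvals(v, size(C, 2))
end
```

The same function works whether (x, y) are lat/long or projected coordinates; only the window limits change.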

@dlakelan the simplest thing you can do to start experimenting with the data is the following:

using GeoStats
using GeoIO
geotable = GeoIO.load("data.format")
projected = geotable |> Proj(utmnorth(10)) # zones 10 to 19 cover the US

You can then visualize the flat map with:

import GLMakie as Mke
projected |> viewer

And explore the contents of the geometry column to compute distances with any method of your choice. As people pointed out, Haversine from Distances.jl is one option when you have lat/lon coordinates. There is also Geodesics.jl for geodesics, though I haven’t tested it.

Note:

Geodesy.jl is not actively developed (>2 years without a non-trivial commit) and lags far behind in features compared to more recent efforts we are actively working on, i.e. CoordRefSystems.jl

The county data isn’t in any kind of special geo-data format; it’s just a delimited text file with things like postal code, ANSI code, name, areas, etc. I’m basically starting from a DataFrame with lat and long as columns.

I haven’t read GDSJL but it sounds cool. I’m looking to fit the polynomial myself as part of a large Bayesian model in Turing (and get the posterior distribution of the coefficients / desirabilities). I’ll look at what kind of thing GeoStats does under the hood. Thanks for the hints.

I didn’t see this function anywhere in the GeoStats.jl repository (I searched the GeoStats.jl GitHub for “trend”, and after using GeoStats there was no GeoStats.trend function). Is it in a separate package? Or something being developed and not available yet?

Interesting. The quantity I’m interested in is an unobserved parameter, so it will take a different approach I guess (obviously I can’t put that thing in a table of observations). This stuff looks really great, but as I was reading the code I noticed that you build a Vandermonde matrix and then solve by naive least squares via the normal equations, which I think may be numerically unstable; there are special numerical methods for handling this problem. Just something I noticed that might warrant a look. It’s also possible that Julia is dispatching to some special algorithm because F may have a special type, I suppose.

F = polymat(xs, degree) ## this seems to be a Vandermonde matrix
# eqs 25 and 26 in Menafoglio, A., Secchi, P. 2013.
ms = map(vars) do var
z = Tables.getcolumn(cols, var)
a = (F'F \ F') * z
F * a
end
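To illustrate the concern: the normal equations square the condition number (cond(F’F) ≈ cond(F)²), while Julia’s backslash on a rectangular matrix already does a QR-based least-squares solve. A hedged sketch with a made-up design matrix:

```julia
using LinearAlgebra

# Hypothetical Vandermonde-style design matrix and observations,
# standing in for the F and z in the quoted GeoStats code.
xs = range(-1, 1, length = 50)
F = [x^k for x in xs, k in 0:5]
z = sin.(3 .* xs) .+ 0.1 .* cos.(7 .* xs)

a_normal = (F'F) \ (F'z)   # normal equations: cond(F'F) ≈ cond(F)^2
a_qr     = F \ z           # rectangular backslash: pivoted QR, more stable
```

On a well-conditioned problem like this the two agree closely; the difference shows up as the degree grows and cond(F) blows up.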

Yes, you will probably need to adjust it to your needs and data structures. The methods we provide are for GeoTable. If you come up with a more stable method, please feel free to contribute back. Hope you can find the solution you need somewhere.

Yeah, this is reaching way back to classes I took in the ’90s, so I’m not sure I’m the right one to come up with it. But the usual method, I think, is to use orthogonal polynomials, which is what R does when you regress y ~ poly(x, n) in the lm function. The columns of these matrices are orthogonal, and the condition number can be dramatically better. If GLM.jl offered this, it’d be an easy way to do the calculation, but unfortunately I don’t think they have an orthogonal polynomial basis function yet.

The linked reference describes how you can basically construct such a basis by a recursive formula. The best plan would be to get that into GLM.jl and then use GLM for the regression.
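Not the recursive construction, but QR of the centered Vandermonde matrix yields the same orthogonal basis on the sample points, which is essentially what R’s poly does under the hood as far as I recall. A sketch (function name is my own):

```julia
using LinearAlgebra

# Orthonormal polynomial basis evaluated at the sample points x,
# up to the given degree: take the thin QR factorization of the
# centered Vandermonde matrix. The columns of Q span the same
# polynomial space but are orthonormal, so regression in this basis
# is perfectly conditioned.
function orthopoly(x::AbstractVector, degree::Int)
    xc = x .- sum(x) / length(x)               # center to reduce blowup
    V = [xi^k for xi in xc, k in 0:degree]     # plain monomial basis
    return Matrix(qr(V).Q)[:, 1:(degree + 1)]  # thin Q: orthonormal columns
end
```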

Anyway this is a diversion. Thanks for the hints I’ll come back and let people know how I finally made it work.

Given a set of coefficients (chosen by an MCMC sampler) I want to calculate the value of the implied 2D polynomial surface at each of the centroids of the US counties. I want to do this in a stable way. My plan was to use ApproxFun either directly on lat/long or on some other, more “Euclidean” coordinates.

I also want to get distances between centroids for all pairs in some set (i.e. in the dataset)

I wondered if there was some set of coordinates I could transform to so that I can both get distances via norm(x-y) and fit the polynomials on coordinates that act more “flat” with respect to variation in the polynomial’s input.

So I either want a suggestion for an alternative set of coordinates to lat/long that behave more Euclidean-ly over the 48 contiguous states, or I’ll use the haversine formula for distances and just fit the polynomial on lat/long directly.
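For concreteness, one classical choice for the lower 48 is a conic projection with standard parallels inside the country. A spherical Lambert conformal conic sketch; the parameters are my own assumption, chosen in the spirit of common CONUS grids, and for production work a proper library (e.g. the Proj machinery in GeoStats) is the right tool:

```julia
# Spherical Lambert conformal conic, standard parallels 33°N and 45°N,
# origin 39°N 96°W. Returns (x, y) in km. Scale error stays within a
# few percent over the contiguous US; e.g. the straight-line LA-NYC
# distance comes out within about 1% of the great-circle value.
function lcc_km(lat, lon; φ1 = 33.0, φ2 = 45.0, φ0 = 39.0, λ0 = -96.0, R = 6371.0)
    t(φ) = tan(π / 4 + deg2rad(φ) / 2)
    n = log(cosd(φ1) / cosd(φ2)) / log(t(φ2) / t(φ1))  # cone constant
    F = cosd(φ1) * t(φ1)^n / n
    ρ  = R * F / t(lat)^n
    ρ0 = R * F / t(φ0)^n
    θ = n * deg2rad(lon - λ0)
    return (ρ * sin(θ), ρ0 - ρ * cos(θ))
end
```

After projecting the centroids once, norm(x-y) gives km directly, and the polynomial surface can be fit on the same (x, y).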

In this case the goal is not to have data we know and then find the coefficients that best fit it. It’s the exact opposite: to have coefficients we know and find the value of the function at each point (the value of the latent variables).

It seems that all you really need is a package for fitting polynomials? As soon as you “project” the data into a flat space, you can operate on it as if everything were Cartesian for your purposes.

Right, the polynomial part is the part I understand, ApproxFun provides great objects that represent functions of two variables via Chebyshev polynomials.

The part I don’t understand is what’s a good coordinate system to replace lat/lon for the purposes of the 48 states? Or maybe I just in the end use lat/lon.

An example of why a Euclidean coordinate system might be better: I might, for example, want the model to find points “within 100 km” of a point of interest and then average certain things over those other counties, or to compute the difference between P(x, y) and P(x2, y2), where (x2, y2) is a point offset a certain distance in a certain direction.
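To make the “within 100 km” case concrete, assuming the centroids have already been projected once into km-scaled planar coordinates (the coordinates below are hypothetical):

```julia
using LinearAlgebra

# With centroids precomputed into km-scaled planar coordinates
# (one column of `xy` per county), "counties within r km of p" is
# just a norm test in the hot loop: no trig, trivially differentiable.
within_radius(xy::AbstractMatrix, p::AbstractVector, r) =
    [j for j in 1:size(xy, 2) if norm(view(xy, :, j) .- p) <= r]

# Three hypothetical centroids (km), one per column:
xy = [0.0 80.0 150.0;
      0.0 60.0 0.0]
within_radius(xy, [0.0, 0.0], 100.0)   # → [1, 2] (third point is 150 km away)
```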

My biggest concern is that these calculations will be in the “hot loop”, so to speak. Imagine computing some number for each of, say, 250,000 observations on each log-likelihood evaluation, and needing maybe a million log-likelihood evaluations with derivatives for sampling, or maybe 1000 evaluations for an optimization run (preferably with derivatives).

Doing 10^11 intense coordinate transforms is going to be slow, and messy for derivatives, if it works at all. On the other hand, adding two numerical vectors [x,y] + [dx,dy] in an already-Euclideanized coordinate system is simple and easily differentiated.

I suspect many people are more concerned with the geospatial accuracy and so forth of measured data points. I’m more concerned with, say, averaging the value of F(x,y) in a neighborhood of (x,y) where the neighborhood has a certain physical size (in, say, km); accuracy is relatively less important, but speed is almost everything.

Any suggestions for a coordinate scheme for the continental US?