I have a working understanding of the principles of distances between datasets at a fairly high level (i.e. perhaps not all the detailed maths). I want to calculate the Mahalanobis distance between a reference dataset and an observations dataset. To understand what I am getting out of the Distances.jl package, I decided to try to recreate this example (originally in Python, but it's a worked example with a result I can compare mine to).
My first issue is that, given the variables are in the columns and the observations are in the rows (as you might expect), attempting to use colwise in Distances.jl results in a dimension-mismatch error, presumably because the reference dataset and the observations dataset (which is just the first 500 rows in this case) have different numbers of rows.
So, I tried turning the data around with permutedims and using pairwise instead. This works but now gives me an enormous 500×53940 table with 0.0 down the diagonal and nothing that matches the results I see in the above link.
The code I have is as follows (variable names match the linked example):
using DataFrames, CSV, Distances, Statistics
df = DataFrame(CSV.File("A:/diamonds.csv"))
df_x = Array(select(first(df, 500), [:carat, :depth, :price]))  # first 500 rows, three variables
data = Array(select(df, [:carat, :depth, :price]))              # the full dataset
Q = cov(data)                                                   # covariance matrix of the full dataset
colwise(Mahalanobis(Q), df_x, data)  # ERROR: DimensionMismatch("Incorrect vector dimensions.")
df_xp = permutedims(df_x)
datap = permutedims(data)
pairwise(Mahalanobis(Q), df_xp, datap)  # massive, incomprehensible array
I clearly don't understand something here and would very much appreciate help. Given the diamonds CSV in the above link, how would I compute the Mahalanobis distance between the first 500 rows and the dataset as a whole (which is what they are doing in the link), and how do I interpret the result? Why can't I do this column-wise (which seems to be the logical thing to do)?
Many thanks!
What size array did you expect to get? The result does not surprise me if you have 500 entries in one dataset and 54k in the other. The reason it's zeros on the diagonal is that the same 500 points appear in both datasets.
Entry [i, j] in the distance matrix will be the distance between datapoint i in the first set and datapoint j in the second.
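To make that concrete, here is a minimal sketch with small made-up random matrices (the names and sizes are purely illustrative). Distances.jl treats columns as observations (dims=2 in recent versions), which is also why your colwise call hit a dimension mismatch with 500 rows on one side and 53,940 on the other:
using Distances
x = rand(3, 5)  # 5 observations (columns), 3 variables each
y = rand(3, 8)  # 8 observations
D = pairwise(Euclidean(), x, y, dims=2)  # 5×8 matrix; D[i, j] is the distance between x[:, i] and y[:, j]
z = rand(3, 5)  # same number of observations as x
d = colwise(Euclidean(), x, z)  # 5-element vector; d[i] is the distance between x[:, i] and z[:, i]
So colwise pairs observations one-to-one, while pairwise computes every combination; neither collapses a whole dataset down to one value per observation.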
Thanks for the reply. I was expecting a distance between each variable (column) and the envelope of the reference set, which would be 500 × 3 in this case, rather than a distance between every single observation and each member of the reference set, since that's the point of the Mahalanobis distance.
I figured that the zeros on the diagonal were there for the reason you say, and that makes perfect sense, but what I was hoping to understand was why the Python example gets a single value (that appears to come from the diagonal) for the distance of each observation to the reference set. I have seen other examples worked in Excel and elsewhere that also return a single value for Mahalanobis distance, so I presumed there was a method to 'collapse' all the distances to a single value (root mean square of a row of distances, perhaps; is that legitimate?).
EDIT: I implemented the Python example exactly as it is in the link and I get a 500 × 500 array out, from which they are taking the diagonal.
EDIT 2: It turns out that there is an alternative version of the Mahalanobis formula which measures the distance from each observation to the central mean, and that appears to be what the Python version is doing judging by the code. This explains the dramatically different results, but it raises the question of how we can call both of these things "Mahalanobis distance".
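In case it is useful to anyone else, that point-to-mean version is d(x) = sqrt((x − μ)' Σ⁻¹ (x − μ)), with μ and Σ estimated from the reference data. A minimal sketch of my understanding in plain Julia (mahalanobis_to_mean is my own name, not anything from Distances.jl, and some implementations, the linked one included if I read it right, report the squared distance without the sqrt):
using Statistics, LinearAlgebra
# Distance of each row of obs to the mean of ref: sqrt((x - μ)' Σ⁻¹ (x - μ))
function mahalanobis_to_mean(obs, ref)
    μ = mean(ref, dims=1)    # 1×p row of column means
    Σinv = inv(cov(ref))     # p×p inverse covariance of the reference set
    diffs = obs .- μ         # centre every observation on the reference mean
    return [sqrt(diffs[i, :]' * Σinv * diffs[i, :]) for i in 1:size(obs, 1)]
end
With the thread's variables that would be mahalanobis_to_mean(df_x, data), giving one value per observation, which is the shape of result the Python example ends up with.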
@Jasper_Hall I think you should first be clear in your mind about what you are expecting; otherwise it does not matter what you enter. Entry [i, j] of the matrix is a distance in this space.
Thanks for the comments, folks. In the end I decided to ditch the black box of Distances.jl and port the code in the example to Julia. The only gotcha in porting is that more transformations of the matrices seem to be needed in Python than are required in Julia, especially when deriving the covariance matrix. Otherwise it is very straightforward, gives me what I expect, and I can compare my results to a number of other implementations too, so I know it is working correctly.
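For completeness, here is roughly what the port looks like; this is a sketch from memory rather than the exact code, but it follows the linked Python example step for step (including its 500×500 intermediate, whose diagonal holds the squared distances):
using DataFrames, CSV, Statistics, LinearAlgebra
df = DataFrame(CSV.File("A:/diamonds.csv"))
data = Matrix(select(df, [:carat, :depth, :price]))  # full dataset, rows are observations
df_x = data[1:500, :]                                # first 500 rows
x_minus_mu = df_x .- mean(data, dims=1)              # centre on the full dataset's mean
inv_covmat = inv(cov(data))                          # cov needs no transposing here, unlike numpy's rowvar juggling
mahal = x_minus_mu * inv_covmat * x_minus_mu'        # 500×500, as in the Python example
d2 = diag(mahal)                                     # one squared Mahalanobis distance per observation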