Distances and Similitude

Hello Colleagues:

I am attempting to understand the concepts
behind some of the Distances.jl modules, so
I will start with a basic DF as an example:

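using DataFrames

# Every column needs the same length; 1:10:100 has 10 elements, so 10 rows are used.
DF = DataFrame(ID = 1:10, M1 = rand(0:5:100, 10), M2 = rand(0:5:100, 10),
               M3 = rand(0:5:100, 10), Out1 = 1:10:100, Out2 = 1:10:100,
               Out3 = 1:10:100)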

Treat the [“Out1”, “Out2”, “Out3”] columns as the comparison benchmark/standard.
For example, at index position (row) 2: Out1 = 11, Out2 = 11, Out3 = 11 – the
corresponding [“M1”, “M2”, “M3”] values at index position 2: M1 = 15, M2 = 40,
M3 = 25.

How might I estimate which ID value and accompanying [“M1”, “M2”, “M3”] has the
least and most variance/similitude/distance (not sure if these terms can be used interchangeably) to the [“Out1”, “Out2”, “Out3”] columns?

Background: Distances · Julia Packages

You may be interested in

https://github.com/JuliaML/TableDistances.jl

You can select the columns of the data frame and call pairwise to produce an nrow x nrow distance matrix. The main advantage here is that the package will choose an appropriate distance depending on the scientific type of the column, so it works with categorical data, compositional data, etc.

@juliohm Thank you Julio.

In the resultant matrix from
TableDistances.jl, the value
“0.0” at index (1,1) is the
distance to what reference
point?

Also, if I were to use ncol,
how might that influence
the resultant matrix?

Thank you,

The pairwise distance matrix has entries i, j with distances between row i and row j of the table.
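
To make the i, j indexing concrete, here is a minimal sketch using pairwise from Distances.jl on a small matrix. The values are made up for illustration and the Euclidean metric is only an assumption; TableDistances.jl may choose a different distance per column. With dims=1 each row is one observation, so entry (1,1) is the distance from row 1 to itself, which is why it is zero.

```julia
using Distances

# Three made-up rows with three numeric columns each (illustrative values only).
X = [15.0 40.0 25.0;
     11.0 11.0 11.0;
     21.0 21.0 21.0]

# dims=1 treats each row as an observation;
# D[i, j] is the distance between row i and row j.
D = pairwise(Euclidean(), X, dims=1)

d_self  = D[1, 1]   # distance from row 1 to itself
d_1_2   = D[1, 2]   # distance from row 1 to row 2
d_2_1   = D[2, 1]   # same pair in the other order
```

The diagonal therefore carries no information about any external reference point; only the off-diagonal entries compare different rows.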

Don’t understand your question about ncol.

@juliohm Thanks for your reply.

Let me clarify, when your instructions
from the link you provided say:

compute the pairwise distance between rows

Is there a way to compute the pairwise distances
between columns or a specified subset of columns?

In our example here,

Row 1: M1(value) … M2 (value) … M3 (value) … “distance to” … Out1(val) … Out2(val) … Out3(val)
Row 2: M1(value) … M2 (value) … M3 (value) … “distance to” … Out1(val) … Out2(val) … Out3(val)

The goal is then to find the row that has the lowest/highest cumulative distance between M and Out (i.e., M1 to Out1, M2 to Out2, and so on).

If you want to compute distance between columns (assuming they all have the same scientific type), you can Matrix(df) and call pairwise from Distances.jl directly.
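
A rough sketch of both ideas, assuming a small made-up table shaped like the one in the question (the values, the Euclidean metric, and the variable names are illustrative assumptions). pairwise with dims=2 compares columns with each other, while colwise on the transposed matrices gives one distance per row between the M values and the Out values.

```julia
using DataFrames, Distances

# Made-up table shaped like the one in the question (values are illustrative).
df = DataFrame(ID = 1:4,
               M1 = [5, 15, 60, 30], M2 = [20, 40, 25, 35], M3 = [10, 25, 90, 30],
               Out1 = 1:10:31, Out2 = 1:10:31, Out3 = 1:10:31)

M   = Float64.(Matrix(df[:, [:M1, :M2, :M3]]))
Out = Float64.(Matrix(df[:, [:Out1, :Out2, :Out3]]))

# Distances between columns: dims=2 treats each column as one observation.
Dcols = pairwise(Euclidean(), M, dims=2)   # 3 x 3: M1, M2, M3 against each other

# Distance from each row's (M1, M2, M3) to the same row's (Out1, Out2, Out3).
# colwise compares matching columns, so both matrices are transposed first.
rowdist = colwise(Euclidean(), permutedims(M), permutedims(Out))

closest  = df.ID[argmin(rowdist)]   # ID whose M values are nearest to its Out values
furthest = df.ID[argmax(rowdist)]   # ID whose M values are furthest from its Out values
```

Swapping Euclidean() for another metric from Distances.jl, such as Cityblock(), changes how the three per-column gaps are combined into one number per row.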

@juliohm

I think a visual representation of the data I am using
might provide a better MWE.

From the figure below:

The numbers in the green box represent the reference
row (vector).

The green arrows point to the value within each column that is
closest to the reference column value.

The light red arrows point to the value within each column that is
furthest from the reference column value.

The data here is homogeneous. Is there a way to display the
distances from the reference row at each row-column position
for each row above the reference row?

@juliohm
I transposed the table above and added some arbitrary labels
such that:

In this case, is there a way to visualize/depict a table that shows
the column member (SING, SWE, TAI) that has either the closest
or furthest distance from the reference value (40)? I would like to
do this for each row value in the USA column.

@YummyPampers2 you have many different ways to achieve what you want. You just need to choose an algorithm and implement it yourself. Is there any problem with writing for loops and computing distances? Can you explain why you can’t use Distances.jl or TableDistances.jl to achieve your ultimate goal?

@juliohm

I was attempting to implement something
simple, such as:

(-).([1 2 3; 4 5 6; 7 8 9], [1; 2; 3])

The resultant matrix should be okay.
However, I thought there was a more
extensible way to achieve this with
Distances.jl.

Also, when I ran evaluate(dist, x, y),
‘dist’ was not recognized. Do I need to
prefix it, or is dist a placeholder for
something else?

This is what Distances.jl does. Have you checked the pairwise and colwise functions there?

EDIT:

I noticed that you are computing differences, not distances. Why aren’t you satisfied with the code above using -? It seems completely fine.

You need to define the dist object, the README is just giving an example, try dist = Euclidean() for example. But I don’t think you need distances after all if all you need is computing differences between scalars.
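
For illustration, a minimal sketch of defining dist before calling evaluate; the vectors x and y are made-up values, not data from the thread. It also shows the contrast between a distance (one number for the whole pair of vectors) and plain element-wise differences, which need only broadcasting.

```julia
using Distances

x = [15.0, 40.0, 25.0]   # made-up "M" values for one row
y = [11.0, 11.0, 11.0]   # made-up "Out" values for the same row

dist = Euclidean()          # the object the README refers to as `dist`
d1 = evaluate(dist, x, y)   # one number summarizing how far x is from y
d2 = dist(x, y)             # the metric object can also be called directly

# Element-wise differences need no package: broadcasting subtraction is enough.
diffs = x .- y
```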

Okay – I will experiment with the other
methods, but yes, the general goal is
to calculate differences and compare
those differences singularly. Meaning,
for the entire range, I would like to show:

    Least   Most
1:    SIN    TAI
2:    TAI    SWE
3:    SWE    SIN

@juliohm

The limitation with setdiff, dist, etc.
is that they only deal with pairs. I
was looking for a way to find the
differences for more than two variables
and display the result in a table
(preferably a DataFrame).
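
One possible way to build such a table, sketched with made-up numbers: USA is assumed to be the reference column, and SIN, SWE, and TAI are the columns compared against it. The values and the names others, absdiff, and result are illustrative assumptions, not data from the thread.

```julia
using DataFrames

# Made-up data: USA is the reference column, the others are compared against it.
df = DataFrame(USA = [40, 40, 40],
               SIN = [38, 55, 10],
               SWE = [45, 20, 41],
               TAI = [70, 42, 35])

others = [:SIN, :SWE, :TAI]

# Absolute difference of every comparison column from the reference, row by row.
# The USA vector broadcasts down the rows, so row i is compared with USA[i].
absdiff = abs.(Matrix(df[:, others]) .- df.USA)

# For each row, pick the column label with the smallest and largest difference.
result = DataFrame(
    Least = [others[argmin(r)] for r in eachrow(absdiff)],
    Most  = [others[argmax(r)] for r in eachrow(absdiff)],
)
```

This handles any number of comparison columns at once, since argmin and argmax scan the whole row; ties resolve to the first matching column.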