Distances and Similitude

YummyPampers2 · December 3, 2021, 12:33pm

Hello Colleagues:

I am attempting to understand the concepts
behind some of the Distances.jl functions, so I
will start with a basic DataFrame as an example:

using DataFrames

# All columns need the same length (15 rows here).
DF = DataFrame(ID = 1:15, M1 = rand(0:5:100, 15), M2 = rand(0:5:100, 15),
               M3 = rand(0:5:100, 15), Out1 = 1:10:141, Out2 = 1:10:141,
               Out3 = 1:10:141)

Treat the [“Out1”, “Out2”, “Out3”] columns as the comparison benchmark/standard.
For example, at index position (row) 2: Out1 = 11, Out2 = 11, Out3 = 11 – the
corresponding [“M1”, “M2”, “M3”] values at index position 2: M1 = 15, M2 = 40,
M3 = 25.

How might I estimate which ID value and accompanying [“M1”, “M2”, “M3”] has the
least and most variance/similitude/distance (not sure if these terms can be used interchangeably) to the [“Out1”, “Out2”, “Out3”] columns?

YummyPampers2 · December 3, 2021, 12:35pm

Background: Distances · Julia Packages

juliohm · December 3, 2021, 12:54pm

You may be interested in

https://github.com/JuliaML/TableDistances.jl

You can select the columns of the data frame and call pairwise to produce an nrow × nrow distance matrix. The main advantage here is that the package will choose an appropriate distance depending on the scientific type of each column, so it works with categorical data, compositional data, etc.

YummyPampers2 · December 4, 2021, 6:40am

@juliohm Thank you Julio.

In the resultant matrix from
TableDistances.jl, the value
of index (1,1) of “0.0” is the
distance to what reference
point?

Also, if I were to use ncol,
how might that influence
the resultant matrix?

Thank you,

juliohm · December 4, 2021, 10:14am

The pairwise distance matrix has entries i, j with distances between row i and row j of the table.
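
As a minimal sketch of what such a matrix contains (written with only the LinearAlgebra standard library rather than TableDistances.jl, and with a made-up 3-row matrix X), entry (1,1) is the distance from row 1 to itself, which is why the diagonal is 0.0:

```julia
using LinearAlgebra  # standard library, provides norm

# Hypothetical 3-row table stored as a matrix; rows 1 and 3 are identical.
X = [1.0 2.0;
     4.0 6.0;
     1.0 2.0]

n = size(X, 1)

# D[i, j] is the Euclidean distance between row i and row j.
D = [norm(X[i, :] - X[j, :]) for i in 1:n, j in 1:n]

# The diagonal compares each row with itself, so it is all zeros,
# and D[1, 3] is also zero because rows 1 and 3 hold the same values.
```

The matrix is symmetric, so D[i, j] and D[j, i] carry the same information.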

Don’t understand your question about ncol.

YummyPampers2 · December 4, 2021, 6:27pm

@juliohm Thanks for your reply.

Let me clarify. The instructions
at the link you provided say:

compute the pairwise distance between rows

Is there a way to compute the pairwise distances
between columns or a specified subset of columns?

In our example here,

Row 1: M1(value) … M2 (value) … M3 (value) … “distance to” … Out1(val) … Out2(val) … Out3(val)
Row 2: M1(value) … M2 (value) … M3 (value) … “distance to” … Out1(val) … Out2(val) … Out3(val)

The goal is then to find the row that has the lowest/highest cumulative distance between M and Out (i.e., M1 to Out1, M2 to Out2, M3 to Out3).
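
The row-wise comparison described above could be sketched in plain Julia as follows. The matrices M and Out are made-up stand-ins for the M1..M3 and Out1..Out3 columns, and "cumulative distance" is assumed here to mean the sum of absolute differences (the city-block distance); other choices are possible.

```julia
# Hypothetical values standing in for the M1..M3 and Out1..Out3 columns.
M   = [15 40 25;
       10 10 10;
       90 80 70]
Out = [11 11 11;
       11 11 11;
       21 21 21]

# Sum of absolute differences per row (M1 vs Out1, M2 vs Out2, M3 vs Out3).
rowdist = vec(sum(abs.(M .- Out), dims=2))

closest  = argmin(rowdist)  # index of the row whose M values are nearest to Out
farthest = argmax(rowdist)  # index of the row whose M values are farthest from Out
```

With a data frame, the two matrices could be obtained with something like Matrix(DF[:, [:M1, :M2, :M3]]) and Matrix(DF[:, [:Out1, :Out2, :Out3]]).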

juliohm · December 5, 2021, 11:33am

If you want to compute distances between columns (assuming they all have the same scientific type), you can convert the data frame with Matrix(df) and call pairwise from Distances.jl directly.
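
A minimal sketch of that suggestion, assuming Distances.jl is installed and using a made-up matrix A in place of Matrix(df):

```julia
using Distances

# Hypothetical homogeneous data: 3 rows, 3 columns of Float64.
# Columns 1 and 3 are identical.
A = [1.0 2.0 1.0;
     2.0 4.0 2.0;
     3.0 6.0 3.0]

# dims=2 treats each column as one observation,
# so the result is an ncol x ncol matrix of column-to-column distances.
Dcols = pairwise(Euclidean(), A, dims=2)

# dims=1 would instead compare rows and return an nrow x nrow matrix.
Drows = pairwise(Euclidean(), A, dims=1)
```

Here Dcols[1, 3] is zero because columns 1 and 3 are identical, and Dcols[1, 2] is the Euclidean norm of the difference between columns 1 and 2.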

YummyPampers2 · December 7, 2021, 7:13pm

@juliohm

I think a visual representation of the data I am using
might provide a better MWE.

From the figure below:

The numbers in the green box represent the reference
row (vector).

The green arrows point to the values within each column that are
closest to the reference value for that column.

The light red arrows point to the values within each column that are
farthest from the reference value for that column.

The data here is homogeneous. Is there a way to display the
distances from the reference row at each row-column position
for each row above the reference row?

YummyPampers2 · December 7, 2021, 8:56pm

@juliohm
I transposed the table above and added some arbitrary labels
such that:

In this case, is there a way to produce a table that shows
the column member (SING, SWE, TAI) that has either the closest
or farthest distance from the reference value (40)? I would like to
do this for each row value in the USA column.

juliohm · December 7, 2021, 9:03pm

@YummyPampers2 you have many different ways to achieve what you want. You just need to choose an algorithm and implement it yourself. Is there any problem with writing for loops and computing distances? Can you explain why you can’t use Distances.jl or TableDistances.jl to achieve your ultimate goal?

YummyPampers2 · December 7, 2021, 9:06pm

@juliohm

I was attempting to implement something
simple, such as:

(-).([1 2 3; 4 5 6; 7 8 9], [1; 2; 3])

The resultant matrix should be okay.
However, I thought there was a more
extensible way to achieve this with
Distances.jl.

Also, when I ran evaluate(dist, x, y),
‘dist’ was not recognized. Do I need to
prefix it, or is dist a placeholder for
something else?

juliohm · December 7, 2021, 9:09pm

This is what Distances.jl does, have you checked the pairwise and colwise functions there?

EDIT:

I noticed that you are computing differences, not distances. Why aren’t you satisfied with the code above using -? It seems completely fine.
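
One detail worth checking in the broadcast expression quoted earlier in the thread is the shape of the reference: [1; 2; 3] is a column vector, so it is subtracted along rows, whereas a reference row needs a 1×3 matrix. A small sketch with the same numbers (the variable names are only illustrative):

```julia
A = [1 2 3;
     4 5 6;
     7 8 9]

# A column vector is broadcast across the columns:
# every entry in row i has ref_col[i] subtracted.
ref_col = [1; 2; 3]
by_row = A .- ref_col        # same as (-).(A, ref_col)

# A 1x3 row matrix is broadcast down the rows:
# every entry in column j has ref_row[j] subtracted.
ref_row = [1 2 3]
by_col = A .- ref_row

# Unsigned differences, if only the size of the gap matters.
gaps = abs.(A .- ref_row)
```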

juliohm · December 7, 2021, 9:12pm

You need to define the dist object; the README is just giving an example. Try dist = Euclidean(), for instance. But I don’t think you need distances after all if all you need is to compute differences between scalars.
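
For reference, a minimal sketch of defining the metric object before calling evaluate, assuming Distances.jl is installed (the vectors x and y are made up):

```julia
using Distances

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

dist = Euclidean()                 # the metric object that the README calls `dist`
d1 = evaluate(dist, x, y)          # sqrt(3^2 + 4^2 + 0^2)
d2 = dist(x, y)                    # metric objects can also be called directly
d3 = evaluate(Cityblock(), x, y)   # sum of absolute differences: 3 + 4 + 0
```

Swapping Euclidean() for another metric type such as Cityblock() changes the distance without changing the rest of the call.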

YummyPampers2 · December 7, 2021, 9:17pm

Okay – I will experiment with the other
methods, but yes, the general goal is
to calculate differences and compare
those differences singularly. Meaning,
for the entire range, I would like to show:

    Least   Most
1:    SIN    TAI
2:    TAI    SWE
3:    SWE    SIN

YummyPampers2 · December 7, 2021, 9:54pm

@juliohm

The limitation of setdiff, dist, etc.
is that they only deal with pairs. I
was looking for a way to find the
differences for more than two variables
and display the result in a table
(preferably a DataFrame).
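
One possible way to build such a Least/Most table with plain broadcasting is sketched below. The figure from the thread is not available, so the values, the labels vector, and the usa reference vector are all hypothetical, and "closest" is assumed to mean the smallest absolute difference.

```julia
# Hypothetical data: one column per label, one row per observation.
labels = ["SIN", "SWE", "TAI"]
vals = [38 55 10;
        50  5 45;
        80 28 40]
usa = [40, 40, 30]   # hypothetical reference value for each row

# Absolute gap between every value and the reference value of its row.
gaps = abs.(vals .- usa)

# For each row, pick the label with the smallest and the largest gap.
least = [labels[argmin(gaps[i, :])] for i in axes(gaps, 1)]
most  = [labels[argmax(gaps[i, :])] for i in axes(gaps, 1)]

# A NamedTuple of columns; it can be passed to DataFrame(...) if DataFrames is loaded.
result = (Least = least, Most = most)
```

Because the gaps are computed for all columns at once, this is not limited to pairs: adding a fourth label only means adding a column to vals and an entry to labels.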
