I am attempting to understand the concepts
behind some of the Distances.jl modules, so
I will start with a basic DataFrame as an example:
using DataFrames
# All columns need 15 rows: 1:10:100 has only 10 elements, 1:10:141 has 15.
DF = DataFrame(ID = 1:15, M1 = rand(0:5:100, 15), M2 = rand(0:5:100, 15),
M3 = rand(0:5:100, 15), Out1 = 1:10:141, Out2 = 1:10:141,
Out3 = 1:10:141)
Treat the [“Out1”, “Out2”, “Out3”] columns as the comparison benchmark/standard.
For example, at index position (row) 2: Out1 = 11, Out2 = 11, Out3 = 11 – the
corresponding [“M1”, “M2”, “M3”] values at index position 2: M1 = 15, M2 = 40,
M3 = 25.
How might I estimate which ID value and accompanying [“M1”, “M2”, “M3”] has the
least and most variance/similitude/distance (not sure if these terms can be used interchangeably) to the [“Out1”, “Out2”, “Out3”] columns?
You may be interested in TableDistances.jl. You can select the columns of the data frame and call pairwise to produce an nrow x nrow distance matrix. The main advantage here is that the package will choose an appropriate distance depending on the scientific type of the column, so it works with categorical data, compositional data, etc.
@juliohm Thank you Julio.
In the resultant matrix from
TableDistances.jl, the value
at index (1,1) is "0.0". What
reference is that the distance to?
Also, if I were to use ncol,
how might that influence
the resultant matrix?
The pairwise distance matrix has entries i, j with distances between row i and row j of the table.
Don’t understand your question about ncol.
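To make the meaning of the matrix entries concrete, here is a minimal sketch in plain Julia (no packages beyond the standard library). The values are made up, and Euclidean distance is only an assumption; TableDistances.jl may pick a different distance for your columns. Entry (1,1) is the distance from row 1 to itself, which is why it is 0.0:

```julia
using LinearAlgebra  # standard library, provides norm and diag

# Hypothetical 3-row table stored as a matrix (one row per observation).
X = [15.0 40.0 25.0;
     11.0 11.0 11.0;
     60.0  5.0 90.0]

n = size(X, 1)

# D[i, j] is the Euclidean distance between row i and row j.
D = [norm(X[i, :] - X[j, :]) for i in 1:n, j in 1:n]

# The diagonal compares each row with itself, so it is all zeros.
println(diag(D))
```

The matrix is symmetric because the distance from row i to row j equals the distance from row j to row i.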
@juliohm Thanks for your reply.
Let me clarify. The instructions
at the link you provided say:
"compute the pairwise distance between rows".
Is there a way to compute the pairwise distances
between columns or a specified subset of columns?
In our example here,
Row 1: M1(value) … M2 (value) … M3 (value) … “distance to” … Out1(val) … Out2(val) … Out3(val)
Row 2: M1(value) … M2 (value) … M3 (value) … “distance to” … Out1(val) … Out2(val) … Out3(val)
Then finding the Row that has the lowest/highest cumulative distances between M and Out (i.e., M1 to Out1, M2 to Out2)
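The row-by-row comparison described above can be sketched in plain Julia. The numbers are made up for illustration (only the second row reuses the values quoted earlier in the thread), and Euclidean distance is just one possible choice of measure:

```julia
# Hypothetical values: one row per ID, columns are (M1, M2, M3) and (Out1, Out2, Out3).
M   = [ 5.0 10.0  0.0;
       15.0 40.0 25.0;
       20.0 25.0 20.0]
Out = [ 1.0  1.0  1.0;
       11.0 11.0 11.0;
       21.0 21.0 21.0]

# Euclidean distance between the M and Out values, one distance per row.
d = vec(sqrt.(sum(abs2, M .- Out, dims=2)))

closest  = argmin(d)  # row whose M values are nearest to its Out values
furthest = argmax(d)  # row whose M values are farthest from its Out values

println("closest row: ", closest, ", furthest row: ", furthest)
```

With a data frame, the two matrices could be built with Matrix(DF[:, [:M1, :M2, :M3]]) and Matrix(DF[:, [:Out1, :Out2, :Out3]]), and the resulting row index mapped back to the ID column.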
If you want to compute distances between columns (assuming they all have the same scientific type), you can convert the table with Matrix(df) and call pairwise from Distances.jl directly.
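A minimal sketch of that suggestion, assuming Distances.jl is installed and every column is numeric. The matrix values are hypothetical stand-ins for Matrix(df), and Euclidean() is an arbitrary choice of metric; dims=2 asks pairwise to treat each column as one observation:

```julia
using Distances

# Hypothetical numeric table already converted to a matrix, e.g. X = Matrix(df).
X = [15.0 40.0 25.0;
     30.0 10.0 55.0;
     70.0 85.0  0.0]

# dims=2 treats each column as one observation, so D is ncol x ncol.
D = pairwise(Euclidean(), X, dims=2)

println(size(D))
```

Passing dims=1 instead would compare rows, which gives an nrow x nrow matrix.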
I think a visual representation of the data I am using
might provide a better MWE.
From the figure below:
The numbers in the green box represent the reference.
The green arrows point to the values within each column that are
closest to the reference column value.
The light red arrows point to the values within each column that are
furthest from the reference column value.
The data here is homogeneous. Is there a way to display the
distances from the reference row at each row-column position
for each row above the reference row?
I transposed the table above and added some arbitrary labels.
In this case, is there a way to visualize/depict a table that shows
the column member (SING, SWE, TAI) that has either the closest
or furthest distance from the reference value (40)? I would like to
do this for each row value in the USA column.
@YummyPampers2 you have many different ways to achieve what you want. You just need to choose an algorithm and implement it yourself. Is there any problem with writing for loops and computing distances? Can you explain why you can’t use Distances.jl or TableDistances.jl to achieve your ultimate goal?
I was attempting to implement something
simple such as:
(-).([1 2 3; 4 5 6; 7 8 9], [1; 2; 3])
The resultant matrix should be okay.
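For reference, here is what that broadcast computes. The column vector [1; 2; 3] is stretched across the columns, so element i is subtracted from every entry of row i; subtracting one reference value per column instead would need a 1x3 row vector. The variable names are only for illustration:

```julia
A = [1 2 3; 4 5 6; 7 8 9]
v = [1; 2; 3]          # column vector: one value per row of A

# Same as (-).(A, v): subtracts v[i] from every entry in row i.
by_row = A .- v        # [0 1 2; 2 3 4; 4 5 6]

# A 1x3 row vector subtracts one reference value per column instead.
by_col = A .- [1 2 3]  # [0 0 0; 3 3 3; 6 6 6]

println(by_row)
println(by_col)
```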
However, I thought there was a more
extensible way to achieve this with
Also, when I ran evaluate(dist, x, y),
‘dist’ is not recognized. Do I need to
prefix it, or is dist a representation for
This is what Distances.jl does. Have you checked the colwise functions there?
I noticed that you are computing differences, not distances. Why aren't you satisfied with the code above using -? It seems completely fine.
You need to define the dist object; the README is just giving an example. Try dist = Euclidean(), for example. But I don't think you need distances after all if all you need is to compute differences between scalars.
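A short sketch of both points, assuming Distances.jl is installed. The vectors reuse the row 2 values quoted at the top of the thread; the second column in X and Y is made up. Here dist is just a variable holding a metric object, and colwise compares matching columns of two matrices:

```julia
using Distances

# `dist` is a placeholder name in the README; any metric object works here.
dist = Euclidean()

x = [15.0, 40.0, 25.0]  # M1, M2, M3 at row 2
y = [11.0, 11.0, 11.0]  # Out1, Out2, Out3 at row 2

d = evaluate(dist, x, y)  # same value as calling dist(x, y)

# colwise pairs column k of X with column k of Y, so each table row
# has to be stored as a column here.
X = [15.0  5.0;
     40.0 10.0;
     25.0  0.0]
Y = [11.0 1.0;
     11.0 1.0;
     11.0 1.0]
r = colwise(dist, X, Y)  # one distance per column pair

println(d)
println(r)
```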
Okay – I will experiment with the other
methods, but yes, the general goal is
to calculate differences and compare
those differences singularly. Meaning,
for the entire range, I would like to show:
1: SIN TAI
2: TAI SWE
3: SWE SIN
The limitation with setdiff, dist, etc.
is that they only deal with pairs. I
was looking for a way to find the
differences for more than two variables
and display the result in a table.
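One possible way to get that kind of listing in plain Julia is to take absolute differences against the reference column and pick the column label with the smallest and largest difference in each row. This is only a sketch: the numbers are invented (apart from the reference value 40), the labels follow the listing above, and it assumes ties can be broken by taking the first match:

```julia
# Hypothetical data: one reference column (USA) and three columns to compare against it.
labels = ["SIN", "SWE", "TAI"]
usa    = [40, 35, 50]
others = [38 60 10;   # columns follow the order of `labels`
          30 90 36;
          20 49 70]

# Absolute difference between each value and the reference value in the same row.
absdiff = abs.(others .- usa)

# For each row, the label of the column with the smallest and largest difference.
closest  = [labels[argmin(row)] for row in eachrow(absdiff)]
furthest = [labels[argmax(row)] for row in eachrow(absdiff)]

# Print one line per row: row number, closest label, furthest label.
for i in eachindex(usa)
    println(i, ": ", closest[i], " ", furthest[i])
end
```

Because the comparison is done per row across any number of columns, adding a fourth or fifth column only requires extending labels and others.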