As an Ecologist, I have a hard time wrapping my head around some of the more esoteric mathematical voo-doo that Julia does, and by that I mean pretty much anything more complicated than x+y=z. With that said, here’s what I want to do.
I set up a DataFrame: df=DataFrame(x1 = [0.0028,0.0136,0.0310,0.0342,0.0466], x2 =[0.0009,0.0092,0.0255,0.0525,0.0813], x3 =[0.0089,0.0299,0.0413,0.0773,0.1147])
Now what I would like to do is to have Julia do this calculation: t=sum(x.*y)/sqrt(sum(x.^2)*sum(y.^2))
on the columns in the DataFrame in the following way
first calculation is
x=df.x1
y=df.x2
Next calculation is
x=df.x1
y=df.x3
Next calculation is
x=df.x2
y=df.x3
Note: this is a minimal set. I actually have 6 columns (x1:x6 in this example) and want to continue the pattern for all 6 columns.
I can do this using a series of for loops but it keeps bugging me somewhere in the dark moldy recesses of my brain (and there are a lot of those) that there was a way to do this without the loops. Any ideas?
I understand that loops can be clearer, but in all honesty there is more than one learning outcome I’m trying to accomplish besides getting the answer. I am trying to learn some of that Julian voodoo and thereby make future code better and faster. Loops would have gotten it done, but that is what I like to call the Excel (as in spreadsheet) way of approaching the problem: do each step of solving the equation in one cell, then arrive at an answer after several columns and sheets, which makes it very loopy and at times very confusing.
Of course, which is why I showed some voodoo. On the flip side, a concise construct may be quicker to comprehend than parsing and digesting several lines of loop code. There are always tradeoffs.
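For illustration, one such concise construct (a sketch, not necessarily the one referred to above) is an array comprehension over the unique column pairs. It assumes the DataFrame and the function t from the question; the names cols and results are made up for this example.

```julia
using DataFrames

df = DataFrame(x1 = [0.0028, 0.0136, 0.0310, 0.0342, 0.0466],
               x2 = [0.0009, 0.0092, 0.0255, 0.0525, 0.0813],
               x3 = [0.0089, 0.0299, 0.0413, 0.0773, 0.1147])

# The calculation from the question, wrapped in a function.
t(x, y) = sum(x .* y) / sqrt(sum(x .^ 2) * sum(y .^ 2))

cols = names(df)  # column names as strings

# Visit every pair (i, j) with i < j, so each pair of columns is used once:
# (x1, x2), (x1, x3), (x2, x3), and so on for wider tables.
results = [(cols[i], cols[j], t(df[!, i], df[!, j]))
           for i in 1:ncol(df) for j in i+1:ncol(df)]
```

Because the ranges are built from ncol(df), the same lines should work unchanged for the 6-column table.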
For a matrix result, you can also use StatsBase.pairwise:
using StatsBase
pairwise(t, eachcol(df); symmetric = true)
Passing symmetric = true avoids recalculating each pair in the opposite order, since your function t is symmetric.
Or to get tuples of column names and t values:
pairwise(propertynames(df); symmetric = true) do n1, n2
(n1, n2, t(df[!,n1], df[!,n2]))
end
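Put together as a self-contained script, the matrix version might look like the sketch below. It assumes the DataFrame and t from the question; the variable name m is only for illustration.

```julia
using DataFrames, StatsBase

df = DataFrame(x1 = [0.0028, 0.0136, 0.0310, 0.0342, 0.0466],
               x2 = [0.0009, 0.0092, 0.0255, 0.0525, 0.0813],
               x3 = [0.0089, 0.0299, 0.0413, 0.0773, 0.1147])

t(x, y) = sum(x .* y) / sqrt(sum(x .^ 2) * sum(y .^ 2))

# m[i, j] holds t applied to column i and column j of df.
# symmetric = true tells pairwise that t(x, y) == t(y, x),
# so only one triangle of the matrix needs to be computed.
m = pairwise(t, eachcol(df); symmetric = true)

# The three values asked for in the question sit above the diagonal:
# m[1, 2] (x1 vs x2), m[1, 3] (x1 vs x3), m[2, 3] (x2 vs x3).
```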
I know you said you don’t like math magic voodoo, but
t(x,y)=sum(x.*y)/sqrt(sum(x.^2)*sum(y.^2))
can be simplified (and substantially sped up) with
using LinearAlgebra
t(x,y)=dot(x, y)/sqrt(dot(x, x)*dot(y, y))
by using the dot function from the LinearAlgebra standard library, which computes the scalar product of two vectors, an operation that is highly optimised by BLAS libraries.
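A quick way to convince yourself that the two definitions agree is to run both on the same columns. This is only a sketch: the names t_broadcast and t_dot are made up here so that both versions can coexist in one session.

```julia
using LinearAlgebra

x = [0.0028, 0.0136, 0.0310, 0.0342, 0.0466]
y = [0.0009, 0.0092, 0.0255, 0.0525, 0.0813]

# Original formulation: broadcasting allocates temporary arrays
# for x .* y, x .^ 2 and y .^ 2 before summing them.
t_broadcast(x, y) = sum(x .* y) / sqrt(sum(x .^ 2) * sum(y .^ 2))

# dot-based formulation: each dot call reduces directly to a scalar.
t_dot(x, y) = dot(x, y) / sqrt(dot(x, x) * dot(y, y))

# Both should agree up to floating-point rounding.
same = t_broadcast(x, y) ≈ t_dot(x, y)
```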
To all of you who responded, I thank you very much for the knowledge. I have learned much and plan to implement it in my next fun Julian adventure. The Julia community is very gracious and generous with their knowledge. I knew Julia was the right community for me.
It’s not that I don’t like math magic voodoo, my brain isn’t wired like that. I am an evolutionary ecologist and my math interest consists of
Male + Female = Offspring.
It’s how the offspring is different from mom and dad, and what the environment had to do with it, that my mind ponders. This is a bit different than math.
All that said, I will try your solution and hopefully learn something from it I can use later on.
Better to use dot(x, y) / (norm(x) * norm(y)), which avoids spurious overflow/underflow if x or y have norms that are bigger/smaller than about 10^{\pm 150} (though it is slower – trading off speed for safety). (If you don’t care about this, sum(abs2, x) is likely to be faster than dot(x,x).)
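To make the overflow point concrete, here is a small sketch with a deliberately extreme vector. The names t_dot, t_norm, t_abs2 and huge are invented for this example, and the data are artificial, not from the question.

```julia
using LinearAlgebra

t_dot(x, y)  = dot(x, y) / sqrt(dot(x, x) * dot(y, y))
t_norm(x, y) = dot(x, y) / (norm(x) * norm(y))
t_abs2(x, y) = dot(x, y) / sqrt(sum(abs2, x) * sum(abs2, y))

# Artificial data: huge is exactly 1e200 times y, so the two vectors
# are parallel and the true value of t is 1.
y    = [1.0, 2.0]
huge = [1e200, 2e200]

# dot(huge, huge) is too large for Float64 and overflows to Inf,
# so the denominator becomes Inf and the ratio collapses to 0.0.
unsafe = t_dot(huge, y)

# norm(huge) stays finite (about 2.2e200), so this version
# still returns a value close to 1.
safe = t_norm(huge, y)

# On ordinary data such as the ecological measurements in the question,
# all three variants agree up to rounding.
x1 = [0.0028, 0.0136, 0.0310, 0.0342, 0.0466]
x2 = [0.0009, 0.0092, 0.0255, 0.0525, 0.0813]
agree = t_dot(x1, x2) ≈ t_norm(x1, x2) ≈ t_abs2(x1, x2)
```

For values like those in the DataFrame above, which are nowhere near these magnitudes, the choice between the variants is mostly about speed.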