Loopless calculations

As an Ecologist, I have a hard time wrapping my head around some of the more esoteric mathematical voo-doo that Julia does, and by that I mean pretty much anything more complicated than x+y=z. With that said, here’s what I want to do.

I set up a DataFrame:
df=DataFrame(x1 = [0.0028,0.0136,0.0310,0.0342,0.0466], x2 =[0.0009,0.0092,0.0255,0.0525,0.0813], x3 =[0.0089,0.0299,0.0413,0.0773,0.1147])

Now what I would like to do is to have Julia do this calculation:
t=sum(x.*y)/sqrt(sum(x.^2)*sum(y.^2))

on the columns in the DataFrame in the following way

first calculation is
x=df.x1
y=df.x2

Next calculation is
x=df.x1
y=df.x3

Next calculation is
x=df.x2
y=df.x3

Note: this is a minimal set I have actually 6 columns (x1:x6 in this example) and want to continue the pattern of for all 6 columns.

I can do this using a series of for loops but it keeps bugging me somewhere in the dark moldy recesses of my brain (and there are a lot of those) that there was a way to do this without the loops. Any ideas?

Thank you
Mike

Change your t to a function:

t(x,y)=sum(x.*y)/sqrt(sum(x.^2)*sum(y.^2))

and then you could use Combinatorics.jl to iterate over the combinations of your column names, and apply the t function. Something like:

results = Dict( (x,y) => t(df[!,x],df[!,y)) for (x,y) in combinations(names(df),2))

This is untested and likely suboptimal but should get you started. Caveat emptor.

And for the record, loops can be clearer than esoteric voodoo. If you write clear loops, you may thank yourself later.

6 Likes

Here’s an option:

using DataFrames
using Combinatorics

df=DataFrame(
    x1 = rand(5),
    x2 = rand(5),
    x3 = rand(5),
    x4 = rand(5),
    x5 = rand(5),
    x6 = rand(5)
)

t(x,y) = sum(x .* y) / sqrt(sum(x.^2) * sum(y.^2))

combos = combinations([:x1, :x2, :x3, :x4, :x5, :x6], 2)
	
cos_similarities = [t(df[!, c[1]], df[!, c[2]]) for c in combos]

# Result:

15-element Vector{Float64}:
 0.5039552010150775
 0.9043989038369669
 0.8212958683924352
 0.7143122907695525
 0.7599172627439368
 0.5850956757745355
 0.6471346657792967
 0.5647398171102852
 0.7604339270540581
 0.8210921981937371
 0.9221248721177893
 0.8657435871629356
 0.6768400199441867
 0.6873007136174918
 0.9183546953980369

Or, if you want a matrix:

cos_sim_matrix = [t(df[!, i], df[!, j]) for i in 1:ncol(df), j in 1:ncol(df)]

6×6 Matrix{Float64}:
 1.0       0.503955  0.904399  0.821296  0.714312  0.759917
 0.503955  1.0       0.585096  0.647135  0.56474   0.760434
 0.904399  0.585096  1.0       0.821092  0.922125  0.865744
 0.821296  0.647135  0.821092  1.0       0.67684   0.687301
 0.714312  0.56474   0.922125  0.67684   1.0       0.918355
 0.759917  0.760434  0.865744  0.687301  0.918355  1.0

Isn’t what you want just the correlation matrix (without subtracting the mean)?

2 Likes

A simple version with loops that provides a nice vector of tuples output:

[(i, j, t(df[!,i], df[!,j])) for i in 1:ncol(df) for j in (i+1):ncol(df)]
4 Likes

I don’t really know. I find a lot of my bottlenecks stem from not having the language to communicate effectively enough to even do a search online.

I understand that loops can be clearer but in all honesty there is more than one learning outcome I’m trying to accomplish besides getting the answer. I am trying to learn some of that Julian Voodoo and therefore make future code better and faster. Loops would have got it done but as I like to call it it is the Excel (as in spreadsheet) way of approaching the problem. Do each step of solving the equation in one cell then arrive at an answer after several columns and sheets making it very loopy and at times very confusing.

Of course, which is why I showed some voodoo. On the flip side, some concise construct may be quicker to comprehend than parsing and digesting several lines of loop code. There’s always tradeoffs.

For a matrix result, you can also use StatsBase.pairwise:

using StatsBase
pairwise(t, eachcol(df); symmetric = true)

passing symmetric = true avoid recalculating each pair in the opposite order, since your function t is symmetric.
Or to get tuples of column names and t values:

pairwise(propertynames(df); symmetric = true) do n1, n2
    (n1, n2, t(df[!,n1], df[!,n2]))
end
2 Likes

I know you said you don’t like math magic voodoo, but

t(x,y)=sum(x.*y)/sqrt(sum(x.^2)*sum(y.^2))

can be simplified (and substantially sped up) with

using LinearAlgebra
t(x,y)=dot(x, y)/sqrt(dot(x, x)*dot(y, y))

by using the dot function from the LinearAlgebra standard library, which computes the scalar product of two vectors, operation which is highly optimised by BLAS libraries.

4 Likes

To all of you who responded, I thank you very much for the knowledge. I lave learned much and plan to implement in my next fun Julian adventure. The Julia community is very gracious and generous with their knowledge. I knew Julia was the right community for me.

2 Likes

It’s not that I don’t like math magic voodoo, my brain isn’t wired like that. I am an evolutionary ecologist and my math interest consists of

Male + Female = Offspring.

Its how the offspring is different from mom and dad and what did the environment have to do with it that my mind ponders. This is a bit different that math.

All that said, I will try your solution and hopefully learn something from it I can use later on.

1 Like

Better to use dot(x, y) / (norm(x) * norm(y)), which avoids spurious overflow/underflow if x or y have norms that are bigger/smaller than about 10^{\pm 150} (though it is slower – trading off speed for safety). (If you don’t care about this, sum(abs2, x) is likely to be faster than dot(x,x).)

3 Likes