Loopless calculations

SergeantMike · January 10, 2024, 3:49pm

As an Ecologist, I have a hard time wrapping my head around some of the more esoteric mathematical voo-doo that Julia does, and by that I mean pretty much anything more complicated than x+y=z. With that said, here’s what I want to do.

I set up a DataFrame:
df=DataFrame(x1 = [0.0028,0.0136,0.0310,0.0342,0.0466], x2 =[0.0009,0.0092,0.0255,0.0525,0.0813], x3 =[0.0089,0.0299,0.0413,0.0773,0.1147])

Now what I would like to do is to have Julia do this calculation:
t=sum(x.*y)/sqrt(sum(x.^2)*sum(y.^2))

on the columns in the DataFrame in the following way

first calculation is
x=df.x1
y=df.x2

Next calculation is
x=df.x1
y=df.x3

Next calculation is
x=df.x2
y=df.x3

Note: this is a minimal set I have actually 6 columns (x1:x6 in this example) and want to continue the pattern of for all 6 columns.

I can do this using a series of for loops but it keeps bugging me somewhere in the dark moldy recesses of my brain (and there are a lot of those) that there was a way to do this without the loops. Any ideas?

Thank you
Mike

Jeff_Emanuel · January 10, 2024, 3:58pm

Change your t to a function:

t(x,y)=sum(x.*y)/sqrt(sum(x.^2)*sum(y.^2))

and then you could use Combinatorics.jl to iterate over the combinations of your column names, and apply the t function. Something like:

results = Dict( (x,y) => t(df[!,x],df[!,y)) for (x,y) in combinations(names(df),2))

This is untested and likely suboptimal but should get you started. Caveat emptor.

And for the record, loops can be clearer than esoteric voodoo. If you write clear loops, you may thank yourself later.

mthelm85 · January 10, 2024, 4:42pm

Here’s an option:

using DataFrames
using Combinatorics

df=DataFrame(
    x1 = rand(5),
    x2 = rand(5),
    x3 = rand(5),
    x4 = rand(5),
    x5 = rand(5),
    x6 = rand(5)
)

t(x,y) = sum(x .* y) / sqrt(sum(x.^2) * sum(y.^2))

combos = combinations([:x1, :x2, :x3, :x4, :x5, :x6], 2)
	
cos_similarities = [t(df[!, c[1]], df[!, c[2]]) for c in combos]

# Result:

15-element Vector{Float64}:
 0.5039552010150775
 0.9043989038369669
 0.8212958683924352
 0.7143122907695525
 0.7599172627439368
 0.5850956757745355
 0.6471346657792967
 0.5647398171102852
 0.7604339270540581
 0.8210921981937371
 0.9221248721177893
 0.8657435871629356
 0.6768400199441867
 0.6873007136174918
 0.9183546953980369

Or, if you want a matrix:

cos_sim_matrix = [t(df[!, i], df[!, j]) for i in 1:ncol(df), j in 1:ncol(df)]

6×6 Matrix{Float64}:
 1.0       0.503955  0.904399  0.821296  0.714312  0.759917
 0.503955  1.0       0.585096  0.647135  0.56474   0.760434
 0.904399  0.585096  1.0       0.821092  0.922125  0.865744
 0.821296  0.647135  0.821092  1.0       0.67684   0.687301
 0.714312  0.56474   0.922125  0.67684   1.0       0.918355
 0.759917  0.760434  0.865744  0.687301  0.918355  1.0

simsurace · January 10, 2024, 4:45pm

Isn’t what you want just the correlation matrix (without subtracting the mean)?

rafael.guerra · January 10, 2024, 4:58pm

A simple version with loops that provides a nice vector of tuples output:

[(i, j, t(df[!,i], df[!,j])) for i in 1:ncol(df) for j in (i+1):ncol(df)]

SergeantMike67 · January 10, 2024, 5:44pm

I don’t really know. I find a lot of my bottlenecks stem from not having the language to communicate effectively enough to even do a search online.

SergeantMike67 · January 10, 2024, 5:54pm

rafael.guerra:

Jeff_Emanuel:

And for the record, loops can be clearer than esoteric voodoo. If you write clear loops, you may thank yourself later.

A simple version with loops that provides a nice vector of tuples output:
[(i, j, t(df[!,i], df[!,j])) for i in 1:ncol(df) for j in (i+1):ncol(df)]

I understand that loops can be clearer but in all honesty there is more than one learning outcome I’m trying to accomplish besides getting the answer. I am trying to learn some of that Julian Voodoo and therefore make future code better and faster. Loops would have got it done but as I like to call it it is the Excel (as in spreadsheet) way of approaching the problem. Do each step of solving the equation in one cell then arrive at an answer after several columns and sheets making it very loopy and at times very confusing.

Jeff_Emanuel · January 10, 2024, 6:22pm

Of course, which is why I showed some voodoo. On the flip side, some concise construct may be quicker to comprehend than parsing and digesting several lines of loop code. There’s always tradeoffs.

yha · January 10, 2024, 11:19pm

For a matrix result, you can also use StatsBase.pairwise:

using StatsBase
pairwise(t, eachcol(df); symmetric = true)

passing symmetric = true avoid recalculating each pair in the opposite order, since your function t is symmetric.
Or to get tuples of column names and t values:

pairwise(propertynames(df); symmetric = true) do n1, n2
    (n1, n2, t(df[!,n1], df[!,n2]))
end

giordano · January 11, 2024, 1:46am

I know you said you don’t like math magic voodoo, but

t(x,y)=sum(x.*y)/sqrt(sum(x.^2)*sum(y.^2))

can be simplified (and substantially sped up) with

using LinearAlgebra
t(x,y)=dot(x, y)/sqrt(dot(x, x)*dot(y, y))

by using the dot function from the LinearAlgebra standard library, which computes the scalar product of two vectors, operation which is highly optimised by BLAS libraries.

SergeantMike · January 11, 2024, 10:50pm

To all of you who responded, I thank you very much for the knowledge. I lave learned much and plan to implement in my next fun Julian adventure. The Julia community is very gracious and generous with their knowledge. I knew Julia was the right community for me.

SergeantMike · January 11, 2024, 10:57pm

giordano:

I know you said you don’t like math magic voodoo, but
t(x,y)=sum(x.*y)/sqrt(sum(x.^2)*sum(y.^2))
can be simplified (and substantially sped up) with
using LinearAlgebra
t(x,y)=dot(x, y)/sqrt(dot(x, x)*dot(y, y))
by using the dot function from the LinearAlgebra standard library, which computes the scalar product of two vectors, operation which is highly optimised by BLAS libraries.

It’s not that I don’t like math magic voodoo, my brain isn’t wired like that. I am an evolutionary ecologist and my math interest consists of

Male + Female = Offspring.

Its how the offspring is different from mom and dad and what did the environment have to do with it that my mind ponders. This is a bit different that math.

All that said, I will try your solution and hopefully learn something from it I can use later on.

stevengj · January 11, 2024, 11:27pm

Better to use dot(x, y) / (norm(x) * norm(y)), which avoids spurious overflow/underflow if x or y have norms that are bigger/smaller than about 10^{\pm 150} (though it is slower – trading off speed for safety). (If you don’t care about this, sum(abs2, x) is likely to be faster than dot(x,x).)

Topic		Replies	Views
Sum the columns in a dataframe New to Julia	6	8779	June 24, 2021
Iterate over all columns in a DataFrame New to Julia dataframes	3	3029	May 24, 2021
Iterative, looping split-apply-combines in Julia General Usage	17	906	November 24, 2020
Iterating over a DataFrame New to Julia iterative , dataframes , function	2	708	May 26, 2021
Efficient way to add column to dataframe computed from prior columns New to Julia question	6	2248	August 12, 2021

Loopless calculations

Related topics