Iterative, looping split-apply-combines in Julia

newtojulia1 · November 23, 2020, 10:02pm

I am new to Julia (and the forum - apologies on my formatting!) and struggling to implement complex split-apply-combine functions, in particular when these involve iterative looping across columns. I could really use some help to develop an approach that, applied to different slices of the data, iteratively constructs pair-wise operations across multiple columns. I’ll give an example below that might seem solvable with brute force / manual approaches, but in reality this will be applied to a large array and so an automated/iterative solution is very much needed.

As an example, consider these data:

using DataFrames
test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test[!, :subject] .= "a"
test[501:1000, :subject] .= "b"

This yields 4 columns of data for two subjects (a, and b), each with 500 rows. I found that the by, combine, and map functions allow a nice solution to apply any given function to any given column by each slice (subject) - but, I’m trying to create a function that iterates across columns simultaneously.

Say, for example, that the goal is to compute some measure, such as the difference between variable x1 and variable x2, and likewise x1 vs. x3 and x1 vs. x4, and then the same for x2 vs. x1, x2 vs. x3, and x2 vs. x4: every possible combination of pairwise differences, while avoiding duplication.

So the pseudocode would be something like:
function dostuff(x, y)
    new_var = x - y
end
result = combine(df -> dostuff(df.x1, df.x2), groupby(test, :subject))

…but this obviously only calculates across 1 possible permutation, and I’m confused as to how to even approach automating this in Julia - nest a function call in a For Loop? How do I iteratively cycle through each permutation of columns I need to test? How do I provide the appropriate (and changing) inputs to the function each time it is iterated, i.e. the first “dostuff” would compare X1 and X2, but the next must compare X1 and X3 and so on until all combinations have been done, while avoiding duplication.

Any advice would be appreciated. The final result would be a dataframe that provides the (named) difference calculated in the “dostuff” function for each particular combination. (also I picked a simple subtraction just as an example, the real application involves a more complicated calculation. So if there’s a hardwired column differences function or something that won’t help)

Thanks for considering!

pdeffebach · November 23, 2020, 10:07pm

I’m a bit confused by this. Why do you want to use combine? It looks like all your operations are by row. Does dostuff produce a scalar or a vector?

newtojulia1 · November 23, 2020, 10:27pm

You need a combine because the operation is split/applied by subject, as well as iterated across multiple columns.

pdeffebach · November 23, 2020, 10:32pm

Okay. The hard part is generating the names. This should do what you want

julia> using Statistics, DataFrames;

julia> test=DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000));

julia> test[!, :subject] .= "a";

julia> test[501:1000, :subject] .= "b";

julia> function getproduct(names)
           collect.( (vec ∘ collect ∘ Iterators.product)(names, names))
       end;

julia> function dostuff(x, y)
           mean(x) - mean(y)
       end;

julia> combine(groupby(test, "subject"), getproduct(names(test, Not("subject"))) .=> dostuff)
2×17 DataFrame. Omitted printing of 8 columns
│ Row │ subject │ x1_x1_dostuff │ x2_x1_dostuff │ x3_x1_dostuff │ x4_x1_dostuff │ x1_x2_dostuff │ x2_x2_dostuff │ x3_x2_dostuff │ x4_x2_dostuff │
│     │ String  │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │
├─────┼─────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ 1   │ a       │ 0.0           │ 0.000719563   │ 2.31758e-5    │ -0.00244945   │ -0.000719563  │ 0.0           │ -0.000696388  │ -0.00316901   │
│ 2   │ b       │ 0.0           │ 0.0272609     │ 0.0261707     │ 0.0326415     │ -0.0272609    │ 0.0           │ -0.00109015   │ 0.00538059    │

(side note, I invite the reader to think about how hard this would be in dplyr)
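Since the question asked to avoid duplicated pairs, here is a minimal sketch of how the pair generation could be narrowed. It uses only Base Julia. `getproduct` is the helper from the post above; `getpairs` is a hypothetical variant (not from the thread) that keeps each unordered pair once and drops self-pairs such as x1 vs. x1.

```julia
# Helper from the post above: every ordered pair, including self-pairs.
getproduct(names) = collect.((vec ∘ collect ∘ Iterators.product)(names, names))

# Hypothetical variant: keep each unordered pair exactly once (i < j).
function getpairs(names)
    [[names[i], names[j]] for i in eachindex(names) for j in eachindex(names) if i < j]
end

cols = ["x1", "x2", "x3", "x4"]
allpairs = getproduct(cols)     # 16 two-element vectors
uniquepairs = getpairs(cols)    # 6 two-element vectors
```

Under these assumptions, passing getpairs(names(test, Not("subject"))) .=> dostuff to the combine call should give 6 result columns per subject instead of 16.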

FedericoStra · November 23, 2020, 10:33pm

If I understand correctly, you are having trouble expressing an iteration over the pairs (:x1,:x2), (:x1,:x3), (:x1,:x4)…
So, first get the list of the column names. I’m not very familiar with DataFrames, so the only way I could find is

colnames = getfield(test, :colindex).names

but I imagine there is a nicer way. (Edit: the right way is names(test) or propertynames(test))
Next, create a function that returns all possible distinct pairs:

function distinct_pairs(list)
  indices = eachindex(list)
  [(list[i], list[j]) for i in indices for j in indices if i != j]
end

Now you can do something like

for (xname, yname) in distinct_pairs(colnames)
  x = getproperty(test, xname)
  y = getproperty(test, yname)
  # do something with columns `x` and `y`
end
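As a rough sketch of how this loop might be combined with the per-subject split, the example below iterates over the groups and over each pair with i < j (so every pair appears once), collecting the results in long form. The `dostuff` body is only a placeholder borrowed from the earlier reply, and the output column names `pair` and `value` are made up for illustration.

```julia
using DataFrames, Statistics

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test[!, :subject] .= "a"
test[501:1000, :subject] .= "b"

# Placeholder for the real calculation.
dostuff(x, y) = mean(x) - mean(y)

cols = names(test, Not(:subject))
result = DataFrame(subject = String[], pair = String[], value = Float64[])

for sub in groupby(test, :subject)            # one sub-data-frame per subject
    for i in eachindex(cols), j in eachindex(cols)
        i < j || continue                     # skip self-pairs and mirrored pairs
        push!(result, (subject = first(sub.subject),
                       pair    = cols[i] * "_" * cols[j],
                       value   = dostuff(sub[!, cols[i]], sub[!, cols[j]])))
    end
end
```

With 2 subjects and 4 columns this yields 12 rows, one per subject and pair.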

pdeffebach · November 23, 2020, 10:38pm

You can also use names(test). getfield(test, :colindex) is not exposed API and shouldn’t be relied on.

FedericoStra · November 23, 2020, 10:40pm

Thanks, I found also propertynames(test), which returns symbols instead of strings, which is more convenient to then use getproperty. Edit: oh nevermind, getproperty accepts strings as well.

pdeffebach · November 23, 2020, 10:41pm

True, but the convention in DataFrames is to use df[!, name] rather than getproperty.
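For reference, a small sketch of the indexing convention mentioned here, on a made-up two-column data frame: `df[!, name]` returns the stored column without copying, while `df[:, name]` returns a copy. Both accept a String or a Symbol.

```julia
using DataFrames

df = DataFrame(x1 = [1.0, 2.0, 3.0], x2 = [4.0, 5.0, 6.0])

nocopy = df[!, "x1"]   # the stored column itself, no copy
copied = df[:, "x1"]   # a fresh copy of the column

same_object = nocopy === df.x1    # df.x1 also returns the stored column
is_copy     = copied !== df.x1

# Strings and Symbols select the same column.
same_column = df[!, :x2] === df[!, "x2"]
```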

newtojulia1 · November 23, 2020, 10:42pm

Thanks so much to both of you. I don’t know if I’m more impressed by the speedy solutions, or that you were able to actually parse what I was going for from my confusing post. This is fantastic.

I’m going to play with these solutions and see what works, the actual function I’m working with is more complicated.

And yes, agreed this would be far from simple in dplyr. We previously approached this with nested for loops in Matlab but I was hoping for something a bit more elegant, as each of these approaches offers.

Thanks again. I’m looking into whether multiple posts can be marked “solution”.

FedericoStra · November 23, 2020, 10:46pm

Thanks for teaching me. I’ve only used a bit of pandas in the past, but never DataFrames, so I was sure I would have butchered the syntax and the conventions (I’ve actually installed DataFrames just to experiment with this question).

Since I realized that a significant portion of the problem was iterating over the pairs, I thought I would post a solution to that part.

newtojulia1 · November 23, 2020, 11:06pm

Can this be extended if the output of the function isn’t a scalar?

So if for example instead of mean(x)-mean(y), the “dostuff” function yielded a named tuple for each iteration with 15 distinct measures?

newtojulia1 · November 23, 2020, 11:09pm

Sorry, I should have clarified that the actual function I’m using yields a NamedTuple with 15 entries.

This yields an error when I try to apply this solution:
“ArgumentError: a single value or vector result is required when passing multiple functions (got Array{Any,1})”

Converting this output to a vector or data frame within the function doesn’t solve it.

pdeffebach · November 24, 2020, 12:54am

This is possible with DataFrames in the most recently released version, 0.22. What you want to do is transform the source to be AsTable.(getproduct(...)) .=> dostuff => AsTable

You will have to change dostuff so that it accepts a NamedTuple of vectors. This is necessary because you now have to handle the names of the output columns yourself, so your function needs to see the names of the input columns. The AsTable wrapper in the source provides them.
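A minimal sketch of what this could look like, assuming DataFrames 0.22 or later. The `dostuff` below is a hypothetical stand-in with two outputs rather than 15, and the output names (`_meandiff`, `_stddiff`) are invented for illustration. The pairs are restricted to i < j, following the original request to avoid duplicates.

```julia
using DataFrames, Statistics

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test[!, :subject] .= "a"
test[501:1000, :subject] .= "b"

# Hypothetical stand-in: receives a NamedTuple of two column vectors and
# returns a NamedTuple whose keys become the output column names.
function dostuff(nt)
    xname, yname = keys(nt)          # names of the two input columns
    x, y = values(nt)                # the two column vectors
    prefix = string(xname, "_", yname)
    outnames = (Symbol(prefix, "_meandiff"), Symbol(prefix, "_stddiff"))
    NamedTuple{outnames}((mean(x) - mean(y), std(x) - std(y)))
end

cols = names(test, Not(:subject))
colpairs = [[cols[i], cols[j]] for i in eachindex(cols) for j in eachindex(cols) if i < j]

# AsTable on the source passes a NamedTuple in; AsTable on the target
# expands the returned NamedTuple into one column per key.
result = combine(groupby(test, :subject),
                 [AsTable(p) => dostuff => AsTable for p in colpairs])
```

The result has one row per subject and two columns per pair, named after the input columns.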

newtojulia1 · November 24, 2020, 1:32am

Appreciate the help, again!

I’m going to mark your previous reply as the solution as this ultimately led to the implementation I’m working with now, which does the trick.

Until I wrap my head around the naming/structure conventions, what I’ve done for now is include an extra little conversion step in my function, which outputs the Tuple as a vector.

For example given the “dostuff” function we’ve been discussing, which generates the problematic 15-item Tuple, I can still get the output vector I need if I include an extra step like:

function dostuff(x, y)
    res = otherfunctionwith15outputs(x, y)
    [i for i in res]  # this is the modification
end;

…which yields my desired output in “long form”, where for each subject I have 15 rows of output, one for each of the 15 entries of the Tuple. I can recover the output names with an additional step, acting on the data frame returned by the combine statement.

So: a workaround that functions for now, and good advice to focus on improving the structure in the future.

Thanks again!

pdeffebach · November 24, 2020, 1:54pm

Oh sorry, yes I read named tuple, implying you wanted many columns for output.

Yes, Tuples do not get “stacked” after a combine call. Only vectors get stacked. You can do

combine(gd, :x => (collect ∘ dostuff) => :y)

if dostuff returns a Tuple.
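To illustrate the stacking behaviour on a small made-up data frame: `extremes` below is a hypothetical stand-in for a function that returns a Tuple, and composing it with `collect` turns the Tuple into a Vector, so each element becomes its own row within the group.

```julia
using DataFrames

df = DataFrame(subject = ["a", "a", "b", "b"], x = [1.0, 2.0, 3.0, 5.0])
gd = groupby(df, :subject)

# Hypothetical stand-in for a function that returns a Tuple.
extremes(x) = (minimum(x), maximum(x))

# collect turns the Tuple into a Vector, so the two values are stacked as rows.
long = combine(gd, :x => (collect ∘ extremes) => :y)
```

Here `long` has two rows per subject, one for each element of the Tuple.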

newtojulia1 · November 24, 2020, 3:17pm

Sounds like a cleaner solution than what I came up with, but I’m not sure I’m following.

I think I’m missing what the “gd” in your combine statement is referencing, and how to maintain the columnar interactions (referenced per the getproduct function you shared) as well as the per-subject split-apply-combine?

function getproduct(names)
    collect.( (vec ∘ collect ∘ Iterators.product)(names, names))
end;

function dostuff(x, y)
    res = functionthatreturns15itemNamedTuple(x, y)
    [i for i in res]
end;

# this returns, per "subject", 15 rows of data, with columns for each column interaction as defined in getproduct
result = combine(groupby(test, "subject"), getproduct(names(test, Not("subject"))) .=> dostuff)

# this is a vector of strings with the labels for the rows of the Tuple
labels = ["item1", "item2", …]

# this creates a new variable, measure, in my results data frame;
# it assigns, for each subject, the measure labels for each item of the Tuple
result.measure = repeat(labels, length(unique(result.subject)))

pdeffebach · November 24, 2020, 3:28pm

The gd just means groupby(test, "subject"), i.e. the grouped data frame.

I think you are on the right track. I would just put the labels in your original combine call, i.e.

combine(groupby(test, "subject"), [] => (()->  labels) => "label", getproduct()...)

newtojulia1 · November 24, 2020, 3:45pm

Ah, I see - yes, the conventions are still very new to me.

Thanks again for your help, very much appreciated. I’ll work on integrating the labeling step in the call as suggested, I didn’t realize I could do this.
