Iterative, looping split-apply-combines in Julia

I am new to Julia (and the forum - apologies on my formatting!) and struggling to implement complex split-apply-combine functions, in particular when these involve iterative looping across columns. I could really use some help to develop an approach that, applied to different slices of the data, iteratively constructs pair-wise operations across multiple columns. I’ll give an example below that might seem solvable with brute force / manual approaches, but in reality this will be applied to a large array and so an automated/iterative solution is very much needed.

As an example, consider these data:

test=DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test.subject="a"
test[501:1000,5]="b"

This yields 4 columns of data for two subjects (a, and b), each with 500 rows. I found that the by, combine, and map functions allow a nice solution to apply any given function to any given column by each slice (subject) - but, I’m trying to create a function that iterates across columns simultaneously.

Say, for example, that the goal is to compute some measure, such as the difference between variable X1 and variable X2, and likewise X1 vs. X3 and X1 vs. X4, and then the same for X2 vs. X1, X2 vs. X3, and X2 vs. X4: every possible combination of pairwise differences, while avoiding duplication.

So the pseudocode would be something like:
function dostuff(x,y)
new_var=x-y
end
result=combine(df->dostuff(df.x1, df.x2), groupby(test, [:subject]))

…but this obviously only calculates across 1 possible permutation, and I’m confused as to how to even approach automating this in Julia - nest a function call in a for loop? How do I iteratively cycle through each permutation of columns I need to test? How do I provide the appropriate (and changing) inputs to the function each time it is iterated, i.e. the first “dostuff” would compare X1 and X2, but the next must compare X1 and X3 and so on until all combinations have been done, while avoiding duplication?

Any advice would be appreciated. The final result would be a dataframe that provides the (named) difference calculated in the β€œdostuff” function for each particular combination. (also I picked a simple subtraction just as an example, the real application involves a more complicated calculation. So if there’s a hardwired column differences function or something that won’t help)

Thanks for considering!

I’m a bit confused by this. Why do you want to use combine? It looks like all your operations are by row. Does dostuff produce a scalar or a vector?

You need a combine because the operation is split/applied by subject, as well as iterated across multiple columns.

Okay. The hard part is generating the names. This should do what you want

julia> using Statistics, DataFrames;

julia> test=DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000));

julia> test.subject="a"
       test[501:1000,5]= "b";

julia> function getproduct(names)
           collect.( (vec ∘ collect ∘ Iterators.product)(names, names))
       end;

julia> function dostuff(x, y)
           mean(x) - mean(y)
       end;

julia> combine(groupby(test, "subject"), getproduct(names(test, Not("subject"))) .=> dostuff)
2×17 DataFrame. Omitted printing of 8 columns
│ Row │ subject │ x1_x1_dostuff │ x2_x1_dostuff │ x3_x1_dostuff │ x4_x1_dostuff │ x1_x2_dostuff │ x2_x2_dostuff │ x3_x2_dostuff │ x4_x2_dostuff │
│     │ String  │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │
├─────┼─────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ 1   │ a       │ 0.0           │ 0.000719563   │ 2.31758e-5    │ -0.00244945   │ -0.000719563  │ 0.0           │ -0.000696388  │ -0.00316901   │
│ 2   │ b       │ 0.0           │ 0.0272609     │ 0.0261707     │ 0.0326415     │ -0.0272609    │ 0.0           │ -0.00109015   │ 0.00538059    │

(side note, I invite the reader to think about how hard this would be in dplyr)
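
Note that the full product above also includes self-pairs such as x1_x1 and both orderings of every pair. Here is a minimal sketch of one way to keep each unordered pair only once, assuming the same `test` layout as above. The name `colpairs` is introduced here purely for illustration, and the `repeat` call is just a compact stand-in for the subject assignment.

```julia
using DataFrames, Statistics

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test.subject = repeat(["a", "b"], inner=500)

dostuff(x, y) = mean(x) - mean(y)

cols = names(test, Not("subject"))
# Keep only i < j so each unordered pair of columns appears exactly once
colpairs = [[cols[i], cols[j]] for i in 1:length(cols) for j in (i + 1):length(cols)]

result = combine(groupby(test, "subject"), colpairs .=> dostuff)
```

With four data columns this should give six result columns per subject rather than sixteen.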

If I understand correctly, you are having trouble expressing an iteration over the pairs (:x1,:x2), (:x1,:x3), (:x1,:x4)…
So, first get the list of the column names. I’m not very familiar with DataFrames, so the only way I could find is

colnames = getfield(test, :colindex).names

but I imagine there is a nicer way. (Edit: the right way is names(test) or propertynames(test))
Next, create a function that returns all possible distinct pairs:

function distinct_pairs(list)
  indices = eachindex(list)
  [(list[i], list[j]) for i in indices for j in indices if i != j]
end

Now you can do something like

for (xname, yname) in distinct_pairs(colnames)
  x = getproperty(test, xname)
  y = getproperty(test, yname)
  # do something with columns `x` and `y`
end
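
If (x1, x2) and (x2, x1) should count as the same pair, a variant that keeps only `i < j` might look like the sketch below. The name `unordered_pairs` is made up for illustration, and the code assumes a 1-based indexable list such as a `Vector` of column names.

```julia
# Return each unordered pair once: (list[i], list[j]) with i < j
function unordered_pairs(list)
    n = length(list)
    [(list[i], list[j]) for i in 1:n for j in (i + 1):n]
end

pairs4 = unordered_pairs(["x1", "x2", "x3", "x4"])
```

For n columns this produces n(n-1)/2 pairs, so four columns give six pairs.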

You can also use names(test). getfield(test, :colindex) is not exposed API and shouldn’t be relied on.

Thanks, I found also propertynames(test), which returns symbols instead of strings, which is more convenient to then use getproperty. Edit: oh nevermind, getproperty accepts strings as well.

True, but in DataFrames the convention is generally to use df[!, name] rather than getproperty.
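
For reference, here is a small sketch of that indexing convention, using a toy two-column frame made up for illustration. `df[!, name]` returns the stored column without copying, whereas `df[:, name]` returns a copy.

```julia
using DataFrames

df = DataFrame(x1=[1.0, 2.0, 3.0], x2=[0.5, 0.5, 0.5])

for (xname, yname) in [("x1", "x2"), ("x2", "x1")]
    x = df[!, xname]   # the stored column itself, no copy
    y = df[!, yname]
    println(xname, " - ", yname, " = ", x .- y)
end

nocopy = df[!, "x1"] === df.x1   # same underlying vector
copied = df[:, "x1"] !== df.x1   # `:` allocates a fresh copy
```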

Thanks so much to both of you. I don’t know if I’m more impressed by the speedy solutions, or that you were able to actually parse what I was going for from my confusing post. This is fantastic.

I’m going to play with these solutions and see what works, the actual function I’m working with is more complicated.

And yes, agreed this would be far from simple in dplyr. We previously approached this with nested for loops in Matlab but I was hoping for something a bit more elegant, as each of these approaches offers.

Thanks again. I’m looking into whether multiple posts can be marked “solution”.

Thanks for teaching me. I’ve only used a bit of pandas in the past, but never DataFrames, so I was sure I would have butchered the syntax and the conventions (I’ve actually installed DataFrames just to experiment with this question).

Since I realized that a significant portion of the problem was iterating over the pairs, I thought I would post a solution to that part.

Can this be extended if the output of the function isn’t a scalar?

So if for example instead of mean(x)-mean(y), the “dostuff” function yielded a named tuple for each iteration with 15 distinct measures?

Sorry, I should have clarified that the actual function I’m using yields a NamedTuple with 15 entries.

This yields an error when I try to apply this solution:
“ArgumentError: a single value or vector result is required when passing multiple functions (got Array{Any,1})”

Converting this output to a vector or df within the function doesn’t solve it.

This is possible with DataFrames in the most recently released version, 0.22. What you want to do is transform the source to be AsTable.(getproduct(...)) .=> dostuff => AsTable

You will have to change dostuff so that it accepts a NamedTuple of vectors. This is necessary because you now have to handle the names of the output yourself, so your function needs to see the names of the input columns. The AsTable wrapper on the source is what provides them.
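
A rough sketch of what this could look like, under a few assumptions: `dostuff_nt` is a hypothetical two-output stand-in for the real 15-output function, `colpairs` and `ops` are names made up for illustration, only unordered pairs are used, and the pairs are built with a comprehension rather than broadcasting. The function reads the input column names from the keys of the NamedTuple and uses them to build unique output names.

```julia
using DataFrames, Statistics

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test.subject = repeat(["a", "b"], inner=500)

cols = names(test, Not("subject"))
colpairs = [[cols[i], cols[j]] for i in 1:length(cols) for j in (i + 1):length(cols)]

# Hypothetical stand-in for the real function: it receives a NamedTuple of
# two column vectors and returns a NamedTuple of scalars whose keys encode
# which pair of columns was compared.
function dostuff_nt(nt::NamedTuple)
    xname, yname = keys(nt)
    x, y = values(nt)
    outnames = (Symbol(xname, "_", yname, "_meandiff"),
                Symbol(xname, "_", yname, "_maxdiff"))
    return NamedTuple{outnames}((mean(x) - mean(y), maximum(x) - maximum(y)))
end

# One `source => function => AsTable` operation per column combination
ops = [AsTable(p) => dostuff_nt => AsTable for p in colpairs]
result = combine(groupby(test, "subject"), ops)
```

Each key of the returned NamedTuple becomes its own column, so the result stays in wide form with one row per subject.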

Appreciate the help, again!

I’m going to mark your previous reply as the solution as this ultimately led to the implementation I’m working with now, which does the trick.

Until I wrap my head around the naming/structure conventions, what I’ve done for now is include an extra little conversion step in my function, which outputs the Tuple as a vector.

For example given the “dostuff” function we’ve been discussing, which generates the problematic 15-item Tuple, I can still get the output vector I need if I include an extra step like:

function dostuff(x, y)
res=otherfunctionwith15outputs(x,y)
[i for i in res] #this is the modification
end;

…which yields my desired output in a “long form”, where for each subject I have 15 rows of output, one for each of the 15 entries of the Tuple. I can recover the output names with an extra step, acting on the dataframe returned by the combine statement.

So: a workaround that functions for now, and good advice to focus on improving the structure in the future.

Thanks again!

Oh sorry, yes I read named tuple, implying you wanted many columns for output.

Yes, Tuples do not get “stacked” after a combine call. Only vectors get stacked. You can do

combine(gd, :x => (collect ∘ dostuff) => :y)

if dostuff returns a Tuple.
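
To illustrate the stacking behaviour on a toy example (the frame and the tuple-returning `stats` function are made up for this sketch), composing with `collect` turns each group's Tuple into a vector, which combine then spreads over one row per element:

```julia
using DataFrames

df = DataFrame(subject=repeat(["a", "b"], inner=3),
               x=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
gd = groupby(df, "subject")

# Hypothetical function returning a Tuple of two measures per group
stats(x) = (minimum(x), maximum(x))

# `collect` converts the Tuple to a Vector, so each element gets its own row
out = combine(gd, :x => (collect ∘ stats) => :y)
```

Here `out` has two rows per subject, holding the minimum and then the maximum of `x` for that group.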

Sounds like a cleaner solution than what I came up with, but I’m not sure I’m following.

I think I’m missing what the “gd” in your combine statement is referencing, and how to maintain the columnar interactions (referenced per the getproduct function you shared) as well as the per-subject split-apply-combine?

function getproduct(names)
    collect.( (vec ∘ collect ∘ Iterators.product)(names, names))
end;

function dostuff(x, y)
    res=functionthatreturns15itemNamedTuple(x,y)
    [i for i in res]
end;

#this returns, per “subject”, 15 rows of data, with columns for each column interaction as
#defined in getproduct
result=combine(groupby(test, "subject"), getproduct(names(test, Not("subject"))) .=> dostuff)

#this is a vector of strings with the labels for the rows of the Tuple
labels=["item 1", "item2"…]

#this creates a new variable, measure, in my results data frame;
#it assigns, for each subject, the measure labels for each item of the Tuple
result.measure=repeat(labels, length(unique(result.subject)))

The gd just means groupby(test, "subject"), i.e. the grouped data frame.

I think you are on the right track. I would just put the labels in your original combine call, i.e.

combine(groupby(test, "subject"), [] => (()->  labels) => "label", getproduct()...)

Ah, I see - yes, the conventions are still very new to me.

Thanks again for your help, very much appreciated. I’ll work on integrating the labeling step in the call as suggested, I didn’t realize I could do this.