 # Iterative, looping split-apply-combines in Julia

I am new to Julia (and the forum - apologies on my formatting!) and struggling to implement complex split-apply-combine functions, in particular when these involve iterative looping across columns. I could really use some help to develop an approach that, applied to different slices of the data, iteratively constructs pair-wise operations across multiple columns. I’ll give an example below that might seem solvable with brute force / manual approaches, but in reality this will be applied to a large array and so an automated/iterative solution is very much needed.

As an example, consider these data:

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test[!, :subject] .= "a"
test[501:1000, :subject] .= "b"

This yields 4 columns of data for two subjects (a and b), each with 500 rows. I found that the by, combine, and map functions allow a nice solution to apply any given function to any given column by each slice (subject), but I’m trying to create a function that iterates across pairs of columns.

Say, for example, that the goal is to compute some measure, such as the difference between variable X1 and variable X2, and likewise X1 vs. X3 and X1 vs. X4. I will want to do the same for X2 vs. X1, X2 vs. X3, and X2 vs. X4: every possible combination of pairwise differences, while avoiding duplication.

So the pseudocode would be something like:
function dostuff(x, y)
    new_var = x - y
end
result = combine(df -> dostuff(df.x1, df.x2), groupby(test, [:subject]))

…but this obviously only calculates across 1 possible permutation, and I’m confused as to how to even approach automating this in Julia - nest a function call in a For Loop? How do I iteratively cycle through each permutation of columns I need to test? How do I provide the appropriate (and changing) inputs to the function each time it is iterated, i.e. the first “dostuff” would compare X1 and X2, but the next must compare X1 and X3 and so on until all combinations have been done, while avoiding duplication.

Any advice would be appreciated. The final result would be a dataframe that provides the (named) difference calculated in the “dostuff” function for each particular combination. (also I picked a simple subtraction just as an example, the real application involves a more complicated calculation. So if there’s a hardwired column differences function or something that won’t help)

Thanks for considering!

I’m a bit confused by this. Why do you want to use `combine`? It looks like all your operations are by row. Does `dostuff` produce a scalar or a vector?


You need a combine because the operation is split/applied by subject, as well as iterated across multiple columns.

Okay. The hard part is generating the names. This should do what you want

``````julia> using Statistics, DataFrames;

julia> test=DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000));

julia> test.subject="a"
test[501:1000,5]= "b";

julia> function getproduct(names)
           collect.( (vec ∘ collect ∘ Iterators.product)(names, names))
       end;

julia> function dostuff(x, y)
           mean(x) - mean(y)
       end;

julia> combine(groupby(test, "subject"), getproduct(names(test, Not("subject"))) .=> dostuff)
2×17 DataFrame. Omitted printing of 8 columns
│ Row │ subject │ x1_x1_dostuff │ x2_x1_dostuff │ x3_x1_dostuff │ x4_x1_dostuff │ x1_x2_dostuff │ x2_x2_dostuff │ x3_x2_dostuff │ x4_x2_dostuff │
│     │ String  │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │ Float64       │
├─────┼─────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ 1   │ a       │ 0.0           │ 0.000719563   │ 2.31758e-5    │ -0.00244945   │ -0.000719563  │ 0.0           │ -0.000696388  │ -0.00316901   │
│ 2   │ b       │ 0.0           │ 0.0272609     │ 0.0261707     │ 0.0326415     │ -0.0272609    │ 0.0           │ -0.00109015   │ 0.00538059    │
``````
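Note that the output above contains every ordered pair, including self-pairs such as `x1_x1_dostuff` and both `x1_x2` and `x2_x1`. If each unordered pair should appear only once, as the original question asks, one possible variant is sketched below. The helper name `unique_pairs` is made up for this illustration and is not part of DataFrames.

```julia
using DataFrames, Statistics

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test.subject = repeat(["a", "b"], inner=500)

dostuff(x, y) = mean(x) - mean(y)

# Hypothetical helper: every unordered pair of column names exactly once,
# i.e. (x1,x2), (x1,x3), (x1,x4), (x2,x3), (x2,x4), (x3,x4).
function unique_pairs(cols)
    [[cols[i], cols[j]] for i in 1:length(cols) for j in (i + 1):length(cols)]
end

cols = names(test, Not("subject"))
result = combine(groupby(test, "subject"), unique_pairs(cols) .=> dostuff)
```

With four data columns this should give six result columns plus `subject`, with automatically generated names following the same `x1_x2_dostuff` pattern.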

(side note, I invite the reader to think about how hard this would be in dplyr)


If I understand correctly, you are having trouble expressing an iteration over the pairs `(:x1,:x2)`, `(:x1,:x3)`, `(:x1,:x4)`, and so on.
So, first get the list of the column names. I’m not very familiar with `DataFrames`, so the only way I could find is

``````colnames = getfield(test, :colindex).names
``````

but I imagine there is a nicer way. (Edit: the right way is `names(test)` or `propertynames(test)`)
Next, create a function that returns all possible distinct pairs:

``````function distinct_pairs(list)
    indices = eachindex(list)
    [(list[i], list[j]) for i in indices for j in indices if i != j]
end
``````

Now you can do something like

``````for (xname, yname) in distinct_pairs(colnames)
    x = getproperty(test, xname)
    y = getproperty(test, yname)
    # do something with columns `x` and `y`
end
``````
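To also keep the per-subject split, the same loop can run inside each group. The following is only a sketch: it assumes a scalar-valued `dostuff` (the difference of means, used here as a placeholder) and collects one row per subject and column pair into a long table.

```julia
using DataFrames, Statistics

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test.subject = repeat(["a", "b"], inner=500)

# Placeholder for the real calculation
dostuff(x, y) = mean(x) - mean(y)

function distinct_pairs(list)
    indices = eachindex(list)
    [(list[i], list[j]) for i in indices for j in indices if i != j]
end

cols = names(test, Not("subject"))

# One NamedTuple per (subject, column pair); iterating a grouped
# data frame yields one sub-data-frame per subject.
rows = [(subject = first(g.subject),
         pair = string(xname, "_", yname),
         value = dostuff(g[!, xname], g[!, yname]))
        for g in groupby(test, "subject")
        for (xname, yname) in distinct_pairs(cols)]

long = DataFrame(rows)
```

Because `distinct_pairs` keeps both orderings of each pair, this gives 12 rows per subject. Restricting the condition to `i < j` would keep each unordered pair once.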

You can also use `names(test)`. `getfield(test, :colindex)` is not exposed API and shouldn’t be relied on.


Thanks, I found also `propertynames(test)`, which returns symbols instead of strings, which is more convenient to then use `getproperty`. Edit: oh nevermind, `getproperty` accepts strings as well.


True, but in general in DataFrames it’s convention to do `df[!, name]` rather than `getproperty`.
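For reference, a small sketch of the accessors mentioned in this exchange, using a toy data frame:

```julia
using DataFrames

test = DataFrame(x1=rand(4), x2=rand(4))

colnames_str = names(test)          # column names as strings
colnames_sym = propertynames(test)  # column names as symbols

col = test[!, "x1"]      # the stored column itself, no copy is made
copycol = test[:, "x1"]  # an independent copy of the column
```

Both strings and symbols are accepted as column indices, so either list of names can drive a loop over columns.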


Thanks so much to both of you. I don’t know if I’m more impressed by the speedy solutions, or that you were able to actually parse what I was going for from my confusing post. This is fantastic.

I’m going to play with these solutions and see what works, the actual function I’m working with is more complicated.

And yes, agreed this would be far from simple in dplyr. We previously approached this with nested for loops in Matlab but I was hoping for something a bit more elegant, as each of these approaches offers.

Thanks again. I’m looking into if multiple posts can be marked “solution”

Thanks for teaching me. I’ve only used a bit of pandas in the past, but never `DataFrames`, so I was sure I would have butchered the syntax and the conventions (I’ve actually installed `DataFrames` just to experiment with this question).

Since I realized that a significant portion of the problem was iterating over the pairs, I thought I would post a solution to that part.


Can this be extended if the output of the function isn’t a scalar?

So if for example instead of mean(x)-mean(y), the “dostuff” function yielded a named tuple for each iteration with 15 distinct measures?

Sorry, I should have clarified that the actual function I’m using yields a NamedTuple with 15 entries.

This yields an error when I try to apply this solution:
“Argument Error: a single value or vector result is required when passing multiple functions (got Array{Any,1})”

Converting this output to a vector or df within the function doesn’t solve it.

This is possible with DataFrames in the most recently released version, `0.22`. What you want to do is transform the source to be `AsTable.(getproduct(...)) .=> dostuff => AsTable`

You will have to change `dostuff` so that it accepts a `NamedTuple` of vectors. This is necessary because with `=> AsTable` as the destination you have to handle the names of the output columns on your own, so your function needs to see the names of the input columns. The `AsTable` wrapper in the source is what passes those names through.
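A minimal sketch of what this could look like, assuming DataFrames 0.22 or later. The two measures computed here are placeholders for the real 15-entry result, and the output naming scheme is just one possible choice.

```julia
using DataFrames, Statistics

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test.subject = repeat(["a", "b"], inner=500)

# Placeholder for the real function. `nt` is a NamedTuple holding the
# two input columns, e.g. (x1 = [...], x2 = [...]).
function dostuff(nt)
    xname, yname = keys(nt)   # names of the two input columns
    x, y = values(nt)         # the column vectors themselves
    prefix = string(xname, "_", yname, "_")
    outnames = (Symbol(prefix, "meandiff"), Symbol(prefix, "stddiff"))
    # Output names must be unique across pairs, hence the prefix
    NamedTuple{outnames}((mean(x) - mean(y), std(x) - std(y)))
end

cols = names(test, Not("subject"))
colpairs = [[cols[i], cols[j]] for i in 1:length(cols) for j in (i + 1):length(cols)]

wide = combine(groupby(test, "subject"),
               [AsTable(p) => dostuff => AsTable for p in colpairs])
```

Each entry of the returned `NamedTuple` becomes its own column, so the result has one row per subject and one column per pair and measure.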


Appreciate the help, again!

I’m going to mark your previous reply as the solution as this ultimately led to the implementation I’m working with now, which does the trick.

Until I wrap my head around the naming/structure conventions, what I’ve done for now is include an extra little conversion step in my function, which outputs the Tuple as a vector.

For example given the “dostuff” function we’ve been discussing, which generates the problematic 15-item Tuple, I can still get the output vector I need if I include an extra step like:

function dostuff(x, y)
    res = otherfunctionwith15outputs(x, y)
    [i for i in res]  # this is the modification
end;

…which yields my desired output in a “long form”, where for each subject I have 15 rows of output, one for each of the 15 entries of the Tuple. I can recover the output names with a further step, acting on the dataframe returned by the combine statement.

So: a workaround that functions for now, and good advice to focus on improving the structure in the future.

Thanks again!

Oh sorry, yes I read `named tuple`, implying you wanted many columns for output.

Yes, `Tuple`s do not get “stacked” after a `combine` call. Only vectors get stacked. You can do

``````combine(gd, :x => (collect ∘ dostuff) => :y)
``````

if `dostuff` returns a `Tuple`.
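As a concrete sketch with two input columns: the function `measures` below is a hypothetical stand-in that returns a plain `Tuple` of three values, and `collect` turns that tuple into a vector so that `combine` stacks it into rows.

```julia
using DataFrames, Statistics

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test.subject = repeat(["a", "b"], inner=500)

# Hypothetical stand-in for a function returning a Tuple of measures
measures(x, y) = (mean(x) - mean(y), std(x) - std(y), maximum(x) - maximum(y))

gd = groupby(test, "subject")

# Three rows per subject, one per entry of the tuple
long = combine(gd, ["x1", "x2"] => (collect ∘ measures) => "x1_x2")
```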


Sounds like a cleaner solution than what I came up with, but I’m not sure I’m following.

I think I’m missing what the “gd” in your combine statement is referencing, and how to maintain the columnar interactions (referenced per the getproduct function you shared) as well as the per-subject split-apply-combine?

function getproduct(names)
    collect.( (vec ∘ collect ∘ Iterators.product)(names, names))
end;

function dostuff(x, y)
    res = functionthatreturns15itemNamedTuple(x, y)
    [i for i in res]
end;

# this returns, per "subject", 15 rows of data, with one column for each
# column interaction as defined in getproduct
result = combine(groupby(test, "subject"), getproduct(names(test, Not("subject"))) .=> dostuff)

# this is a vector of strings with the labels for the rows of the Tuple
labels = ["item 1", "item2", …]

# this creates a new variable, measure, in my results data frame and assigns,
# for each subject, the measure labels for each item of the Tuple
result.measure = repeat(labels, length(unique(result.subject)))

The `gd` just means `groupby(test, "subject")`, i.e. the grouped data frame.

I think you are on the right track. I would just put the `labels` in your original combine call, i.e.

``````combine(groupby(test, "subject"), [] => (()->  labels) => "label", getproduct()...)
``````
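A fuller sketch of the same idea is below. It deliberately swaps the empty selector `[]` for a variant that passes an arbitrary column and ignores it, which has the same effect of emitting the labels once per group. The function `measures`, the label strings, and the output column names are all placeholders.

```julia
using DataFrames, Statistics

test = DataFrame(x1=rand(1000), x2=rand(1000), x3=rand(1000), x4=rand(1000))
test.subject = repeat(["a", "b"], inner=500)

# Placeholder measures and their labels, in matching order
measures(x, y) = (mean(x) - mean(y), std(x) - std(y), maximum(x) - maximum(y))
labels = ["meandiff", "stddiff", "maxdiff"]

cols = names(test, Not("subject"))
colpairs = [[cols[i], cols[j]] for i in 1:length(cols) for j in (i + 1):length(cols)]

result = combine(groupby(test, "subject"),
                 # the input column is ignored; labels are repeated per group
                 :x1 => (_ -> labels) => :measure,
                 # one explicitly named output column per column pair
                 [p => (collect ∘ measures) => join(p, "_") for p in colpairs]...)
```

Every transformation returns a vector of the same length within a group, so the label column lines up row by row with the measures.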

Ah, I see - yes, the conventions are still very new to me.

Thanks again for your help, very much appreciated. I’ll work on integrating the labeling step in the call as suggested, I didn’t realize I could do this.