DataFrame groups as an argument of a function

Hi guys,
Could you help me, please?
How can I pass the groups obtained from a DataFrame with the split-apply-combine strategy as an argument to a function that is called inside another function? I have tried three ways, but none of them works. See the three attempts below.
Thank you so much in advance.

# V1
using Chain
using DataFramesMeta
function function_1(df_1::AbstractDataFrame, df_2::AbstractDataFrame)
    @chain df_1 begin
        groupby(:Column_ID)
        @combine  function_2([:,:], df_2)
    end
end

# V2
function function_1(df_1::AbstractDataFrame, df_2::AbstractDataFrame)
  combine(groupby(df_1, :Column_ID) .=> function_2([:,:], df_2) .=> :Result)
end

# V3
function function_1(df_1::AbstractDataFrame, df_2::AbstractDataFrame)
  ID = groupby(df_1, :Column_ID)
  combine(mgs, analysis_mgs(ID, df_2))
end

The @combine strategy won’t work. There is no way to refer to multiple columns at the moment.

Note, the x .=> ... part only applies when x is a list of columns. You don’t pipe the GroupedDataFrame into a function directly.

It sounds like you want the anonymous function version of combine.

julia> begin 
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df1 begin 
               groupby(:a)
               combine(_) do sdf1
                   analyze(sdf1, df2)
               end
           end
       end
2×3 DataFrame
 Row │ a      x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      5
   2 │     2      2      5

Thank you much @pdeffebach . That is what I was looking for

@pdeffebach,
Is it possible to break your function “analyze” based on the result? I am trying something like

julia> begin 
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df1 begin 
               groupby(:a)
               combine(_) do sdf1
                   analyze(sdf1, df2) == 1 && break # Where 1 is the expected result
               end
           end
       end

Thank you

combine is kind of hard to reason about in that way, since it creates an anonymous function. Maybe you want to use a for loop directly, something like

julia> begin 
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df1 begin 
               groupby(:a)
               for sdf1 in _
                   analyze(sdf1, df2) == 1 && break # Where 1 is the expected result
               end
           end
       end
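As a self-contained variant of the loop above, here is a minimal sketch that stops at the first group meeting a condition. It assumes analyze returns a plain number rather than a DataFrame (so that the == comparison is meaningful), and first_match is a hypothetical helper name introduced only for illustration.

```julia
using DataFrames

df1 = DataFrame(a = [1, 2, 3], b = [3, 4, 5])
df2 = DataFrame(c = [5, 6], d = [7, 8])

# Assumption: analyze returns a scalar so it can be compared with ==.
analyze(sdf1, df2) = first(sdf1.b) + first(df2.c)

# Hypothetical helper: return the :a value of the first group whose
# analysis equals `target`, or `nothing` if no group matches.
function first_match(df1, df2, target)
    for sdf1 in groupby(df1, :a)
        if analyze(sdf1, df2) == target
            return first(sdf1.a)   # returning ends the loop early
        end
    end
    return nothing
end

hit = first_match(df1, df2, 9)    # the groups evaluate to 8, 9 and 10
miss = first_match(df1, df2, 0)   # no group evaluates to 0
```

Wrapping the loop in a function lets return play the role of break while also handing back a result.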

Hi @pdeffebach,
My apologies for the late reply; I have been working on another project these past few days.
Your solution works perfectly.
Thank you so much!
Boris

Hi @pdeffebach,

Could you guide me, please?
Can multi-threading be applied to the “combine” call in this example?

Thank you

Boris

julia> begin 
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df1 begin 
               groupby(:a)
               combine(_) do sdf1
                   analyze(sdf1, df2)
               end
           end
       end

No, at least not transparently. Please open up a new thread to discuss this. I am not an expert on multithreading.

Hi @pdeffebach,

Thank you. I opened up a new thread to discuss it. Here is the link in case you want to follow it: Multi-threading to the “combine” function

Cheers

Boris

Hi @pdeffebach,

Could you help me, please? I want to put the combine call you showed me inside a for loop that runs over every element of a column of the other DataFrame. I’ve tried the code below, but it doesn’t work. Do you know how to do it?

Thanks

Boris

julia> begin 
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df2 begin
               groupby(:d)
               for sdf2 in _
                   @chain df1 begin 
                       groupby(:a)
                       combine(_) do sdf1
                           analyze(sdf1, sdf2)
                       end
                   end
               end
           end
       end

for loops don’t return a value in Julia (they evaluate to nothing), which is why that block returns nothing. It’s not clear what you want to do with this code, so it’s hard to help.
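To illustrate the point with plain Julia (no DataFrames needed), here is a small sketch comparing what a for loop evaluates to with what map and a comprehension return. The variable names are only illustrative.

```julia
xs = [1, 2, 3]

# A `for` loop is run for its side effects; the loop expression itself
# evaluates to `nothing`, whatever the body computes.
loop_value = for x in xs
    x^2
end

# `map` collects the value of each iteration into a new array.
mapped = map(x -> x^2, xs)

# A comprehension is an equivalent way to collect the results.
comprehended = [x^2 for x in xs]
```

To keep results from a for loop, one has to store them explicitly, for example with push! into a vector defined before the loop.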

Thank you for your reply @pdeffebach,

I aim to run two nested for loops (or their equivalent with combine) efficiently. My idea was to multi-thread the outer for loop and use combine inside it, because combine is faster.

I want to do something like in the pseudocode below

julia> begin 
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           for x in df1[a]
                for y in df2[c]
            do function analyze(x, y)
               

The “for-loop that returns results” is map in Julia. So, your latest pseudocode example should look like this:

map(df1.a) do x
    map(df2.c) do y
        analyze(x, y)
    end
end
# results in a vector-of-vectors

This makes it trivial to parallelize as well: run import ThreadsX and replace the outer map with ThreadsX.map.
You can also combine these two loops/maps into a single one, which is often convenient:

map(Iterators.product(df1.a, df2.c)) do (x, y)
    analyze(x, y)
end
# results in a 2d array
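For readers who prefer not to add a package, a rough Base-only sketch of the same idea follows. It swaps ThreadsX.map for Threads.@threads, which is a different technique from the one suggested above, and it assumes that analyze is safe to call from several threads. The vectors xs and ys stand in for the two columns, and the analyze defined here is a placeholder.

```julia
xs = [1, 2]   # stands in for the column of df1
ys = [5, 6]   # stands in for the column of df2

# Placeholder analysis (an assumption); the real one must be thread-safe.
analyze(x, y) = x * y

# Preallocate the 2d result so every iteration writes to its own slot.
results = Matrix{Int}(undef, length(xs), length(ys))

# The outer loop is split across the available threads.
Threads.@threads for i in eachindex(xs)
    for j in eachindex(ys)
        results[i, j] = analyze(xs[i], ys[j])
    end
end

# Sequential reference built with map over the product, for comparison.
reference = map(Iterators.product(xs, ys)) do (x, y)
    analyze(x, y)
end
```

Julia has to be started with more than one thread (for example with the -t flag) for the loop to actually run in parallel; with a single thread the code still runs, only sequentially.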

Just to add that df[a] is not supported syntax in DataFrames; it should be df[!, :a] or df.a.
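A short sketch of the supported ways to get a column, on a toy DataFrame made up for illustration. It also shows df[:, :a], which is not mentioned above but is worth knowing when allocations matter, because it copies.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

col_dot  = df.a        # the stored column itself, no copy
col_bang = df[!, :a]   # same object as df.a
col_copy = df[:, :a]   # a freshly allocated copy of the column
```

For large tables, df.a or df[!, :a] avoids the extra allocation that df[:, :a] makes.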


Thank you so much @aplavin and @nilshg, I really appreciate your help.

However, I think I didn’t clearly show what I need. What I need is to apply a function “analyze” to two sub-DataFrames, one from df1 and another from df2. Something like the pseudocode below.

julia> begin 
           df1 = DataFrame(a = ["ID1", "ID1", "ID2", "ID2"], b = [1, 2, 3, 4])
           df2 = DataFrame(c = ["ID3", "ID3", "ID4", "ID4"], d = [5, 6, 7, 8])
           for x in unique(df1.a)
               for y in unique(df2.c)
                   analyze(df1[df1.a .== x, :], df2[df2.c .== y, :])
               end
           end
       end

In this case, I expect analyze to be called four times, once per combination of IDs:

analyze(df1[df1.a .== "ID1", :], df2[df2.c .== "ID3", :])
analyze(df1[df1.a .== "ID1", :], df2[df2.c .== "ID4", :])
analyze(df1[df1.a .== "ID2", :], df2[df2.c .== "ID3", :])
analyze(df1[df1.a .== "ID2", :], df2[df2.c .== "ID4", :])

The challenge is to do it quickly and keep allocations to a minimum, because df1 and df2 are huge.
That’s why I have used the Chain package as discussed above. Is it possible to do it with map?

Thank you again for any help.

Cheers

Boris

You probably just want to take a view then? I.e. something like

result_vector = ResultType[]  # ResultType is a placeholder for whatever type analyze returns

for (x, y) in Iterators.product(unique(df1.a), unique(df2.c))
    df1_subset = @view df1[df1.a .== x, :]
    df2_subset = @view df2[df2.c .== y, :]
    push!(result_vector, analyze(df1_subset, df2_subset))
end
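A possible variant of the loop above, offered as a sketch rather than a benchmarked recommendation: group each table once with groupby and loop over the groups, so that the boolean mask is not recomputed for every pair. Iterating a GroupedDataFrame yields SubDataFrame views, so no rows are copied. The analyze defined here is a placeholder, and the toy data mirrors the example in the question.

```julia
using DataFrames

df1 = DataFrame(a = ["ID1", "ID1", "ID2", "ID2"], b = [1, 2, 3, 4])
df2 = DataFrame(c = ["ID3", "ID3", "ID4", "ID4"], d = [5, 6, 7, 8])

# Placeholder analysis (an assumption): sum of :b times sum of :d.
analyze(sdf1, sdf2) = sum(sdf1.b) * sum(sdf2.d)

# Group each table once; each group is a view into its parent table.
gdf1 = groupby(df1, :a)
gdf2 = groupby(df2, :c)

# One result row per pair of groups.
results = NamedTuple{(:a, :c, :value), Tuple{String, String, Int}}[]
for sdf1 in gdf1, sdf2 in gdf2
    push!(results, (a = first(sdf1.a), c = first(sdf2.c), value = analyze(sdf1, sdf2)))
end

# Collect the rows into a DataFrame for further processing.
result_table = DataFrame(results)
```

Whether this is faster than the masked views depends on the number of groups and their sizes, so it is worth timing both on the real data.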