DataFrame groups as an argument of a function

bojusemo · July 27, 2021, 1:26pm

Hi guys,
Could you help me, please?
How can I pass the groups obtained from a DataFrame with the split-apply-combine strategy as an argument to a function called inside another function? I have tried three ways, but none of them works. See the three attempts below.
Thank you so much in advance.

# V1
using Chain
using DataFramesMeta
function function_1(df_1::AbstractDataFrame, df_2::AbstractDataFrame)
    @chain df_1 begin
        groupby(:Column_ID)
        @combine  function_2([:,:], df_2)
    end
end

# V2
function function_1(df_1::AbstractDataFrame, df_2::AbstractDataFrame)
  combine(groupby(df_1, :Column_ID) .=> function_2([:,:], df_2) .=> :Result)
end

# V3
function function_1(df_1::AbstractDataFrame, df_2::AbstractDataFrame)
  ID = groupby(df_1, :Column_ID)
  combine(mgs, analysis_mgs(ID, df_2))
end

pdeffebach · July 27, 2021, 2:45pm

The @combine strategy won’t work. There is no way to refer to multiple columns at the moment.

Note that the x .=> ... form only applies when x is a list of columns; you don’t pipe the GroupedDataFrame into a function directly.

It sounds like you want the anonymous function version of combine.

julia> begin 
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df1 begin 
               groupby(:a)
               combine(_) do sdf1
                   analyze(sdf1, df2)
               end
           end
       end
2×3 DataFrame
 Row │ a      x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      5
   2 │     2      2      5
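
For readers comparing the different forms of combine, here is a minimal sketch, assuming the same toy data as above. The do-block is just another way of passing a function as the first argument to combine, while the column => function form works on named columns rather than on the whole sub-dataframe. The output name b_sum is chosen here for illustration only.

```julia
using DataFrames

df1 = DataFrame(a = [1, 2], b = [3, 4])
df2 = DataFrame(c = [5, 6], d = [7, 8])

# Same toy analysis as in the reply above
analyze(sdf, other) = DataFrame(x = first(sdf.a), y = first(other.c))

gdf = groupby(df1, :a)

# Function-first form: each group arrives as a SubDataFrame
by_function = combine(sdf -> analyze(sdf, df2), gdf)

# do-block form: identical call, written with do syntax
by_do_block = combine(gdf) do sdf
    analyze(sdf, df2)
end

# Column-pair form: the function receives the column vector of each group
by_columns = combine(gdf, :b => sum => :b_sum)
```

The first two calls produce the same table; only the third is restricted to the columns named in the pair.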

bojusemo · July 29, 2021, 12:27pm

Thank you very much, @pdeffebach. That is what I was looking for.

bojusemo · August 3, 2021, 3:06am

@pdeffebach,
Is it possible to break out early based on the result of your function “analyze”? I am trying something like:

julia> begin 
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df1 begin 
               groupby(:a)
               combine(_) do sdf1
                   analyze(sdf1, df2) == 1 && break # Where 1 is the expected result
               end
           end
       end

Thank you

pdeffebach · August 3, 2021, 1:24pm

combine is kind of hard to reason about in that way, since it calls an anonymous function once per group and break has no loop to exit. Maybe you want to use a for loop directly, something like:

julia> begin 
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df1 begin 
               groupby(:a)
               for sdf1 in _
                   analyze(sdf1, df2) == 1 && break # Where 1 is the expected result
               end
           end
       end
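
As a minimal sketch of the same early-exit idea without Chain: the helper first_match and the scalar-returning analyze below are hypothetical, invented only so that the comparison with a target value is meaningful. Iterating a GroupedDataFrame yields one SubDataFrame per group, and return plays the role of break.

```julia
using DataFrames

df1 = DataFrame(a = [1, 1, 2, 2, 3], b = [3, 4, 5, 6, 7])
df2 = DataFrame(c = [5, 6], d = [7, 8])

# Hypothetical analysis that returns a scalar so it can be compared directly
analyze(sdf, other) = sum(sdf.b) + first(other.c)

# Hypothetical helper: key of the first group whose result equals `target`,
# or `nothing` if no group matches
function first_match(gdf, other, target)
    for sdf in gdf
        if analyze(sdf, other) == target
            return first(sdf.a)   # stop at the first matching group
        end
    end
    return nothing
end

found = first_match(groupby(df1, :a), df2, 16)
missing_case = first_match(groupby(df1, :a), df2, 99)
```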

bojusemo · August 7, 2021, 11:53am

Hi @pdeffebach,
My apologies for the late reply; I was working on another project.
Your solution works perfectly.
Thank you so much!
Boris

bojusemo · August 22, 2021, 4:05am

Hi @pdeffebach,

Could you guide me, please?
Can multithreading be applied to the “combine” function in this example?

Thank you

Boris

julia> begin 
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df1 begin 
               groupby(:a)
               combine(_) do sdf1
                   analyze(sdf1, df2)
               end
           end
       end

pdeffebach · August 22, 2021, 1:38pm

No, at least not transparently. Please open up a new thread to discuss this. I am not an expert on multithreading.
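
One manual workaround, offered only as a sketch and not as a feature of combine: loop over the groups with Threads.@threads from Julia's standard library and concatenate the results afterwards. This assumes that Julia was started with more than one thread, that analyze is safe to call concurrently, and that the group key is carried in the result by analyze itself, since nothing adds it automatically here.

```julia
using DataFrames

df1 = DataFrame(a = [1, 2], b = [3, 4])
df2 = DataFrame(c = [5, 6], d = [7, 8])

analyze(sdf, other) = DataFrame(x = first(sdf.a), y = first(other.c))

gdf = groupby(df1, :a)

# Materialize the group views before the threaded loop
groups = [gdf[i] for i in 1:length(gdf)]
results = Vector{DataFrame}(undef, length(groups))

# Each iteration writes to its own slot, so no locking is needed
Threads.@threads for i in eachindex(groups)
    results[i] = analyze(groups[i], df2)
end

combined = reduce(vcat, results)
```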

bojusemo · September 7, 2021, 1:22am

Hi @pdeffebach,

Thank you. I opened up a new thread to discuss it. Here is the link in case you want to follow it: Multi-threading to the “combine” function

Cheers

Boris

bojusemo · November 23, 2021, 2:15am

Hi @pdeffebach,

Could you help me, please? I want to put the combine call you showed me inside a for loop over every element of a column of the other dataframe. I’ve tried the code below, but it doesn’t work. Do you know how to do it?

Thanks

Boris

julia> begin
           using DataFrames, Chain
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           function analyze(df1, df2)
               DataFrame(x = first(df1.a), y = first(df2.c))
           end
           @chain df2 begin
               groupby(:d)
               for sdf2 in _
                   @chain df1 begin
                       groupby(:a)
                       combine(_) do sdf1
                           analyze(sdf1, sdf2)
                       end
                   end
               end
           end
       end

pdeffebach · November 23, 2021, 2:24am

A for loop in Julia does not return the value of its body; the loop expression itself evaluates to nothing, which is why that block returns nothing. It’s not clear what you want to do with this code, so it’s hard to help.
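
If the goal is simply to keep what each iteration produces, one option is to push the result of every combine call into a vector. This is a minimal sketch on the toy data from this thread; the name results is an assumption made for illustration.

```julia
using DataFrames

df1 = DataFrame(a = [1, 2], b = [3, 4])
df2 = DataFrame(c = [5, 6], d = [7, 8])

analyze(sdf1, sdf2) = DataFrame(x = first(sdf1.a), y = first(sdf2.c))

# Collect one combined table per group of df2
results = DataFrame[]
for sdf2 in groupby(df2, :d)
    res = combine(groupby(df1, :a)) do sdf1
        analyze(sdf1, sdf2)
    end
    push!(results, res)   # keep the value instead of discarding it
end
```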

bojusemo · November 23, 2021, 3:55am

Thank you for your reply @pdeffebach,

I aim to run two nested for loops (or their equivalent with combine) efficiently. My idea was to multithread the outer for loop and use combine inside it, because combine is faster.

I want to do something like in the pseudocode below

julia> begin 
           df1 = DataFrame(a = [1, 2], b = [3, 4])
           df2 = DataFrame(c = [5, 6], d = [7, 8])
           for x in df1[a]
                for y in df2[c]
            do function analyze(x, y)

aplavin · November 23, 2021, 10:01am

The “for-loop that returns results” is map in Julia. So, your latest pseudocode example should look like this:

map(df1[a]) do x
    map(df2[c]) do y
        do function analyze(x, y)
    end
end
# results in a vector-of-vectors

This makes it trivial to parallelize as well: import ThreadsX and replace the outer map with ThreadsX.map.
You can also combine these two loops/maps into a single one, which is often convenient:

map(Iterators.product(df1[a], df2[c])) do (x, y)
    do function analyze(x, y)
end
# results in a 2d array

nilshg · November 23, 2021, 11:35am

Just to add that df[a] is not supported syntax in DataFrames; it should be df[!, :a] or df.a.
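
Putting the two replies together, a runnable sketch of both map variants could look like this. The scalar analyze is a hypothetical stand-in, since the real function was never shown.

```julia
using DataFrames

df1 = DataFrame(a = [1, 2], b = [3, 4])
df2 = DataFrame(c = [5, 6], d = [7, 8])

# Hypothetical stand-in for the real analysis
analyze(x, y) = x + y

# Nested maps: one inner vector per element of df1.a
nested = map(df1.a) do x
    map(df2.c) do y
        analyze(x, y)
    end
end

# Single map over all pairs: rows follow df1.a, columns follow df2.c
grid = map(Iterators.product(df1.a, df2.c)) do (x, y)
    analyze(x, y)
end
```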

bojusemo · November 23, 2021, 3:09pm

Thank you so much @aplavin and @nilshg, I really appreciate your help.

However, I think I didn’t clearly show what I need. What I need is to apply a function `analyze` to two sub-dataframes, one from df1 and another from df2. Something like the pseudocode below.

julia> begin 
           df1 = DataFrame(a = [ID1, ID1, ID2, ID2], b = [1, 2, 3, 4])
           df2 = DataFrame(c = [ID3, ID3, ID4, ID4], d = [5, 6, 7, 8])
           for x in unique(df1.a)
                for y in unique(df2.c)
                  do analyze(df1[df1.a[x], :], df2[df2.c[y], :])

In this case, I expect to call analyze four times, once for each pair of IDs:

analyze(df1[df1.a .== ID1, :], df2[df2.c .== ID3, :])
analyze(df1[df1.a .== ID1, :], df2[df2.c .== ID4, :])
analyze(df1[df1.a .== ID2, :], df2[df2.c .== ID3, :])
analyze(df1[df1.a .== ID2, :], df2[df2.c .== ID4, :])

The challenge is to do it fast and to reduce allocations as much as possible, because df1 and df2 are huge.
That’s why I have used the Chain package as discussed above. Is it possible to do it with map?

Thank you again for any help.

Cheers

Boris

nilshg · November 23, 2021, 3:18pm

You probably just want to take a view then? I.e. something like

result_vector = ResultType[]

for (x, y) in Iterators.product(unique(df1.a), unique(df2.c))
    df1_subset = @view df1[df1.a .== x, :]
    df2_subset = @view df2[df2.c .== y, :]
    push!(result_vector, analyze(df1_subset, df2_subset))
end
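
A possible variation, sketched under the assumption that each ID forms one group: groupby computes the row indices for every ID once, and iterating the resulting GroupedDataFrame yields SubDataFrame views, so the boolean masks do not have to be rebuilt for every pair. The string IDs and the scalar analyze below are hypothetical stand-ins.

```julia
using DataFrames

df1 = DataFrame(a = ["ID1", "ID1", "ID2", "ID2"], b = [1, 2, 3, 4])
df2 = DataFrame(c = ["ID3", "ID3", "ID4", "ID4"], d = [5, 6, 7, 8])

# Hypothetical stand-in for the real analysis
analyze(sdf1, sdf2) = sum(sdf1.b) * sum(sdf2.d)

gdf1 = groupby(df1, :a)
gdf2 = groupby(df2, :c)

# Visit every pair of groups; sdf1 and sdf2 are views, not copies
results = NamedTuple[]
for sdf1 in gdf1, sdf2 in gdf2
    push!(results, (a = first(sdf1.a), c = first(sdf2.c),
                    value = analyze(sdf1, sdf2)))
end
```

Whether this is faster than the mask-based version depends on the number of groups and the cost of analyze, so it is worth benchmarking on the real data.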
