How to efficiently apply user-defined functions against grouped dataframes in `DataFrames.jl`?

I am translating about 6k lines of R to Julia for the sake of speed, and everything has been faster, significantly faster, in Julia except for this specific task. I suspect what I have done below is not the right way to apply this logic, because the equivalent R code is easily 10-12 times faster than my Julia version.

Main question

What is the right way to apply user-defined functions inside of `filter` against grouped data frames in DataFrames.jl?

Task

Relevant objects

  1. `search_space` - a vector (sample data attached below)
  2. `Gdf` - a grouped dataframe (sample data attached below)

Pseudocode

  1. Take a grouped dataframe
  2. Keep only the groups where all elements of column `col_two` are in `search_space`
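To make the intent concrete, here is a minimal sketch with a made-up toy table (the real data is in the attached sample file; all names and values below are illustrative only):

```julia
using DataFrames

# Toy data: group "a" lies entirely inside the search space, group "b" does not.
toy_df = DataFrame(col_one = ["a", "a", "b", "b"],
                   col_two = ["6.6_1", "6.5_1", "6.6_1", "99.9_9"])
toy_search_space = ["6.6_1", "6.5_1"]

toy_gdf = groupby(toy_df, :col_one)

# Keep only the groups whose col_two values all appear in toy_search_space.
kept = filter(sdf -> all(in(toy_search_space), sdf.col_two), toy_gdf, ungroup = true)
# kept contains only the two rows of group "a"
```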

I have tried the following combinations

function user_defined_bool_design(my_table::SubDataFrame, search_space::Vector{String})
    # `all` already returns a Bool, so no if/else is needed.
    # Note: `in` against a Vector is a linear scan for every element of col_two.
    return all(x -> in(x, search_space), my_table.col_two)
end

@time begin
    filter(sdf -> user_defined_bool_design(sdf, search_space), Gdf, ungroup=true)
end

The outcome of the above

5.438332 seconds (95.41 k allocations: 12.553 MiB, 1.09% compilation time)

function waveform_filter_list_based(my_table::GroupedDataFrame, search_space::Vector{String})

    hold = SubDataFrame[]

    for group in my_table
        # Linear scan of search_space for every element of col_two
        if all(x -> in(x, search_space), group.col_two)
            push!(hold, group)
        end
    end

    return vcat(hold...)
end

@time begin
    waveform_filter_list_based(Gdf, search_space)
end

The outcome of the above

5.353175 seconds (50.79 k allocations: 4.650 MiB)

The sample data I have attached is significantly scaled down, so the timings will not match exactly. I am working on a Linux machine with 16 threads and 32 GB of RAM.

Here is a sample `search_space` (the elements are strings, so they need quotes):

search_space = [
    "6.6_1", "6.5_1", "12.6_1", "12.5_1", "3.8_1", "12.3_1", "12.7_1", "20.9_1", "12.2_1", "12.4_1",
    "21.4_1", "12.8_1", "11.9_1", "12.0_1", "21.8_1", "12.1_1", "12.9_1", "20.8_1", "21.3_1", "8.9_1",
    "21.9_1", "21.5_1", "21.7_1", "22.1_1", "21.6_1", "22.0_1", "11.8_1", "6.7_1", "13.0_1", "21.1_1",
    "21.0_1", "10.2_1", "13.1_1", "21.2_1", "13.2_1", "18.1_1", "20.7_1", "8.1_1", "22.2_1", "8.6_1",
    "11.7_1", "8.5_1", "8.0_1", "10.5_1", "22.4_1", "10.6_1", "3.8_2", "22.1_2", "21.1_2", "8.7_2",
    "10.3_2", "12.7_2", "13.1_2", "11.9_2", "21.2_2", "8.8_2", "20.8_2", "11.7_2", "10.1_2", "20.7_2",
    "21.4_2", "11.6_2", "12.0_2", "21.6_2", "21.0_2", "21.7_2", "20.6_2", "10.2_2", "8.1_2", "22.0_2",
    "21.3_2", "10.4_2", "10.0_2", "1.4_2", "21.8_2", "20.9_2", "11.5_2", "13.2_2", "21.5_2", "22.2_2",
    "15.0_2", "1.2_2", "13.3_2", "22.3_2", "15.4_2", "17.9_2", "8.0_2", "8.5_2", "10.7_2", "13.4_2",
    "22.9_2", "21.9_2", "10.5_2", "15.7_2", "18.0_2", "22.5_2", "15.5_2", "18.2_2", "15.6_2", "8.9_2"
]

Here is a sample table on pastebin: sample_dataframe.tsv - Pastebin.com

Please groupby col_one and work with col_two.

Is filter() just generally slower?

Have you tried making search_space a Set?

Note that this closure repeatedly accesses the non-const global `search_space`, which is slow; that is why you see so many allocations.
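To illustrate the lookup-cost difference on its own (a minimal sketch with made-up data and sizes; no DataFrames needed):

```julia
# Membership lookup cost: `in` on a Vector scans linearly, `in` on a Set hashes.
haystack_vec = [string(i) for i in 1:10_000]
haystack_set = Set(haystack_vec)          # one-time O(n) construction

needles = [string(i) for i in 1:1_000]

# Both calls give the same answer, but their per-needle cost differs.
vec_result = all(x -> in(x, haystack_vec), needles)  # O(length(haystack)) per needle
set_result = all(x -> in(x, haystack_set), needles)  # O(1) amortized per needle
```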

Can you please report the timing of:

function f(Gdf, search_space)
    sss = Set(search_space)
    [all(x -> in(x, sss), group.col_two) for group in Gdf]
end
@time condition = f(Gdf, search_space)
@time Gdf[condition]

since your sample data is too small to benchmark reliably (I need to see whether the bottleneck is checking the condition or subsetting `Gdf`).

The outcome of this was

0.134730 seconds (211.57 k allocations: 19.855 MiB, 91.86% compilation time)
0.017394 seconds (16.49 k allocations: 3.565 MiB, 95.68% compilation time)

FYI, you might be interested in this

just trivially doing this:

function user_defined_function_set_based(my_table::SubDataFrame, search_space::Set{String})
    # Set membership is an O(1) hash lookup instead of a linear scan.
    return all(x -> in(x, search_space), my_table.col_two)
end

and then

@time begin 
    filter(sdf -> user_defined_function_set_based(sdf, search_space_set), Gdf, ungroup=true)
end

is 0.021326 seconds (61.83 k allocations: 10.356 MiB, 74.50% compilation time). This is on data of the same size.

Thank you! That was it. Changing list operations to set operations made things orders of magnitude faster!

Yes - the timing confirmed that the cost was not in DataFrames.jl itself but in the membership lookup.
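A closing sketch on the non-const-global point raised above (made-up names and data; the general Julia rule is that the type of a non-const global is unknown at compile time, so closures over it are dynamically dispatched):

```julia
# A non-const global: its type can change, so access to it is slow in hot loops.
search_space_set = Set(string.(1:10_000))

# Captures the non-const global directly; works, but type-unstable.
slow() = count(x -> in(x, search_space_set), string.(1:1_000))

# Passing the Set as a typed argument keeps the code type-stable.
fast(sss::Set{String}) = count(x -> in(x, sss), string.(1:1_000))
```

Either passing the collection as an argument (as the answers above do) or declaring it `const` avoids the penalty.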