How to efficiently apply user-defined functions against grouped dataframes in `DataFrames.jl`?

I am translating about 6k lines of R to Julia for the sake of speed and everything has been faster, significantly faster in Julia except for this specific task. I suspect what I have done below may not be the right way to apply this logic because this specific logic is easily 10-12 times faster than Julia in the old R-lang.

Main question

What is the right way to apply user defined functions to be used inside of filter against grouped data frames in DataFrames.jl?

Task

Relevant objects

  1. Search_space - A vector (Have attached sample data below)
  2. Gdf = A grouped dataframe (Have attached sample data below)

Pseudocode

  1. Take a grouped dataframe
  2. Filter for groups where all the elements of column col_two is in search_space

I have tried the following combinations

function user_defined_bool_design(my_table::SubDataFrame, search_space::Array{String, 1})

    if all(x -> in(x, search_space), my_table.col_two)
        return true
    else
        return false
    end
    
end

@time begin
    filter(sdf -> user_defined_bool_design(sdf,search_space),Gdf,ungroup=true)
end

The outcome of the above

5.438332 seconds (95.41 k allocations: 12.553 MiB, 1.09% compilation time)

function waveform_filter_list_based(my_table::GroupedDataFrame, search_space::Array{String, 1})

    hold = SubDataFrame[]

    for group in my_table

        if all(x -> in(x, search_space), group.col_two)
            push!(hold, group)
        end 
    end

    return vcat(hold...)
end

@time begin
    waveform_filter_list_based(Gdf,search_space)
end

The outcome of the above

5.353175 seconds (50.79 k allocations: 4.650 MiB)

The sample data that I have attached is significantly scaled down so the numbers may not be the same. I am working on a linux machine with 16 threads and 32 gigs of RAM.

Here is a sample search_space

search_space = [
    6.6_1,6.5_1,12.6_1,12.5_1,3.8_1,12.3_1,12.7_1,20.9_1,12.2_1,12.4_1,21.4_1,12.8_1,11.9_1,12.0_1,21.8_1,12.1_1,12.9_1,20.8_1,21.3_1,8.9_1,21.9_1,21.5_1,21.7_1,22.1_1,21.6_1,22.0_1,11.8_1,6.7_1,13.0_1,21.1_1,21.0_1,10.2_1,13.1_1,21.2_1,13.2_1,18.1_1,20.7_1,8.1_1,22.2_1,8.6_1,11.7_1,8.5_1,8.0_1,10.5_1,22.4_1,10.6_1,3.8_2,22.1_2,21.1_2,8.7_2,10.3_2,12.7_2,13.1_2,11.9_2,21.2_2,8.8_2,20.8_2,11.7_2,10.1_2,20.7_2,21.4_2,11.6_2,12.0_2,21.6_2,21.0_2,21.7_2,20.6_2,10.2_2,8.1_2,22.0_2,21.3_2,10.4_2,10.0_2,1.4_2,21.8_2,20.9_2,11.5_2,13.2_2,21.5_2,22.2_2,15.0_2,1.2_2,13.3_2,22.3_2,15.4_2,17.9_2,8.0_2,8.5_2,10.7_2,13.4_2,22.9_2,21.9_2,10.5_2,15.7_2,18.0_2,22.5_2,15.5_2,18.2_2,15.6_2,8.9_2
]

Here is a sample table on pastebin: sample_dataframe.tsv - Pastebin.com

Please groupby col_one and work with col_two.

Is filter() just generally slower?

Have you tried making search_space a Set?

2 Likes

Note that this causes repeated access to the global search_space and thus is slow. That’s why you see so many allocations.

1 Like

Can you please report the timing of:

function f(Gdf, search_space)
    sss = Set(search_space)
    [all(x -> in(x, sss), group.col_two) for group in Gdf]
end
@time condition  = f(Gdf, search_space)
@time Gdf[condition]

as on your data it is hard to benchmark since it is too small (I need to understand if the bottleneck is checking the condition or subsetting of Gdf).

1 Like

The outcome of this was

0.134730 seconds (211.57 k allocations: 19.855 MiB, 91.86% compilation time)
0.017394 seconds (16.49 k allocations: 3.565 MiB, 95.68% compilation time)

FYI, you might be interested in this

just trivially doing this:

function user_defined_function_set_based(my_table::SubDataFrame, search_space::Set{String})

    if all(x -> in(x, search_space), my_table.col_two)
        return true
    else
        return false
    end
    
end

and then

@time begin 
    filter(sdf -> user_defined_function_set_based(sdf,search_space_set),Gdf,ungroup=true)
end

is 0.021326 seconds (61.83 k allocations: 10.356 MiB, 74.50% compilation time). This is on data of the same size.

Thank you! That was it. Changing list operations to set operations made things orders of magnitude faster!

2 Likes

Yes - the timing confirmed that the cost was not in DataFrames.jl but lookup cost.

2 Likes