How to efficiently apply user-defined functions against grouped dataframes in `DataFrames.jl`?

I am translating about 6k lines of R to Julia for the sake of speed, and everything has been faster, significantly faster, in Julia except for this specific task. I suspect what I have done below is not the right way to apply this logic, because the equivalent R code is easily 10-12 times faster than my Julia version.

Main question

What is the right way to apply user-defined functions inside of `filter` against grouped data frames in DataFrames.jl?

Task

Relevant objects

  1. `search_space` - a vector (sample data attached below)
  2. `Gdf` - a grouped dataframe (sample data attached below)

Pseudocode

  1. Take a grouped dataframe
  2. Keep only the groups where all elements of column `col_two` are in `search_space`
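To make the intent concrete, here is a minimal sketch with a made-up toy table (the real data is in the attached sample file; all names and values below are illustrative only):

```julia
using DataFrames

# Toy data: group "a" lies entirely inside the search space, group "b" does not.
toy_df = DataFrame(col_one = ["a", "a", "b", "b"],
                   col_two = ["6.6_1", "6.5_1", "6.6_1", "99.9_9"])
toy_search_space = ["6.6_1", "6.5_1"]

toy_gdf = groupby(toy_df, :col_one)

# Keep only the groups whose col_two values all appear in toy_search_space.
kept = filter(sdf -> all(in(toy_search_space), sdf.col_two), toy_gdf, ungroup = true)
# kept contains only the two rows of group "a"
```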

I have tried the following combinations

function user_defined_bool_design(my_table::SubDataFrame, search_space::Vector{String})
    # `all` already returns a Bool, so no if/else is needed.
    # Note: `in` against a Vector is a linear scan for every element of col_two.
    return all(x -> in(x, search_space), my_table.col_two)
end

@time begin
    filter(sdf -> user_defined_bool_design(sdf, search_space), Gdf, ungroup=true)
end

The outcome of the above

5.438332 seconds (95.41 k allocations: 12.553 MiB, 1.09% compilation time)

function waveform_filter_list_based(my_table::GroupedDataFrame, search_space::Vector{String})

    hold = SubDataFrame[]

    for group in my_table
        # Linear scan of search_space for every element of col_two
        if all(x -> in(x, search_space), group.col_two)
            push!(hold, group)
        end
    end

    return vcat(hold...)
end

@time begin
    waveform_filter_list_based(Gdf, search_space)
end

The outcome of the above

5.353175 seconds (50.79 k allocations: 4.650 MiB)

The sample data I have attached is significantly scaled down, so the timings will not match exactly. I am working on a Linux machine with 16 threads and 32 GB of RAM.

Here is a sample `search_space` (the elements are strings, so they need quotes):

search_space = [
    "6.6_1", "6.5_1", "12.6_1", "12.5_1", "3.8_1", "12.3_1", "12.7_1", "20.9_1", "12.2_1", "12.4_1",
    "21.4_1", "12.8_1", "11.9_1", "12.0_1", "21.8_1", "12.1_1", "12.9_1", "20.8_1", "21.3_1", "8.9_1",
    "21.9_1", "21.5_1", "21.7_1", "22.1_1", "21.6_1", "22.0_1", "11.8_1", "6.7_1", "13.0_1", "21.1_1",
    "21.0_1", "10.2_1", "13.1_1", "21.2_1", "13.2_1", "18.1_1", "20.7_1", "8.1_1", "22.2_1", "8.6_1",
    "11.7_1", "8.5_1", "8.0_1", "10.5_1", "22.4_1", "10.6_1", "3.8_2", "22.1_2", "21.1_2", "8.7_2",
    "10.3_2", "12.7_2", "13.1_2", "11.9_2", "21.2_2", "8.8_2", "20.8_2", "11.7_2", "10.1_2", "20.7_2",
    "21.4_2", "11.6_2", "12.0_2", "21.6_2", "21.0_2", "21.7_2", "20.6_2", "10.2_2", "8.1_2", "22.0_2",
    "21.3_2", "10.4_2", "10.0_2", "1.4_2", "21.8_2", "20.9_2", "11.5_2", "13.2_2", "21.5_2", "22.2_2",
    "15.0_2", "1.2_2", "13.3_2", "22.3_2", "15.4_2", "17.9_2", "8.0_2", "8.5_2", "10.7_2", "13.4_2",
    "22.9_2", "21.9_2", "10.5_2", "15.7_2", "18.0_2", "22.5_2", "15.5_2", "18.2_2", "15.6_2", "8.9_2"
]

Here is a sample table on pastebin: sample_dataframe.tsv - Pastebin.com

Please groupby col_one and work with col_two.

Is filter() just generally slower?

Have you tried making search_space a Set?

Note that this closure repeatedly accesses the non-const global `search_space`, which is slow; that is why you see so many allocations.
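To illustrate the lookup-cost difference on its own (a minimal sketch with made-up data and sizes; no DataFrames needed):

```julia
# Membership lookup cost: `in` on a Vector scans linearly, `in` on a Set hashes.
haystack_vec = [string(i) for i in 1:10_000]
haystack_set = Set(haystack_vec)          # one-time O(n) construction

needles = [string(i) for i in 1:1_000]

# Both calls give the same answer, but their per-needle cost differs.
vec_result = all(x -> in(x, haystack_vec), needles)  # O(length(haystack)) per needle
set_result = all(x -> in(x, haystack_set), needles)  # O(1) amortized per needle
```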

Can you please report the timing of:

function f(Gdf, search_space)
    sss = Set(search_space)
    [all(x -> in(x, sss), group.col_two) for group in Gdf]
end
@time condition = f(Gdf, search_space)
@time Gdf[condition]

since your sample data is too small to benchmark reliably (I need to see whether the bottleneck is checking the condition or subsetting `Gdf`).

The outcome of this was

0.134730 seconds (211.57 k allocations: 19.855 MiB, 91.86% compilation time)
0.017394 seconds (16.49 k allocations: 3.565 MiB, 95.68% compilation time)

FYI, you might be interested in this

just trivially doing this:

function user_defined_function_set_based(my_table::SubDataFrame, search_space::Set{String})
    # Set membership is an O(1) hash lookup instead of a linear scan.
    return all(x -> in(x, search_space), my_table.col_two)
end

and then

@time begin 
    filter(sdf -> user_defined_function_set_based(sdf, search_space_set), Gdf, ungroup=true)
end

is 0.021326 seconds (61.83 k allocations: 10.356 MiB, 74.50% compilation time). This is on data of the same size.

Thank you! That was it. Changing list operations to set operations made things orders of magnitude faster!

Yes - the timing confirmed that the cost was not in DataFrames.jl itself but in the membership lookup.
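A closing sketch on the non-const-global point raised above (made-up names and data; the general Julia rule is that the type of a non-const global is unknown at compile time, so closures over it are dynamically dispatched):

```julia
# A non-const global: its type can change, so access to it is slow in hot loops.
search_space_set = Set(string.(1:10_000))

# Captures the non-const global directly; works, but type-unstable.
slow() = count(x -> in(x, search_space_set), string.(1:1_000))

# Passing the Set as a typed argument keeps the code type-stable.
fast(sss::Set{String}) = count(x -> in(x, sss), string.(1:1_000))
```

Either passing the collection as an argument (as the answers above do) or declaring it `const` avoids the penalty.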