I am translating about 6k lines of R to Julia for the sake of speed and everything has been faster, significantly faster in Julia except for this specific task. I suspect what I have done below may not be the right way to apply this logic because this specific logic is easily 10-12 times faster than Julia in the old R-lang.
Main question
What is the right way to apply user defined functions to be used inside of filter
against grouped data frames in DataFrames.jl
?
Task
Relevant objects
- Search_space - A vector (Have attached sample data below)
- Gdf = A grouped dataframe (Have attached sample data below)
Pseudocode
- Take a grouped dataframe
- Filter for groups where all the elements of column
col_two
is in search_space
I have tried the following combinations
function user_defined_bool_design(my_table::SubDataFrame, search_space::Array{String, 1})
if all(x -> in(x, search_space), my_table.col_two)
return true
else
return false
end
end
@time begin
filter(sdf -> user_defined_bool_design(sdf,search_space),Gdf,ungroup=true)
end
The outcome of the above
5.438332 seconds (95.41 k allocations: 12.553 MiB, 1.09% compilation time)
function waveform_filter_list_based(my_table::GroupedDataFrame, search_space::Array{String, 1})
hold = SubDataFrame[]
for group in my_table
if all(x -> in(x, search_space), group.col_two)
push!(hold, group)
end
end
return vcat(hold...)
end
@time begin
waveform_filter_list_based(Gdf,search_space)
end
The outcome of the above
5.353175 seconds (50.79 k allocations: 4.650 MiB)
The sample data that I have attached is significantly scaled down so the numbers may not be the same. I am working on a linux machine with 16 threads and 32 gigs of RAM.
Here is a sample search_space
search_space = [
6.6_1,6.5_1,12.6_1,12.5_1,3.8_1,12.3_1,12.7_1,20.9_1,12.2_1,12.4_1,21.4_1,12.8_1,11.9_1,12.0_1,21.8_1,12.1_1,12.9_1,20.8_1,21.3_1,8.9_1,21.9_1,21.5_1,21.7_1,22.1_1,21.6_1,22.0_1,11.8_1,6.7_1,13.0_1,21.1_1,21.0_1,10.2_1,13.1_1,21.2_1,13.2_1,18.1_1,20.7_1,8.1_1,22.2_1,8.6_1,11.7_1,8.5_1,8.0_1,10.5_1,22.4_1,10.6_1,3.8_2,22.1_2,21.1_2,8.7_2,10.3_2,12.7_2,13.1_2,11.9_2,21.2_2,8.8_2,20.8_2,11.7_2,10.1_2,20.7_2,21.4_2,11.6_2,12.0_2,21.6_2,21.0_2,21.7_2,20.6_2,10.2_2,8.1_2,22.0_2,21.3_2,10.4_2,10.0_2,1.4_2,21.8_2,20.9_2,11.5_2,13.2_2,21.5_2,22.2_2,15.0_2,1.2_2,13.3_2,22.3_2,15.4_2,17.9_2,8.0_2,8.5_2,10.7_2,13.4_2,22.9_2,21.9_2,10.5_2,15.7_2,18.0_2,22.5_2,15.5_2,18.2_2,15.6_2,8.9_2
]
Here is a sample table on pastebin: sample_dataframe.tsv - Pastebin.com
Please groupby col_one and work with col_two.
Is filter()
just generally slower?