Overwrite the subdataframes made with a for loop

Hi Guys,

Could someone help me, please?

I have a function with two dataframes (df1 and df2) as input that filter df1 based on a column that contains unique identifiers (IDs) and creates a subdataframe, makes a series of analyses and return 0 or 1 per each ID with a for loop. My issue is that the execution time and allocations increase exponentially with more IDs. It looks like the problem is that all subdataframes created with the filter are saved when they should be overwritten. How can I overwrite the subdataframes made in the function?

The code is in the topic Increase performance in string comparison with sub-dataframes or sub-arrays - #6 by bojusemo , and correspond to the sections “Filter a pre-allocating the array a” and “Analysis of potential a”

Can you please provide a MWE (emphasis on the minimum). The code you linked to is very long.

We don’t need to see the actual performance slowdown to help you, as long as you show an example we can copy and paste, which runs, unlike the code snippet you linked to, which relies on files in your computer.

1 Like

Hi @pdeffebach,

Thank you for your message. See below the example code. Please see in the last section, the difference in terms of allocations and time when the small dataframe is used vs the large one.

using DataFrames

# Dataframes.
small_df = DataFrame(ID = ["ID1","ID1","ID1","ID1","ID2","ID2","ID2","ID2","ID3","ID3","ID3","ID3","ID4","ID4","ID4","ID4"], Count = repeat(1:4, 4))
large_df = DataFrame(ID = ["ID1","ID1","ID1","ID1","ID2","ID2","ID2","ID2","ID3","ID3","ID3","ID3","ID4","ID4","ID4","ID4","ID5","ID5","ID5","ID5","ID6","ID6","ID6","ID6","ID7","ID7","ID7","ID7","ID8","ID8","ID8","ID8"], Count = repeat(1:4, 8))

# Filter dataframe based on the ID, i.e. obtain a subdataframe
function filter_function(df::DataFrame, ID::Vector{String})
  df_ID = df[ ( df.ID .== ID ) , :Count]
  df_ID
end

# Do something in each subdataframe, for example sum the elements of a column.
function analysis_ID(df::DataFrame)
  number_of_IDs::Int = length(unique(df[:, :ID]))

  for ID_number::Int = 1:number_of_IDs
    ID::Vector{String} = unique(df[:, :ID])[[ID_number]]
    sum(filter_function(df, ID)) 
  end
end


## Check performance
using BenchmarkTools
using Profile
using ProfileView

# df1_small
@time analysis_ID(small_df)
@benchmark analysis_ID(small_df)
@profview analysis_ID(small_df)

# df1_large
@time analysis_ID(large_df)
@benchmark analysis_ID(large_df)
@profview analysis_ID(large_df)

This code has a lot of unnecessary allocations and could be simplified.

For instance, you write unique(df[:, :ID]) many times, when it never changes. Why do it over and over again?

Why do you index with a Vector, i.e. [[ID_Number]], this creates a 1-element vector instead of just using the String directly.

You use the term SubDataFrame, but note that df_ID = df[ ( df.ID .== ID ) , :Count] does not create a SubDataFrame. It creates a new data frame that is a subset of the first one.

You have many unnecessary type annotations, like ID::Vector{String}. These do not help with performance. They might make your code more readable, but they also make your code more brittle if you end up with, say, AbstractString types that are not String.

Here is a better version of the code above

# Filter dataframe based on the ID, i.e. obtain a subdataframe
function filter_function(df::DataFrame, ID)
  df.Count[df.ID .== ID]
end

# Do something in each subdataframe, for example sum the elements of a column.
function analysis_ID(df::DataFrame)
  u = unique(df[!, :ID])

  for ui in u
    sum(filter_function(df, ui)) 
  end
end

Though think it’s best thinking of this as a split-apply-combine problem. We have canned functions to do this easily

using DataFramesMeta

 function analysis_ID_better(df::AbstractDataFrame)
    @chain df begin 
        groupby(:ID)
        @combine :Cound_sum = sum(:Count)
    end
end

This is slower, though, and I have to pin down why.

1 Like

Update, the version with combine actually is faster. We were just working with a very small data frame. See below

actually_large_df = DataFrame(ID = rand(large_df.ID, 100_000), Count = rand(large_df.Count, 100_000));

@btime analysis_ID_better(actually_large_df);
  2.476 ms (258 allocations: 2.54 MiB)

@btime analysis_ID(actually_large_df);
  10.774 ms (136 allocations: 919.75 KiB)
1 Like

Hi @pdeffebach,

Thank you so much for your feedback. I cannot check it in detail at this moment because I am in Australia and it is late, but I’ll check it tomorrow and reply to your comments.

Cheers

Boris

Hi @pdeffebach,

May you guide me, please? When I use your function (below), it does not compile, I have this error:

LoadError: ArgumentError: This path should not be reached
in expression starting at C:\Users…\path\script.jl:34
fun_to_vec(ex::Expr; nolhs::Bool, gensym_names::Bool, wrap_byrow::Bool) at parsing.jl:312
(::DataFramesMeta.var"#fun_to_vec##kw")(::NamedTuple{(:gensym_names, :nolhs, :wrap_byrow), Tuple{Bool, Bool, Bool}}, ::typeof(DataFramesMeta.fun_to_vec), ex::Expr) at parsing.jl:193
(::DataFramesMeta.var"#37#38"{Bool})(ex::Expr) at none:0
iterate(::Base.Generator{Vector{Any}, DataFramesMeta.var"#37#38"{Bool}}) at generator.jl:47
combine_helper(x::Symbol, args::Expr; deprecation_warning::Bool) at macros.jl:1222
combine_helper(x::Symbol, args::Expr) at macros.jl:1198
var"@combine"(source::LineNumberNode, module::Module, x::Any, args::Vararg{Any, N} where N) at macros.jl:1307
eval at boot.jl:360 [inlined]

Funciton:

Sorry! DataFramesMeta.jl recently made the change to have :x on the LHS instead of x. This will work

using DataFramesMeta

 function analysis_ID_better(df::AbstractDataFrame)
    @chain df begin 
        groupby(:ID)
        @combine Count_sum = sum(:Count)
    end
end

Hi @pdeffebach ,

Thank you, your last functions compilates. However, it is slower than the previous (see below). Do you know how can it be improved?

@btime analysis_ID(small_df)
11.800 μs (71 allocations: 6.17 KiB)
@btime analysis_ID_better(small_df)
272.300 μs (254 allocations: 18.81 KiB)

As I showed above, the combine solution scales better to large data frames, so it’s faster “where it counts”.

That said, the pre and post-processing logic of groupby and combine is somewhat complicated, and the cost of this totally swamps the actual cost of the sum for the very small data set you are working with.

I guess my question is, why are you interested in the speed of this function on a very small dataset? Do you have thousands (or millions) of small datasets you need to apply this function on?

1 Like

Hi @pdeffebach,
Thank you. Yes, it is working faster in your large dataset. I am actually working with a large dataset as well. So, I’ll apply this to the logic of my code.
Thank so much again