Using a DataFrame to calculate another column in a separate DataFrame

I am using a DataFrame invoices that looks like this:

(screenshot of the invoices table)

to count the number of invoices that have happened for a given contract as of a given date. The functions I’m using to do this are:

using DataFrames, DataFramesMeta, Dates, Lazy

function num_invoices_setup(dt::Date, cak_id::String)
    (@> begin
        invoices::DataFrame
        @where :contract_id .== cak_id
        @where :due_date .<= dt
    end)::DataFrame # begin
end # function



function num_invoices(dt::Date, cak_id::String, mod_post_date::Union{Date, Missing}, mod_effective_date::Union{Date, Missing})
    df::DataFrame = num_invoices_setup(dt, cak_id)
    if mod_effective_date |> ismissing
        return (nrow(df))::Int64
    else
        return (filter(r -> r.due_date > mod_effective_date, df) |> nrow)::Int64
    end # if
end # function

Everything works but takes a really long time to complete when using num_invoices to calculate a column in a DataFrame that is ~1.2M rows.
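Roughly, the setup looks like this on a toy version of the data (the macros are replaced with plain indexing, and the contracts table and its column names are placeholders I made up — the real table has ~1.2M rows):

```julia
using DataFrames, Dates

# Toy stand-in for the global `invoices` table from the post.
invoices = DataFrame(
    contract_id = ["A", "A", "B"],
    due_date    = [Date(2020, 1, 10), Date(2020, 2, 10), Date(2020, 1, 20)],
)

# Same logic as num_invoices, written without the DataFramesMeta macros.
# (mod_post_date is unused, as in the original signature.)
function num_invoices(dt, cak_id, mod_post_date, mod_effective_date)
    df = invoices[(invoices.contract_id .== cak_id) .& (invoices.due_date .<= dt), :]
    ismissing(mod_effective_date) && return nrow(df)
    return count(>(mod_effective_date), df.due_date)
end

# Placeholder for the ~1.2M-row table; column names are assumptions.
contracts = DataFrame(
    dt          = [Date(2020, 3, 1), Date(2020, 1, 15), Date(2020, 3, 1)],
    contract_id = ["A", "A", "B"],
    mod_post_date      = [missing, missing, missing],
    mod_effective_date = [missing, Date(2020, 1, 1), missing],
)

# One num_invoices call per row -- this is the slow part at 1.2M rows.
transform!(contracts,
    [:dt, :contract_id, :mod_post_date, :mod_effective_date] =>
        ByRow(num_invoices) => :n_invoices)
```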

Here is the output of @code_warntype for num_invoices:

(screenshot of the @code_warntype output)

So right now it is taking about 2 hours for num_invoices. Any suggestions for how to speed this up?

Welcome to the forum! Glad you are using DataFrames.

2 hours is a very long time for this kind of calculation. This should be done in a few seconds at most using combine. Using the latest version of DataFrames you can do

combine(groupby(df, [:contract_id, :due_date]), nrow)
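On a toy table with the column names from your post, that looks like:

```julia
using DataFrames, Dates

df = DataFrame(
    contract_id = ["A", "A", "A", "B"],
    due_date    = [Date(2020, 1, 1), Date(2020, 1, 1), Date(2020, 2, 1), Date(2020, 1, 1)],
)

# One row per (contract_id, due_date) group, with an `nrow` count column.
counts = combine(groupby(df, [:contract_id, :due_date]), nrow)
```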

As a side note, you really don’t need the type annotations you are using. They don’t help with performance and will make things more fragile than they need to be.


Thanks for the reply. Sorry, I think I explained my issue poorly: I need to call num_invoices as the function in transform!, so the two hours come from ~0.007 seconds per call times the ~1.25 million rows that transform! updates. I agree that your implementation is cleaner and more concise, but I am struggling to find a way to speed this up beyond that…

Can you maybe make an example DataFrame with some random data that roughly matches what you’re working on and the output you’re expecting?

I agree with Peter that it feels like it should be possible to do this a lot faster, particularly since your code only seems to count rows of subgroups of your DataFrame, but it’s currently not clear to me what exactly you’re trying to do here.

It sounds a little like you’re doing a “join” and should do the join all at once; see Joins · DataFrames.jl in the manual.

Sorry for the delayed response - couldn’t get to a REPL for a few days.
A join is exactly what I am trying to do… I probably should have thought of that…

Just out of curiosity, is there a way to do something akin to SQL left join tbl1 on tbl1.column1 = tbl2.column2 and tbl1.column3 < tbl2.column4? The less-than condition doesn’t seem super intuitive from reading the docs, but I imagine it’s more likely a me problem.

I believe what you want is a join on column1=column2 followed by a filter on column3 < column4
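A minimal sketch of that idea, with made-up tables whose column names match the SQL above:

```julia
using DataFrames, Dates

tbl1 = DataFrame(column1 = ["A", "B"],
                 column3 = [Date(2020, 1, 1), Date(2020, 6, 1)])
tbl2 = DataFrame(column2 = ["A", "A", "B"],
                 column4 = [Date(2020, 3, 1), Date(2019, 12, 1), Date(2020, 7, 1)])

# Equi-join on column1 = column2, then filter on column3 < column4.
joined = leftjoin(tbl1, tbl2, on = :column1 => :column2)
result = filter(r -> !ismissing(r.column4) && r.column3 < r.column4, joined)
```

Note that filtering after the join drops left-table rows without a match, which behaves like an inner join; in SQL, putting the less-than condition inside ON would keep unmatched left rows with NULLs.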

If you have some complicated SQL you’d like to carry out on this data, you could always try slamming the DataFrames into SQLite and then running an SQL query against it.
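A quick sketch of that route with SQLite.jl (toy tables, integer stand-ins for the dates; assumes SQLite.jl and DBInterface.jl are installed):

```julia
using DataFrames, SQLite, DBInterface

db = SQLite.DB()  # in-memory database

# Toy tables matching the column names from the SQL question above.
tbl1 = DataFrame(column1 = ["A", "B"], column3 = [1, 6])
tbl2 = DataFrame(column2 = ["A", "B"], column4 = [3, 2])

SQLite.load!(tbl1, db, "tbl1")
SQLite.load!(tbl2, db, "tbl2")

# The non-equi condition goes straight into the ON clause; unmatched
# left rows come back with NULL (missing) in tbl2's columns.
result = DBInterface.execute(db, """
    SELECT * FROM tbl1
    LEFT JOIN tbl2 ON tbl1.column1 = tbl2.column2
                  AND tbl1.column3 < tbl2.column4
""") |> DataFrame
```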