Using a DataFrame to calculate another column in a separate DataFrame

I am using a DataFrame invoices that looks like this:

(screenshot of the invoices table)

to count the number of invoices that have happened for a given contract as of a given date. The functions I’m using to do this are:

using DataFrames, DataFramesMeta, Dates, Lazy

function num_invoices_setup(dt::Date, cak_id::String)
    (@> begin
        invoices::DataFrame
        @where :contract_id .== cak_id
        @where :due_date .<= dt
    end)::DataFrame # begin
end # function



function num_invoices(dt::Date, cak_id::String, mod_post_date::Union{Date, Missing}, mod_effective_date::Union{Date, Missing})
    df::DataFrame = num_invoices_setup(dt, cak_id)
    if mod_effective_date |> ismissing
        return (nrow(df))::Int64
    else
        return (filter(r -> r.due_date > mod_effective_date, df) |> nrow)::Int64
    end # if
end # function

Everything works but takes a really long time to complete when using num_invoices to calculate a column in a DataFrame that is ~1.2M rows.
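Roughly, the setup looks like this on a toy version of the data (the macros are replaced with plain indexing, and the contracts table and its column names are placeholders I made up — the real table has ~1.2M rows):

```julia
using DataFrames, Dates

# Toy stand-in for the global `invoices` table from the post.
invoices = DataFrame(
    contract_id = ["A", "A", "B"],
    due_date    = [Date(2020, 1, 10), Date(2020, 2, 10), Date(2020, 1, 20)],
)

# Same logic as num_invoices, written without the DataFramesMeta macros.
# (mod_post_date is unused, as in the original signature.)
function num_invoices(dt, cak_id, mod_post_date, mod_effective_date)
    df = invoices[(invoices.contract_id .== cak_id) .& (invoices.due_date .<= dt), :]
    ismissing(mod_effective_date) && return nrow(df)
    return count(>(mod_effective_date), df.due_date)
end

# Placeholder for the ~1.2M-row table; column names are assumptions.
contracts = DataFrame(
    dt          = [Date(2020, 3, 1), Date(2020, 1, 15), Date(2020, 3, 1)],
    contract_id = ["A", "A", "B"],
    mod_post_date      = [missing, missing, missing],
    mod_effective_date = [missing, Date(2020, 1, 1), missing],
)

# One num_invoices call per row -- this is the slow part at 1.2M rows.
transform!(contracts,
    [:dt, :contract_id, :mod_post_date, :mod_effective_date] =>
        ByRow(num_invoices) => :n_invoices)
```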

Here is the output of @code_warntype for num_invoices:

(screenshot of the @code_warntype output)

So right now it is taking about 2 hours for num_invoices. Any suggestions for how to speed this up?

Welcome to the forum! Glad you are using DataFrames.

2 hours is a very long time for this kind of calculation. This should be done in a few seconds at most using combine. Using the latest version of DataFrames you can do

combine(groupby(df, [:contract_id, :due_date]), nrow)
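On a toy table with the column names from your post, that looks like:

```julia
using DataFrames, Dates

df = DataFrame(
    contract_id = ["A", "A", "A", "B"],
    due_date    = [Date(2020, 1, 1), Date(2020, 1, 1), Date(2020, 2, 1), Date(2020, 1, 1)],
)

# One row per (contract_id, due_date) group, with an `nrow` count column.
counts = combine(groupby(df, [:contract_id, :due_date]), nrow)
```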

As a side note, you really don’t need the type annotations you are using. They don’t help with performance and will make things more fragile than they need to be.


Thanks for the reply. Sorry, I think I explained my issue poorly: I need to call num_invoices as the function in transform!, so the two hours come from ~0.007 seconds per call times the ~1.25 million rows that transform! updates. I agree that your implementation is cleaner and more concise, but I am struggling to find a way to speed this up beyond that…

Can you maybe make an example DataFrame with some random data that roughly matches what you’re working on and the output you’re expecting?

I agree with Peter that it feels like it should be possible to do this a lot faster, particularly since your code only seems to count rows of subgroups of your DataFrame, but it’s currently not clear to me what exactly you’re trying to do here.

It sounds a little like you’re doing a “join” and should do the join all at once; see Joins · DataFrames.jl in the manual.

Sorry for the delayed response - couldn’t get to a REPL for a few days.
A join is exactly what I am trying to do… I probably should have thought of that…

Just out of curiosity, is there a way to do something akin to SQL left join tbl1 on tbl1.column1 = tbl2.column2 and tbl1.column3 < tbl2.column4? The less-than condition doesn’t seem super intuitive from reading the docs, but I imagine it’s more likely a me problem.

I believe what you want is a join on column1=column2 followed by a filter on column3 < column4
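A minimal sketch of that idea, with made-up tables whose column names match the SQL above:

```julia
using DataFrames, Dates

tbl1 = DataFrame(column1 = ["A", "B"],
                 column3 = [Date(2020, 1, 1), Date(2020, 6, 1)])
tbl2 = DataFrame(column2 = ["A", "A", "B"],
                 column4 = [Date(2020, 3, 1), Date(2019, 12, 1), Date(2020, 7, 1)])

# Equi-join on column1 = column2, then filter on column3 < column4.
joined = leftjoin(tbl1, tbl2, on = :column1 => :column2)
result = filter(r -> !ismissing(r.column4) && r.column3 < r.column4, joined)
```

Note that filtering after the join drops left-table rows without a match, which behaves like an inner join; in SQL, putting the less-than condition inside ON would keep unmatched left rows with NULLs.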

If you have some complicated SQL you’d like to carry out on this data, you could always try slamming the DataFrames into SQLite and then running an SQL query against it.
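A quick sketch of that route with SQLite.jl (toy tables, integer stand-ins for the dates; assumes SQLite.jl and DBInterface.jl are installed):

```julia
using DataFrames, SQLite, DBInterface

db = SQLite.DB()  # in-memory database

# Toy tables matching the column names from the SQL question above.
tbl1 = DataFrame(column1 = ["A", "B"], column3 = [1, 6])
tbl2 = DataFrame(column2 = ["A", "B"], column4 = [3, 2])

SQLite.load!(tbl1, db, "tbl1")
SQLite.load!(tbl2, db, "tbl2")

# The non-equi condition goes straight into the ON clause; unmatched
# left rows come back with NULL (missing) in tbl2's columns.
result = DBInterface.execute(db, """
    SELECT * FROM tbl1
    LEFT JOIN tbl2 ON tbl1.column1 = tbl2.column2
                  AND tbl1.column3 < tbl2.column4
""") |> DataFrame
```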