Need for speed: looping over subdataframes to construct lags

I have a very large DataFrame with columns j, t, b, b_t.

I am creating a new column s by df.s = df.b
and then I am making changes to the values in s depending on certain conditions of whether b_t and t differ, for each group j. The following code shows the changes I am making. It works but it takes forever:

for g in groupby(df, :j)
   g.s = ifelse.(g.b_t .== g.t .- 1, lag(g.b, -1), g.b )
end

The lag function uses ShiftedArrays package.

Even putting the above loop in a function, and making it available @everywhere with using Distributed to parallelize it, only has modest gains in speed. I am wondering is there an easy efficiency gain that I am missing here. Thanks!

First comment is that df.s = df.b is not a good pattern. It creates alias of :s and :b. They have the same memory location. Do transform!(df, :b => :s) or df.s = copy(df.b) to ensure you have a copy.

Having said this it is likely this is not needed at all as it should be enough to just write:

transform!(groupby(df, :j), [:b, :t, :b_t] => ((b, t, b_t) -> ifelse.(b_t .== t .- 1, lag(b, -1), b)) => :s)

It is possible to further improve the performance as this solution does some unnecessary allocations, but maybe this is already good enough.

2 Likes

Thanks so much @bkamins ! It improved speed. And also thanks for correcting my sloppy assignment of b !

Can you clarify what is the desired result here? The lag function depends on the ordering of rows. Yet a database relation has no record order. A DataFrame does have row order, but it is implicit with DataFrame construction, and the groupby guarantee of order should be looked up in the docs or code or @bkamins. If b_t and t are “time” ordering within groups, perhaps somehow sorting the DataFrame can make this whole operation quicker.

Yes they are time variables. The data was already sorted. I did not mention that small detail because if not, the exercise would have been incorrect regardless.

groupby guarantees to keep the row order within groups. That is one of the crucal advantages of having a data frame over just a data base.

1 Like

you could try to see if a for loop is not more suitable for your case

@views for i in 1:length(b)-1
    if b_t[i]==t[i]-1
        b[i]=b[i+1]
    end
end
if b_t[end]==t[end]-1
    b[end]==missing
end
end