Need for speed: looping over subdataframes to construct lags

moshi · March 17, 2023, 6:15pm

I have a very large DataFrame with columns j, t, b, b_t.

I create a new column s with df.s = df.b and then, within each group j, change the values in s depending on whether b_t equals t - 1. The following code shows the changes I am making. It works but it takes forever:

for g in groupby(df, :j)
   g.s = ifelse.(g.b_t .== g.t .- 1, lag(g.b, -1), g.b )
end

The lag function comes from the ShiftedArrays package.

Even putting the above loop in a function and making it available @everywhere with using Distributed to parallelize it gives only modest gains in speed. Is there an easy efficiency gain that I am missing here? Thanks!

bkamins · March 17, 2023, 7:41pm

A first comment: df.s = df.b is not a good pattern. It makes :s an alias of :b, so both columns share the same memory. Use transform!(df, :b => :s) or df.s = copy(df.b) to ensure you have a copy.
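To make the aliasing concrete, here is a minimal sketch on a made-up three-row data frame (the data is purely illustrative):

```julia
using DataFrames

df = DataFrame(b = [1, 2, 3])

df.s = df.b          # no copy: both column names refer to the same vector
df.s === df.b        # true
df.s[1] = 100
df.b[1]              # 100, so :b changed as well

df.s = copy(df.b)    # independent copy
df.s[2] = 200
df.b[2]              # still 2
```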

Having said this, it is likely not needed at all, as it should be enough to just write:

transform!(groupby(df, :j), [:b, :t, :b_t] => ((b, t, b_t) -> ifelse.(b_t .== t .- 1, lag(b, -1), b)) => :s)

It is possible to further improve the performance as this solution does some unnecessary allocations, but maybe this is already good enough.
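For reference, here is how the transform! call above behaves on a small made-up data frame. The values are illustrative only, and the sketch assumes the rows are already sorted by t within each group. The explicit `using ShiftedArrays: lag` is used so the name is available whether or not the package exports it.

```julia
using DataFrames
using ShiftedArrays: lag

# Toy data, sorted by t within each group j (values are made up)
df = DataFrame(j   = [1, 1, 1, 2, 2],
               t   = [1, 2, 3, 1, 2],
               b   = [10, 20, 30, 40, 50],
               b_t = [0, 0, 2, 0, 1])

# lag(b, -1) is a lead: element i holds b[i+1], and missing past the end
transform!(groupby(df, :j),
           [:b, :t, :b_t] => ((b, t, b_t) -> ifelse.(b_t .== t .- 1, lag(b, -1), b)) => :s)

df.s   # [20, 20, missing, 50, missing]
```

The last row of each group gets missing whenever its condition holds, because there is no following row inside the group to take a value from.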

moshi · March 17, 2023, 11:18pm

Thanks so much @bkamins ! It improved speed. And also thanks for correcting my sloppy assignment of b !

Dan · March 18, 2023, 1:27am

Can you clarify what the desired result is here? The lag function depends on the ordering of rows, yet a database relation has no record order. A DataFrame does have row order, but it is implicit in how the DataFrame was constructed, and the order guarantee of groupby should be checked in the docs, the code, or with @bkamins. If b_t and t are “time” orderings within groups, perhaps sorting the DataFrame can make this whole operation quicker.

moshi · March 18, 2023, 2:37am

Yes, they are time variables. The data was already sorted. I did not mention that detail because without sorting the exercise would have been incorrect regardless.

bkamins · March 18, 2023, 11:16am

groupby guarantees to keep the row order within groups. That is one of the crucial advantages of having a data frame over just a database.
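A quick way to see this on a made-up data frame whose groups are interleaved (the data is illustrative only):

```julia
using DataFrames

df = DataFrame(j = [2, 1, 2, 1, 2], t = [1, 1, 2, 2, 3])
gdf = groupby(df, :j)

# Within each group, rows keep the relative order they had in df
t2 = collect(gdf[(j = 2,)].t)   # [1, 2, 3]
t1 = collect(gdf[(j = 1,)].t)   # [1, 2]
```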

rocco_sprmnt21 · March 18, 2023, 11:25am

You could check whether a plain for loop is more suitable for your case:

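# b, t and b_t are the column vectors of a single group; b is overwritten in place
for i in 1:length(b)-1
    if b_t[i] == t[i] - 1
        b[i] = b[i+1]
    end
end
# the last row has no successor, so b must be able to hold missing
if b_t[end] == t[end] - 1
    b[end] = missing
end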
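One possible way to apply such a loop per group is to wrap it in a function and pass that function to transform!. This is only a sketch: the helper name `shift_where` is hypothetical, the data is made up, and the code assumes t and b_t contain no missing values and that rows are sorted by time within each group. Unlike the loop above, it writes into a new vector rather than overwriting b.

```julia
using DataFrames

# Hypothetical helper: take the next value of b where b_t == t - 1,
# otherwise keep b. Allocates a single output vector per group.
function shift_where(b, t, b_t)
    n = length(b)
    s = Vector{Union{Missing, eltype(b)}}(undef, n)
    for i in 1:n
        if b_t[i] == t[i] - 1
            s[i] = i < n ? b[i + 1] : missing  # last row has no successor
        else
            s[i] = b[i]
        end
    end
    return s
end

# Toy data, sorted by t within each group j (values are made up)
df = DataFrame(j   = [1, 1, 1, 2, 2],
               t   = [1, 2, 3, 1, 2],
               b   = [10, 20, 30, 40, 50],
               b_t = [0, 0, 2, 0, 1])

transform!(groupby(df, :j), [:b, :t, :b_t] => shift_where => :s)

df.s   # [20, 20, missing, 50, missing]
```

Putting the loop inside a function matters because the compiler does not know the column types of a DataFrame; once the columns are passed as arguments, the loop body is compiled for their concrete types.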
