I have a very large DataFrame with columns j, t, b, b_t.
I am creating a new column s via df.s = df.b,
and then I am modifying the values in s within each group j, depending on whether b_t and t differ. The following code shows the changes I am making. It works, but it takes forever:
```julia
for g in groupby(df, :j)
    g.s = ifelse.(g.b_t .== g.t .- 1, lag(g.b, -1), g.b)
end
```
The lag function comes from the ShiftedArrays package.
Even putting the above loop in a function and making it available @everywhere (via Distributed) to parallelize it yields only modest speed gains. Is there an easy efficiency gain that I am missing here? Thanks!
First comment is that df.s = df.b is not a good pattern. It creates an alias: :s and :b share the same underlying vector (the same memory location), so mutating one mutates the other. Do transform!(df, :b => :s) or df.s = copy(df.b) to ensure you have a copy.
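The aliasing is easy to see on a toy frame (hypothetical data, just to illustrate the point above):

```julia
using DataFrames

df = DataFrame(b = [1, 2, 3])   # hypothetical toy data
df.s = df.b                     # :s and :b now share the same vector
df.s[1] = 99                    # writing through :s ...
df.b[1]                         # ... also changed :b to 99 (aliasing)
df.s = copy(df.b)               # independent copy: writes to :s leave :b alone
```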
Having said this, it is likely that the copy is not needed at all, as the whole operation can be done in a single transform! call on the grouped data frame.
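A possible shape of that single call, assuming the condition from the question and ShiftedArrays.lag (the toy data and the exact src => fun => dest pair here are my reconstruction, not code from the original reply):

```julia
using DataFrames, ShiftedArrays

# hypothetical toy data: two groups, time-sorted within each group,
# column names taken from the question
df = DataFrame(j   = [1, 1, 1, 2, 2],
               t   = [1, 2, 3, 1, 2],
               b_t = [0, 2, 3, 0, 1],
               b   = [10, 20, 30, 40, 50])

# one grouped transform! replaces the explicit loop: it creates :s and
# applies the condition per group in a single pass, with no aliasing
transform!(groupby(df, :j),
           [:b, :b_t, :t] =>
               ((b, bt, t) -> ifelse.(bt .== t .- 1, ShiftedArrays.lag(b, -1), b)) =>
               :s)
```

Note that transform! on a GroupedDataFrame returns the rows in the original order of df, and lag(b, -1) (a lead) fills the last in-group position with missing, as in the original loop.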
Can you clarify what the desired result is here? The lag function depends on the ordering of rows, yet a database relation has no record order. A DataFrame does have row order, but it is implicit in DataFrame construction, and whether groupby guarantees order should be looked up in the docs, the code, or asked of @bkamins. If b_t and t are "time" orderings within groups, perhaps sorting the DataFrame could make this whole operation quicker.
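If sorting is the route, a minimal sketch (hypothetical data, column names matching the question's):

```julia
using DataFrames

df = DataFrame(j = [1, 2, 1, 2], t = [2, 1, 1, 2], b = [4, 3, 2, 1])

sort!(df, [:j, :t])             # establish within-group time order
issorted(df, [:j, :t])          # true; the lag-based logic relies on this
```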
Yes, they are time variables. The data was already sorted; I did not mention that small detail because otherwise the exercise would have been incorrect regardless.