I have a very large DataFrame with columns `j`, `t`, `b`, and `b_t`. I am creating a new column `s` with `df.s = df.b`, and then changing the values in `s` depending on a condition relating `b_t` and `t`, separately for each group `j`. The following code shows the changes I am making. It works, but it takes forever:
```julia
for g in groupby(df, :j)
    g.s = ifelse.(g.b_t .== g.t .- 1, lag(g.b, -1), g.b)
end
```
The `lag` function comes from the ShiftedArrays package.

Even putting the loop above in a function, and making it available `@everywhere` with `using Distributed` to parallelize it, yields only modest speed gains. Is there an easy efficiency gain I am missing here? Thanks!
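For reference, here is a self-contained toy version of the setup above; the data values are made up for illustration, and `ShiftedArrays.lag` is written qualified so it works whether or not the package exports `lag`:

```julia
using DataFrames, ShiftedArrays

# Toy data, made up for illustration (5 rows, 2 groups).
df = DataFrame(j   = [1, 1, 1, 2, 2],
               t   = [1, 2, 3, 1, 2],
               b   = [10, 20, 30, 40, 50],
               b_t = [0, 1, 9, 0, 9])

# copy avoids aliasing :b (plain df.s = df.b would make :s and :b share storage,
# so the loop below would silently overwrite :b as well).
df.s = copy(df.b)

# Within each group, replace s with the next row's b wherever b_t == t - 1.
for g in groupby(df, :j)
    g.s = ifelse.(g.b_t .== g.t .- 1, ShiftedArrays.lag(g.b, -1), g.b)
end
```

Note that `lag(v, -1)` shifts in the lead direction, so the last row of each group gets `missing` from `lag`; the `ifelse.` only keeps that value when the condition is true there.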
Can you clarify what the desired result is here? The `lag` function depends on the ordering of rows, yet a database relation has no record order. A DataFrame does have row order, but it is implicit in how the DataFrame was constructed, and whether `groupby` guarantees to preserve it should be checked in the docs, the code, or with @bkamins. If `b_t` and `t` define a "time" ordering within groups, perhaps sorting the DataFrame first could make this whole operation quicker.
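A sketch of that sorting idea, assuming `:t` really is the intended within-group time order (the toy data is made up for illustration): sort once to make the row order explicit, then compute `s` in a single grouped `transform!` instead of a Julia-side loop over sub-DataFrames.

```julia
using DataFrames, ShiftedArrays

df = DataFrame(j   = [1, 2, 1, 2, 1],   # deliberately unsorted
               t   = [1, 1, 2, 2, 3],
               b   = [10, 40, 20, 50, 30],
               b_t = [0, 0, 1, 9, 9])

# Make the row order explicit, so lag/lead is well defined within groups.
sort!(df, [:j, :t])

# One grouped transform! writes :s for all groups at once.
transform!(groupby(df, :j),
           [:b, :b_t, :t] =>
               ((b, b_t, t) -> ifelse.(b_t .== t .- 1,
                                       ShiftedArrays.lag(b, -1),
                                       b)) => :s)
```

Because `transform!` creates `:s` as a new column, it can hold `missing` at group boundaries without any eltype fuss, and the per-group work stays inside DataFrames' optimized split-apply-combine machinery.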