A lagged table-join for finding the last ten rows that meets a criterion, for each row

Cong · January 4, 2024, 5:04pm

I have a dataframe like below. I am trying to add 6 columns to it: Out1Bac1, Out1Bac2, Out1Bac3, Out2Bac1, Out2Bac2, Out2Bac3.
For each row, find the last row whose abschoice equals the CP1 of the current row, and get the unitReward of that row to fill into the Out1Bac1 column of the current row; then find the second last row whose abschoice equals the CP1 of the current row, and get the unitReward of that row to fill into the Out1Bac2 column of the current row; then find the third last row whose abschoice equals the CP1 of the current row, and get the unitReward of that row to fill into the Out1Bac3 column of the current row. and the same for CP2 and Out2Bac1, Out2Bac2, Out2Bac3.
195×4 DataFrame
Row │ CP1 CP2 abschoice unitReward
│ String String String? Int16
─────┼───────────────────────────────────────
1 │ A D D 4
2 │ A D D 3
3 │ A A A 2
4 │ A D D 0
5 │ A D D 3
6 │ A D D 6
7 │ A B A 3
8 │ A A A 1
9 │ A B A 2
10 │ A B A 2
Currently I am trying to do this (in a chain, so don’t worry about specifying the df or adding !):
@rtransform(:Out2Bac1 = lag(:abschoice, 1) .== :CP2 ? lag(:unitReward, 1) : lag(:abschoice, 2) .== :CP2 ? lag(:unitReward, 1) : missing)
but
first: Julia complains that no method matching lag(::String, ::Int64),
second (the real problem): this syntax requires me to manually write out how many rows I am willing to search back, one by one, which is pretty ugly if I want to search, say, 30 rows back,
third (the real tricky problem): this syntax doesn’t allow me to find the second last or the third last row that satisfies that criterion easily. So maybe a better way is to find at once the last 3 rows whose abschoice == CP1 of the current row, then fill the 3 unitReward values from that 3 rows, into the 3 columns Out1Bac1, Out1Bac2, Out1Bac3 for the current row. — I have no clue how to do that – sounds like I need a join together with lag??

Dan · January 4, 2024, 6:19pm

First,

seems to be because of @rtransform which transforms row by row. lag wants to work on a vector. You can see this in the error message which complains lag received a String as first argument instead of a Vector{String}.

As to the broader question… that’s a little more complicated. It would also be nice if you gave a mini-example with the desired output DataFrame (although it is somewhat comprehensible even in the current post).

Dan · January 4, 2024, 6:58pm

Again, IIUC the question, this is one way:

function populateOutBac!(df)
    d = Dict{eltype(df.abschoice),Vector{eltype(df.unitReward)}}()
    df.Out1Bac1 = allowmissing(similar(df.unitReward))
    df.Out1Bac2 = allowmissing(similar(df.unitReward))
    df.Out1Bac3 = allowmissing(similar(df.unitReward))
    df.Out2Bac1 = allowmissing(similar(df.unitReward))
    df.Out2Bac2 = allowmissing(similar(df.unitReward))
    df.Out2Bac3 = allowmissing(similar(df.unitReward))
    for r in eachrow(df)
        w = get!(()->(eltype(df.unitReward)[]), d, r.CP1)
        z = get!(()->(eltype(df.unitReward)[]), d, r.CP2)
        r.Out1Bac1, r.Out1Bac2, r.Out1Bac3 = get.(Ref(w), 1:3, missing)
        r.Out2Bac1, r.Out2Bac2, r.Out2Bac3 = get.(Ref(z), 1:3, missing)
        v = get!(()->(eltype(df.unitReward)[]), d, r.abschoice)
        if length(v) == 3
            circshift!(v, 1)
            v[1] = r.unitReward
        else
            pushfirst!(v, r.unitReward)
        end
    end
end

which gives for the partial DataFrame in the question:

10×10 DataFrame
 Row │ CP1      CP2      abschoice  unitReward  Out1Bac1  Out1Bac2  Out1Bac3  Out2Bac1  Out2Bac2  Out2Bac3 
     │ String1  String1  String1    Int64       Int64?    Int64?    Int64?    Int64?    Int64?    Int64?   
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ A        D        D                   4   missing   missing   missing   missing   missing   missing 
   2 │ A        D        D                   3   missing   missing   missing         4   missing   missing 
   3 │ A        A        A                   2   missing   missing   missing   missing   missing   missing 
   4 │ A        D        D                   0         2   missing   missing         3         4   missing 
   5 │ A        D        D                   3         2   missing   missing         0         3         4
   6 │ A        D        D                   6         2   missing   missing         3         0         3
   7 │ A        B        A                   3         2   missing   missing   missing   missing   missing 
   8 │ A        A        A                   1         3         2   missing         3         2   missing 
   9 │ A        B        A                   2         1         3         2   missing   missing   missing 
  10 │ A        B        A                   2         2         1         3   missing   missing   missing

Any good?

Cong · January 4, 2024, 7:49pm

This is the desired output dataframe – which is exactly what you got!:

and the thing I want to do is: e.g. for the 5th row, CP1 == A, I go back to find the rows that have abschoice == A: row 3. then get the corresponding unitReward from the row: 2. then fill it in the 5th row: fill 2 in CP1Bac1 ; couldn’t find more rows that have abschoice ==A, so fill missing in CP1Bac2 , missing in CP1Bac3; CP2 == D for the 5th row, so I go back and find row 4,2,1 that have abschoice == D, the unitReward 0,3,4. so fill them in Out2Back1, Out2Back2, Out2Back3 respectively.

Cong · January 4, 2024, 8:25pm

Thank you so much!! This is exactly what I wanted. Could you explain your code a bit? d = Dict() creates an empty dictionary of certain data type. but I couldn’t tell where you actually fill the dictionary with the vector of df.abschoice and df.unitReward.

Dan · January 4, 2024, 8:36pm

Glad to be of assistance. Some more explanation to the code:

function populateOutBac!(df)
# Define a dictionary to hold the recent `unitReward` for each 
# choice:
    d = Dict{eltype(df.abschoice),Vector{eltype(df.unitReward)}}()
# Add the missing columns to the DataFrame, with type as 
# reward and allowing `missing` values:
    df.Out1Bac1 = allowmissing(similar(df.unitReward))
    df.Out1Bac2 = allowmissing(similar(df.unitReward))
    df.Out1Bac3 = allowmissing(similar(df.unitReward))
    df.Out2Bac1 = allowmissing(similar(df.unitReward))
    df.Out2Bac2 = allowmissing(similar(df.unitReward))
    df.Out2Bac3 = allowmissing(similar(df.unitReward))
# Go over each row:
    for r in eachrow(df)
# Get the recent rewards for CP1 and CP2 stored in Dict `d` (if 
# there isn't an entry, generate an empty vector for that value:
        w = get!(()->(eltype(df.unitReward)[]), d, r.CP1)
        z = get!(()->(eltype(df.unitReward)[]), d, r.CP2)
# Use broadcasting and default value feature of `get` to get 
# previous value or `missing`:
        r.Out1Bac1, r.Out1Bac2, r.Out1Bac3 = get.(Ref(w), 1:3, missing)
        r.Out2Bac1, r.Out2Bac2, r.Out2Bac3 = get.(Ref(z), 1:3, missing)
# Push the new reward into the Dict `d`, by first getting the right 
# entry:
        v = get!(()->(eltype(df.unitReward)[]), d, r.abschoice)
# If entry already has 3 previous values, cycle and overwrite the 
# oldest (now in the first position):
        if length(v) == 3
            circshift!(v, 1)
            v[1] = r.unitReward
        else
            pushfirst!(v, r.unitReward)
        end
    end
end

That’s it.
There should be prettier ways of doing it, and maybe someone will chime in. But to help this happen, it helps to have cut-and-paste option for testing:

iob = IOBuffer("""1 │ A D D 4
       2 │ A D D 3
       3 │ A A A 2
       4 │ A D D 0
       5 │ A D D 3
       6 │ A D D 6
       7 │ A B A 3
       8 │ A A A 1
       9 │ A B A 2
       10 │ A B A 2""");
df = CSV.read(iob, DataFrame; 
  header=["rownum", "tmp", "CP1", "CP2", "abschoice", "unitReward"]);
select!(df, Not([:tmp, :rownum]))

to create df for processing with above populateOutBac!(df) or some other way.

rocco_sprmnt21 · January 4, 2024, 10:54pm

This isn’t very elegant but, perhaps, it’s easier to follow

df= CSV.read("tablej.csv",DataFrame)
insertcols!(df,1,:row=>1:nrow(df))

function tabjoin(r,cp1)
    v=findall(==(cp1),df[1:r-1,4])
    coal= length(v)==0 ? [missing,missing,missing] :
    length(v)==1 ? [missing,missing, df[v[1],5]] :
    length(v)==2 ? [missing, df[v[1],5],df[v[2],5]] :
    df[v[end-2:end],5]
    (;zip([:O1B1,:O1B2,:O1B3],reverse(coal))...)
end

df1=transform(df, [1,2]=>ByRow(tabjoin)=>AsTable)

Topic		Replies	Views
Transforming a daily DataFrame with missing values into a DataFrame with end-of-month values General Usage dataframes	11	183	November 22, 2024
Delete missing values after the last non missing value in each id New to Julia dataframes	7	558	September 1, 2022
Creating one column from another New to Julia	5	308	September 8, 2022
Using previous row values to create values for a new column New to Julia dataframes	4	2651	March 20, 2021
Function like VLOOKUP General Usage question , package , dataframes	19	2331	May 27, 2020

A lagged table-join for finding the last ten rows that meets a criterion, for each row

Related topics