I have a dataframe like below. I am trying to add 6 columns to it: Out1Bac1, Out1Bac2, Out1Bac3, Out2Bac1, Out2Bac2, Out2Bac3
.
For each row, find the last row whose abschoice
equals the CP1
of the current row, and get the unitReward
of that row to fill into the Out1Bac1
column of the current row; then find the second last row whose abschoice
equals the CP1
of the current row, and get the unitReward
of that row to fill into the Out1Bac2
column of the current row; then find the third last row whose abschoice
equals the CP1
of the current row, and get the unitReward
of that row to fill into the Out1Bac3
column of the current row. and the same for CP2
and Out2Bac1, Out2Bac2, Out2Bac3
.
195Γ4 DataFrame
Row β CP1 CP2 abschoice unitReward
β String String String? Int16
ββββββΌβββββββββββββββββββββββββββββββββββββββ
1 β A D D 4
2 β A D D 3
3 β A A A 2
4 β A D D 0
5 β A D D 3
6 β A D D 6
7 β A B A 3
8 β A A A 1
9 β A B A 2
10 β A B A 2
Currently I am trying to do this (in a chain, so donβt worry about specifying the df or adding !):
@rtransform(:Out2Bac1 = lag(:abschoice, 1) .== :CP2 ? lag(:unitReward, 1) : lag(:abschoice, 2) .== :CP2 ? lag(:unitReward, 1) : missing)
but
first: Julia complains that no method matching lag(::String, ::Int64)
,
second (the real problem): this syntax requires me to manually write out how many rows I am willing to search back, one by one, which is pretty ugly if I want to search, say, 30 rows back,
third (the real tricky problem): this syntax doesnβt allow me to find the second last or the third last row that satisfies that criterion easily. So maybe a better way is to find at once the last 3 rows whose abschoice
== CP1
of the current row, then fill the 3 unitReward
values from that 3 rows, into the 3 columns Out1Bac1, Out1Bac2, Out1Bac3
for the current row. β I have no clue how to do that β sounds like I need a join
together with lag
??
First,
seems to be because of @rtransform
which transforms row by row. lag
wants to work on a vector. You can see this in the error message which complains lag
received a String
as first argument instead of a Vector{String}
.
As to the broader questionβ¦ thatβs a little more complicated. It would also be nice if you gave a mini-example with the desired output DataFrame (although it is somewhat comprehensible even in the current post).
Again, IIUC the question, this is one way:
function populateOutBac!(df)
d = Dict{eltype(df.abschoice),Vector{eltype(df.unitReward)}}()
df.Out1Bac1 = allowmissing(similar(df.unitReward))
df.Out1Bac2 = allowmissing(similar(df.unitReward))
df.Out1Bac3 = allowmissing(similar(df.unitReward))
df.Out2Bac1 = allowmissing(similar(df.unitReward))
df.Out2Bac2 = allowmissing(similar(df.unitReward))
df.Out2Bac3 = allowmissing(similar(df.unitReward))
for r in eachrow(df)
w = get!(()->(eltype(df.unitReward)[]), d, r.CP1)
z = get!(()->(eltype(df.unitReward)[]), d, r.CP2)
r.Out1Bac1, r.Out1Bac2, r.Out1Bac3 = get.(Ref(w), 1:3, missing)
r.Out2Bac1, r.Out2Bac2, r.Out2Bac3 = get.(Ref(z), 1:3, missing)
v = get!(()->(eltype(df.unitReward)[]), d, r.abschoice)
if length(v) == 3
circshift!(v, 1)
v[1] = r.unitReward
else
pushfirst!(v, r.unitReward)
end
end
end
which gives for the partial DataFrame in the question:
10Γ10 DataFrame
Row β CP1 CP2 abschoice unitReward Out1Bac1 Out1Bac2 Out1Bac3 Out2Bac1 Out2Bac2 Out2Bac3
β String1 String1 String1 Int64 Int64? Int64? Int64? Int64? Int64? Int64?
ββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β A D D 4 missing missing missing missing missing missing
2 β A D D 3 missing missing missing 4 missing missing
3 β A A A 2 missing missing missing missing missing missing
4 β A D D 0 2 missing missing 3 4 missing
5 β A D D 3 2 missing missing 0 3 4
6 β A D D 6 2 missing missing 3 0 3
7 β A B A 3 2 missing missing missing missing missing
8 β A A A 1 3 2 missing 3 2 missing
9 β A B A 2 1 3 2 missing missing missing
10 β A B A 2 2 1 3 missing missing missing
Any good?
This is the desired output dataframe β which is exactly what you got!:
and the thing I want to do is: e.g. for the 5th row, CP1 == A, I go back to find the rows that have abschoice == A: row 3. then get the corresponding unitReward from the row: 2. then fill it in the 5th row: fill 2 in
CP1Bac1
; couldnβt find more rows that have abschoice ==A, so fill missing in CP1Bac2
, missing in CP1Bac3
; CP2 == D for the 5th row, so I go back and find row 4,2,1 that have abschoice == D, the unitReward 0,3,4. so fill them in Out2Back1, Out2Back2, Out2Back3 respectively.
Thank you so much!! This is exactly what I wanted. Could you explain your code a bit? d = Dict() creates an empty dictionary of certain data type. but I couldnβt tell where you actually fill the dictionary with the vector of df.abschoice and df.unitReward.
Glad to be of assistance. Some more explanation to the code:
function populateOutBac!(df)
# Define a dictionary to hold the recent `unitReward` for each
# choice:
d = Dict{eltype(df.abschoice),Vector{eltype(df.unitReward)}}()
# Add the missing columns to the DataFrame, with type as
# reward and allowing `missing` values:
df.Out1Bac1 = allowmissing(similar(df.unitReward))
df.Out1Bac2 = allowmissing(similar(df.unitReward))
df.Out1Bac3 = allowmissing(similar(df.unitReward))
df.Out2Bac1 = allowmissing(similar(df.unitReward))
df.Out2Bac2 = allowmissing(similar(df.unitReward))
df.Out2Bac3 = allowmissing(similar(df.unitReward))
# Go over each row:
for r in eachrow(df)
# Get the recent rewards for CP1 and CP2 stored in Dict `d` (if
# there isn't an entry, generate an empty vector for that value:
w = get!(()->(eltype(df.unitReward)[]), d, r.CP1)
z = get!(()->(eltype(df.unitReward)[]), d, r.CP2)
# Use broadcasting and default value feature of `get` to get
# previous value or `missing`:
r.Out1Bac1, r.Out1Bac2, r.Out1Bac3 = get.(Ref(w), 1:3, missing)
r.Out2Bac1, r.Out2Bac2, r.Out2Bac3 = get.(Ref(z), 1:3, missing)
# Push the new reward into the Dict `d`, by first getting the right
# entry:
v = get!(()->(eltype(df.unitReward)[]), d, r.abschoice)
# If entry already has 3 previous values, cycle and overwrite the
# oldest (now in the first position):
if length(v) == 3
circshift!(v, 1)
v[1] = r.unitReward
else
pushfirst!(v, r.unitReward)
end
end
end
Thatβs it.
There should be prettier ways of doing it, and maybe someone will chime in. But to help this happen, it helps to have cut-and-paste option for testing:
iob = IOBuffer("""1 β A D D 4
2 β A D D 3
3 β A A A 2
4 β A D D 0
5 β A D D 3
6 β A D D 6
7 β A B A 3
8 β A A A 1
9 β A B A 2
10 β A B A 2""");
df = CSV.read(iob, DataFrame;
header=["rownum", "tmp", "CP1", "CP2", "abschoice", "unitReward"]);
select!(df, Not([:tmp, :rownum]))
to create df
for processing with above populateOutBac!(df)
or some other way.
This isnβt very elegant but, perhaps, itβs easier to follow
df= CSV.read("tablej.csv",DataFrame)
insertcols!(df,1,:row=>1:nrow(df))
function tabjoin(r,cp1)
v=findall(==(cp1),df[1:r-1,4])
coal= length(v)==0 ? [missing,missing,missing] :
length(v)==1 ? [missing,missing, df[v[1],5]] :
length(v)==2 ? [missing, df[v[1],5],df[v[2],5]] :
df[v[end-2:end],5]
(;zip([:O1B1,:O1B2,:O1B3],reverse(coal))...)
end
df1=transform(df, [1,2]=>ByRow(tabjoin)=>AsTable)