Hi, This should be straightforward dataframes question, but I have not been able to get my head around it.
I have two dataframes: 1. Describes the relationship between different quantities based on ID and 2. Contains some numerical quantities for all sets of IDs.
Using the first dataframe, I need to iterate through each row and find the corresponding index (row number) in the second dataframe (named Master in the code), to use for a mathematical operation. This is better explained in a MWE
using DataFrames
function generate_data()
Relationship = DataFrame(ID = ["A1","A2","A4"], LINK_1 = ["A5","A6","A10"],
LINK_2 = ["A3","A7",""], LINK_3 = ["A9","A11","A12"])
ID_t = collect(1:15)
ID = Vector{String}(undef,15)
for i in 1:15
ID[i]="A".*string(ID_t[i])
end
PRICE = rand(10:100,15)
Master = DataFrame(ID = ID,PRICE = PRICE)
return Relationship, Master
end
Relationship, Master = generate_data()
x = Master.PRICE
y = Vector{Any}(undef,length(Relationship.ID))
# Use a loop that iterates through each row of the Relationship dataframe,
# finds index (row number) of the corresponding ID(s) in columns (LINK_1, etc.) in the Master dataframe
# and performs some mathematical operation based on it and store the results in a vector y.
# Example of mathematical operation
# y[1] = x[1]+x[5]+x[3]+x[9]
# y[2] = x[2]+x[6] + x[7]+x[11]
# y[3] = x[4] + x[10]+x[12]
1. How can I find the corresponding indices in theMasterdataframe based on the criteria of relationships set out in theRelationshipdataframe? 2. Ideally I would like to use a loop to iterate and then apply some mathematical operation such as that shown in comments in the code.
If I use join(Relationship,Master, kind =:outer, on = intersect(names(Relationsip), names(Master))) This creates a dataframe combining columns from both dataframes. I am struggling to understand how to find indices of the IDs listed in columns LINK_1 to LINK_3 in the Master dataframe. For example in the first row A1 is linked to A5, A3 and A9. Indices for them are 5, 3 and 9.
Basically I want to do the type of calculation written in comments in the code. Obviously not manually, but using a loop or something where I read the relevant indices and store the results of the calculation in an array y.
I don’t mind how to get to that step, whether using join or something else. My knowledge of dataframes is not great, hence the question!
A DataFrame probably isn’t the right object for Master. Better to use a Dict instead
julia> d = Dict([Master.ID[i] => Master.PRICE[i] for i in 1:nrow(Master)]);
julia> function add_all(row_strs, d)
sum(get.(Ref(d), row_strs, 0))
end;
julia> map(eachrow(Relationship)) do r
add_all(collect(r[2:end]), d)
end
3-element Array{Int64,1}:
216
220
78
Can we do the same operation but using a loop and without the functions add_all and map? The reason being it will give me flexibility on the type of operation I want to do, such as multiplication, subtraction, addition, etc, whereas the add_all function is specialised for summation.
For example, a loop like this
for i in eachrow(Relationship)
y[i] = sum(x[i] for i in ....)
end
Note: The loop is not written correctly as I don’t know in this case how to use it.
It sounds like you could benefit from reading the manual a bit more. In add_all, you could just keep get.(Ref(d), row_strs, 0) and then work with that array in however way you see fit.