Indexing in dataframes

Hi, This should be straightforward dataframes question, but I have not been able to get my head around it.

I have two dataframes: 1. Describes the relationship between different quantities based on ID and 2. Contains some numerical quantities for all sets of IDs.

Using the first dataframe, I need to iterate through each row and find the corresponding index (row number) in the second dataframe (named Master in the code), to use for a mathematical operation. This is better explained in a MWE

using DataFrames

function generate_data()
    Relationship = DataFrame(ID = ["A1","A2","A4"], LINK_1 = ["A5","A6","A10"],
    LINK_2 = ["A3","A7",""], LINK_3 = ["A9","A11","A12"])
    ID_t =  collect(1:15)
    ID = Vector{String}(undef,15)

    for i in 1:15

    PRICE = rand(10:100,15)
    Master = DataFrame(ID = ID,PRICE = PRICE)   
    return Relationship, Master

Relationship, Master = generate_data()
    x = Master.PRICE
    y = Vector{Any}(undef,length(Relationship.ID))
    # Use a loop that iterates through each row of the Relationship dataframe,
    # finds index (row number) of the corresponding ID(s) in columns (LINK_1, etc.) in the Master dataframe
    # and performs some mathematical operation based on it and store the results in a vector y.
    # Example of mathematical operation
    # y[1] = x[1]+x[5]+x[3]+x[9]
    # y[2] = x[2]+x[6] + x[7]+x[11]
    # y[3] = x[4] + x[10]+x[12]

1. How can I find the corresponding indices in the Master dataframe based on the criteria of relationships set out in the Relationship dataframe?
2. Ideally I would like to use a loop to iterate and then apply some mathematical operation such as that shown in comments in the code.

Why not use a join?

If I use join(Relationship,Master, kind =:outer, on = intersect(names(Relationsip), names(Master))) This creates a dataframe combining columns from both dataframes. I am struggling to understand how to find indices of the IDs listed in columns LINK_1 to LINK_3 in the Master dataframe. For example in the first row A1 is linked to A5, A3 and A9. Indices for them are 5, 3 and 9.
Basically I want to do the type of calculation written in comments in the code. Obviously not manually, but using a loop or something where I read the relevant indices and store the results of the calculation in an array y.

I don’t mind how to get to that step, whether using join or something else. My knowledge of dataframes is not great, hence the question!

A DataFrame probably isn’t the right object for Master. Better to use a Dict instead

julia> d = Dict([Master.ID[i] => Master.PRICE[i] for i in 1:nrow(Master)]);

julia> function add_all(row_strs, d)
           sum(get.(Ref(d), row_strs, 0))

julia> map(eachrow(Relationship)) do r
           add_all(collect(r[2:end]), d)
3-element Array{Int64,1}:

Thank you!

Can we do the same operation but using a loop and without the functions add_all and map? The reason being it will give me flexibility on the type of operation I want to do, such as multiplication, subtraction, addition, etc, whereas the add_all function is specialised for summation.

For example, a loop like this

for i in eachrow(Relationship)
   y[i] = sum(x[i] for i in ....) 

Note: The loop is not written correctly as I don’t know in this case how to use it.

It sounds like you could benefit from reading the manual a bit more. In add_all, you could just keep get.(Ref(d), row_strs, 0) and then work with that array in however way you see fit.

1 Like