Vcat DataFrame columns based on multiple columns in Julia

I have 3 DataFrames, each containing 3 columns: A, B, and C.

using DataFrames    
common_data = Dict("A" => [1, 2, 3], "B" => [10, 20, 30])
df1 = DataFrame(merge(common_data, Dict("C" => [100, 200, 300])))
df2 = DataFrame(merge(common_data, Dict("C" => [400, 500, 600])))
df3 = DataFrame(merge(common_data, Dict("C" => [700, 800, 900])))

I consider columns A and B as indices and want to perform an inner join on 3 DataFrames based on column C. This should be done only when the values in columns A and B of each DataFrame are the same. The column of the final output should be [A,B,C_df1,C_df2,C_df3] . How can I achieve this?

If I understood correctly what you want, something like that might help:

tmp = innerjoin(df1,df2, on = [:A,:B], makeunique = true)
res = innerjoin(tmp, df3, on = [:A,:B], makeunique = true)

I’m on the phone so take with a grain of salt but you can do something like

reduce((x, y) -> innerjoin(x, y, on = [:A, :B]), (df1, df2, df3))
julia> innerjoin(df1,df2,df3, on=[:A,:B], makeunique=true)
3Γ—5 DataFrame
 Row β”‚ A      B      C      C_1    C_2   
     β”‚ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 β”‚     1     10    100    400    700
   2 β”‚     2     20    200    500    800
   3 β”‚     3     30    300    600    900
1 Like
renamecols : a Pair specifying how columns of left and right data frames should be renamed in the
       resulting data frame. Each element of the pair can be a string or a Symbol can be passed in which case       
       it is appended to the original column name; alternatively a function can be passed in which case it is       
       applied to each column name, which is passed to it as a String. Note that renamecols does not affect
       on columns, whose names are always taken from the left data frame and left unchanged.

The renamecols keyword allows some editing of the output column names, but, it seems, only in the case of two dataframes.
I can’t find any indication for a use in the case of more than 2 dataframes.
In the case in question it would be useful to be able to apply a function that uses, for example, the name of the dataframes to be intertwined to distinguish the 3 columns β€œC”

This is the same question as on SO, where I answered what can be done.

I can’t find any indication for a use in the case of more than 2 dataframes.

We decided not to add it, as there was no easy way to specify the renamecols argument for more than two tables. Now, thinking of it, maybe we could allow passing a tuple of values? If this is something that would be useful, can you please open an issue.

Note that for the same reason when joining more than two data frames one cannot separately specify the on keyword argument columns for multiple tables (which is allowed for two tables case). Again - we could discuss if this is really needed.

My initial thinking was that when joining more than 2 tables one really should pre-process the tables.

Also, in general, if there are only 3 tables one can run two inner-joins operations in sequence with renamecols passed (by extension of what @Rudi79 proposed above).

1 Like

Totally agree.
I don’t think it’s a function so much in demand as to justify the complications of development and also in use.