Creating one column from multiple DataFrames

I have a data similar to the following in which one column is partly defined in one DataFrame and partly in a second

using DataFrames, DataFramesMeta
# Data situation:
df1 = DataFrame(id1 = [1, 2, 3, 4], id2 = [missing, missing, 'a', 'b'])
df2 = DataFrame(id1 = [1, 2], id2 = ['c', 'd'])

Currently I solve this by joining the DataFrames and then creating a new variable based on the old ones:

# Approach:
df = join(df1, df2, on = :id1, kind = :left, makeunique = true)
df = @transform(df, id2_all = ifelse.(ismissing.(:id2), :id2_1, :id2))
df = select(df, :id1, :id2_all)
rename!(df, :id2_all =>:id_2)

Is there a smarter way of doing this? Thanks!

You can write:

df1.id2[df2.id1] = df2.id2

for an in-place operation.
It will overwrite all values matching :id1 in df2 though (and assumes all df2 is a subset of df1) - I understand this is what you want - your code does something a bit different though.
If you want to reproduce exactly what you have in your question do

idxs = intersect(df1.id1, df2.id1)
v = view(df1.id2, idxs)
v = ifelse.(ismissing.(v), view(df2.id2, idxs), v)

(this is also in-place)

2 Likes

Thanks for your response! I think I oversimplified the example a bit too much. id1 are not actually row indicators but rather some random id numbers and I am not certain the data is sorted correctly so I would like to do some kind of joining to make sure the ids align correctly. Therefore, I came up with the following based on your answer:

idxs = findall(ismissing.(df1.id2))
v = view(df1, idxs, :)
idys = map(x -> findfirst(x .== df2.id1), v.id1)
v.id2 .= df2.id2[idys]

Yes - joining is always tricky :slight_smile:. Happy that in the end you have what you needed.

The crucial point is (and I see you know this, but just want to note this here) that DataFrames.jl supports views and broadcasting that is 100% consistent with Base, so whatever you expect to work in Base should directly translate do data frame context.

2 Likes