Impute by index

Hello!

So I am new to the language, I have been learning now about all the joins, the select, combine, transform, etc.

But there is an issue that I don’t know how to solve without creating a for loop:

Suppose I have an dataset called df_missing:

6×3 DataFrame
│ Row │ id    │ val     │ other |
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ missing │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ missing │ 3     │

and I also have another dataset, called df_completion:

2x2 DataFrame
│ Row │ id    │ val   |
│     │ Int64 │ Int64 │ 
├─────┼───────┼───────┤
│ 1   │ 2     │ 5     │ 
│ 2   │ 8     │ 13    │ 

So my question is: What is an easy way to replace the missing values using the common id of the two dataframes, thus creating df_full:

6×3 DataFrame
│ Row │ id    │ val     │ other |
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ 5       │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ 13      │ 3     │
julia> using DataFrames

julia> df_missing = DataFrame(id = [5,2,1,4,6,8], val = [1, missing, 3,8,2,missing])
6×2 DataFrame
│ Row │ id    │ val     │
│     │ Int64 │ Int64?  │
├─────┼───────┼─────────┤
│ 1   │ 5     │ 1       │
│ 2   │ 2     │ missing │
│ 3   │ 1     │ 3       │
│ 4   │ 4     │ 8       │
│ 5   │ 6     │ 2       │
│ 6   │ 8     │ missing │

julia> df_completion = DataFrame(id = [2, 8], val = [5, 13])
2×2 DataFrame
│ Row │ id    │ val   │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 5     │
│ 2   │ 8     │ 13    │

julia> df_missing[in(df_completion.id).(df_missing.id), :val] = df_completion.val
2-element Array{Int64,1}:
  5
 13

julia> df_missing
6×2 DataFrame
│ Row │ id    │ val    │
│     │ Int64 │ Int64? │
├─────┼───────┼────────┤
│ 1   │ 5     │ 1      │
│ 2   │ 2     │ 5      │
│ 3   │ 1     │ 3      │
│ 4   │ 4     │ 8      │
│ 5   │ 6     │ 2      │
│ 6   │ 8     │ 13     │

I see you’re saying “without creating a loop” - note that you absolutely do not have to avoid writing loops in Julia, as Julia loops are not slow like in R/Python but fast like in C. In many instances a loop turns out to be the cleanest, most readable and fastest way of implementing an algorithm in Julia.

Realize that, unlike other languages where loops are slow and programming for critical tasks revolves around clever use of library functions, in Julia a loop is perfectly fine. So, you shouldn’t hesitate to write a loop if that is the simplest or most obvious way to solve a problem.

Thanks, I didn’t know about the loop thing, I will keep that in mind. I don’t understand exactly what the in is doing: why is there a dot between the two parenthesis?

This combines two thing: firstly, I create a Fix2 version of the in function by doing in(df_completion.id) - this essentially creates a function that checks whether it’s argument is in the id column of df_completion. I can use this function like in(df_completion.id)(x) and it will check whether x is in df_completion.id, i.e. it is equivalent to in(x, df_completion.id) (hence Fix2 - we have fixed the second argument to the function to be df_completion and thus turned it into a one argument function.

Now we want to apply this function to all ids in df_missing.id, so we use broadcasting, which is what the dot is doing - so in(df_completion.id).(x) applies the Fix2 version of in to x elementwise.

Very nice, thank you.

This seems like a missing functionality, tbh. What you are requesting is something similar to merge data1 data2, update from stata. I am surprised this doesn’t exist.

Can you file an issue with dataframes?

Also, I think the most transparent, though not necessarily efficient, thing to do here is

julia> df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true);

julia> df_full.val = map(df_full.val, df_full.val_1) do a, b
       ismissing(a) ? b : a
       end;

julia> select!(df_full, Not(:val_1))
6×2 DataFrame
│ Row │ id    │ val   │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 5     │ 1     │
│ 2   │ 2     │ 5     │
│ 3   │ 1     │ 3     │
│ 4   │ 4     │ 8     │
│ 5   │ 6     │ 2     │
│ 6   │ 8     │ 13    │

Just did it.