Impute by index

Hello!

So I am new to the language, I have been learning now about all the joins, the select, combine, transform, etc.

But there is an issue that I don’t know how to solve without creating a for loop:

Suppose I have an dataset called df_missing:

6×3 DataFrame
│ Row │ id    │ val     │ other |
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ missing │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ missing │ 3     │

and I also have another dataset, called df_completion:

2x2 DataFrame
│ Row │ id    │ val   |
│     │ Int64 │ Int64 │ 
├─────┼───────┼───────┤
│ 1   │ 2     │ 5     │ 
│ 2   │ 8     │ 13    │ 

So my question is: What is an easy way to replace the missing values using the common id of the two dataframes, thus creating df_full:

6×3 DataFrame
│ Row │ id    │ val     │ other |
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ 5       │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ 13      │ 3     │
1 Like
julia> using DataFrames

julia> df_missing = DataFrame(id = [5,2,1,4,6,8], val = [1, missing, 3,8,2,missing])
6×2 DataFrame
│ Row │ id    │ val     │
│     │ Int64 │ Int64?  │
├─────┼───────┼─────────┤
│ 1   │ 5     │ 1       │
│ 2   │ 2     │ missing │
│ 3   │ 1     │ 3       │
│ 4   │ 4     │ 8       │
│ 5   │ 6     │ 2       │
│ 6   │ 8     │ missing │

julia> df_completion = DataFrame(id = [2, 8], val = [5, 13])
2×2 DataFrame
│ Row │ id    │ val   │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 5     │
│ 2   │ 8     │ 13    │

julia> df_missing[in(df_completion.id).(df_missing.id), :val] = df_completion.val
2-element Array{Int64,1}:
  5
 13

julia> df_missing
6×2 DataFrame
│ Row │ id    │ val    │
│     │ Int64 │ Int64? │
├─────┼───────┼────────┤
│ 1   │ 5     │ 1      │
│ 2   │ 2     │ 5      │
│ 3   │ 1     │ 3      │
│ 4   │ 4     │ 8      │
│ 5   │ 6     │ 2      │
│ 6   │ 8     │ 13     │

I see you’re saying “without creating a loop” - note that you absolutely do not have to avoid writing loops in Julia, as Julia loops are not slow like in R/Python but fast like in C. In many instances a loop turns out to be the cleanest, most readable and fastest way of implementing an algorithm in Julia.

2 Likes

Realize that, unlike other languages where loops are slow and programming for critical tasks revolves around clever use of library functions, in Julia a loop is perfectly fine. So, you shouldn’t hesitate to write a loop if that is the simplest or most obvious way to solve a problem.

4 Likes

Thanks, I didn’t know about the loop thing, I will keep that in mind. I don’t understand exactly what the in is doing: why is there a dot between the two parenthesis?

This combines two thing: firstly, I create a Fix2 version of the in function by doing in(df_completion.id) - this essentially creates a function that checks whether it’s argument is in the id column of df_completion. I can use this function like in(df_completion.id)(x) and it will check whether x is in df_completion.id, i.e. it is equivalent to in(x, df_completion.id) (hence Fix2 - we have fixed the second argument to the function to be df_completion and thus turned it into a one argument function.

Now we want to apply this function to all ids in df_missing.id, so we use broadcasting, which is what the dot is doing - so in(df_completion.id).(x) applies the Fix2 version of in to x elementwise.

1 Like

Very nice, thank you.

This seems like a missing functionality, tbh. What you are requesting is something similar to merge data1 data2, update from stata. I am surprised this doesn’t exist.

Can you file an issue with dataframes?

Also, I think the most transparent, though not necessarily efficient, thing to do here is

julia> df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true);

julia> df_full.val = map(df_full.val, df_full.val_1) do a, b
       ismissing(a) ? b : a
       end;

julia> select!(df_full, Not(:val_1))
6×2 DataFrame
│ Row │ id    │ val   │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 5     │ 1     │
│ 2   │ 2     │ 5     │
│ 3   │ 1     │ 3     │
│ 4   │ 4     │ 8     │
│ 5   │ 6     │ 2     │
│ 6   │ 8     │ 13    │
1 Like

Just did it.