Hello!
So I am new to the language, I have been learning now about all the joins
, the select, combine, transform, etc.
But there is an issue that I don’t know how to solve without creating a for loop:
Suppose I have an dataset called df_missing
:
6×3 DataFrame
│ Row │ id │ val │ other |
│ │ Int64 │ Int64? │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1 │ 5 │ 1 │ 4 │
│ 2 │ 2 │ missing │ 4 │
│ 3 │ 1 │ 3 │ 3 │
│ 4 │ 4 │ 8 │ 4 │
│ 5 │ 6 │ 2 │ 4 │
│ 6 │ 8 │ missing │ 3 │
and I also have another dataset, called df_completion
:
2x2 DataFrame
│ Row │ id │ val |
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 2 │ 5 │
│ 2 │ 8 │ 13 │
So my question is: What is an easy way to replace the missing values using the common id
of the two dataframes, thus creating df_full
:
6×3 DataFrame
│ Row │ id │ val │ other |
│ │ Int64 │ Int64? │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1 │ 5 │ 1 │ 4 │
│ 2 │ 2 │ 5 │ 4 │
│ 3 │ 1 │ 3 │ 3 │
│ 4 │ 4 │ 8 │ 4 │
│ 5 │ 6 │ 2 │ 4 │
│ 6 │ 8 │ 13 │ 3 │
1 Like
julia> using DataFrames
julia> df_missing = DataFrame(id = [5,2,1,4,6,8], val = [1, missing, 3,8,2,missing])
6×2 DataFrame
│ Row │ id │ val │
│ │ Int64 │ Int64? │
├─────┼───────┼─────────┤
│ 1 │ 5 │ 1 │
│ 2 │ 2 │ missing │
│ 3 │ 1 │ 3 │
│ 4 │ 4 │ 8 │
│ 5 │ 6 │ 2 │
│ 6 │ 8 │ missing │
julia> df_completion = DataFrame(id = [2, 8], val = [5, 13])
2×2 DataFrame
│ Row │ id │ val │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 2 │ 5 │
│ 2 │ 8 │ 13 │
julia> df_missing[in(df_completion.id).(df_missing.id), :val] = df_completion.val
2-element Array{Int64,1}:
5
13
julia> df_missing
6×2 DataFrame
│ Row │ id │ val │
│ │ Int64 │ Int64? │
├─────┼───────┼────────┤
│ 1 │ 5 │ 1 │
│ 2 │ 2 │ 5 │
│ 3 │ 1 │ 3 │
│ 4 │ 4 │ 8 │
│ 5 │ 6 │ 2 │
│ 6 │ 8 │ 13 │
I see you’re saying “without creating a loop” - note that you absolutely do not have to avoid writing loops in Julia, as Julia loops are not slow like in R/Python but fast like in C. In many instances a loop turns out to be the cleanest, most readable and fastest way of implementing an algorithm in Julia.
2 Likes
Realize that, unlike other languages where loops are slow and programming for critical tasks revolves around clever use of library functions, in Julia a loop is perfectly fine. So, you shouldn’t hesitate to write a loop if that is the simplest or most obvious way to solve a problem.
4 Likes
Thanks, I didn’t know about the loop thing, I will keep that in mind. I don’t understand exactly what the in
is doing: why is there a dot between the two parenthesis?
This combines two thing: firstly, I create a Fix2
version of the in
function by doing in(df_completion.id)
- this essentially creates a function that checks whether it’s argument is in the id
column of df_completion
. I can use this function like in(df_completion.id)(x)
and it will check whether x
is in df_completion.id
, i.e. it is equivalent to in(x, df_completion.id)
(hence Fix2
- we have fixed the second argument to the function to be df_completion
and thus turned it into a one argument function.
Now we want to apply this function to all id
s in df_missing.id
, so we use broadcasting, which is what the dot is doing - so in(df_completion.id).(x)
applies the Fix2
version of in
to x
elementwise.
1 Like
This seems like a missing functionality, tbh. What you are requesting is something similar to merge data1 data2, update
from stata. I am surprised this doesn’t exist.
Can you file an issue with dataframes?
Also, I think the most transparent, though not necessarily efficient, thing to do here is
julia> df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true);
julia> df_full.val = map(df_full.val, df_full.val_1) do a, b
ismissing(a) ? b : a
end;
julia> select!(df_full, Not(:val_1))
6×2 DataFrame
│ Row │ id │ val │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 5 │ 1 │
│ 2 │ 2 │ 5 │
│ 3 │ 1 │ 3 │
│ 4 │ 4 │ 8 │
│ 5 │ 6 │ 2 │
│ 6 │ 8 │ 13 │
1 Like