I want to recode symbols in a DataFrame as integers so I can use confusmat to create a confusion matrix. Here is what my DataFrame looks like:
using DataFrames
df = DataFrame(x = [:a,:a,:b,:c,:c], y = [:b,:b,:c,:a,:c])
output
5×2 DataFrame
Row │ x y
│ Symbol Symbol
─────┼────────────────
1 │ a b
2 │ a b
3 │ b c
4 │ c a
5 │ c c
I want to recode the symbols so that:
5×4 DataFrame
Row │ x y new_x new_y
│ Symbol Symbol Int64 Int64
─────┼──────────────────────────────
1 │ a b 1 2
2 │ a b 1 2
3 │ b c 2 3
4 │ c a 3 1
5 │ c c 3 3
Notice that the mapping must be the same in new_x and new_y. How can I do that with DataFrames?
Thanks for your response. Unfortunately, I have identified a case in which the mapping is not identical between the columns. If a value is missing in one of the columns, the mapping will be different:
df = DataFrame(x = [:a,:a,:b,:c,:c], y = [:b,:b,:c,:b,:c])
using CategoricalArrays
df.new_x = levelcode.(categorical(string.(df.x)))
df.new_y = levelcode.(categorical(string.(df.y)))
output
5×4 DataFrame
Row │ x y new_x new_y
│ Symbol Symbol Int64 Int64
─────┼──────────────────────────────
1 │ a b 1 1
2 │ a b 1 1
3 │ b c 2 2
4 │ c b 3 1
5 │ c c 3 2
In column new_x, c is mapped to 3, but it is mapped to 2 in column new_y.
In that case you probably need to extract all unique values from the original columns, create a mapping from these unique values, and then use something like replace to map the columns in the DataFrame to their corresponding new values.
EDIT: for example, if u is the set of all unique values in your initial DataFrame, then
A fun little two liner without any additional packages:
julia> uvals = sort(unique([df.x; df.y]))
3-element Vector{Symbol}:
:a
:b
:c
ulia> leftjoin(leftjoin(df, DataFrame(x = uvals, new_x = 1:length(uvals)), on = :x), DataFrame(y = uvals, new_y = 1:length(uvals)), on = :y)
5×4 DataFrame
Row │ x y new_x new_y
│ Symbol Symbol Int64? Int64?
─────┼────────────────────────────────
1 │ a b 1 2
2 │ a b 1 2
3 │ b c 2 3
4 │ c a 3 1
5 │ c c 3 3
Thank you all for your help! I will mark this as the solution because it does not require an additional package. For future reference, others may consider tbeason’s solution too.
You can do this to ensure that both columns use the same levels:
df = DataFrame(x = [:a,:a,:b,:c,:c], y = [:b,:b,:c,:b,:c])
using CategoricalArrays
levs = string.(union(unique(df.x), unique(df.y)))
df.new_x = levelcode.(categorical(string.(df.x), levels=levs))
df.new_y = levelcode.(categorical(string.(df.y), levels=levs))
julia> df
5×4 DataFrame
Row │ x y new_x new_y
│ Symbol Symbol Int64 Int64
─────┼──────────────────────────────
1 │ a b 1 2
2 │ a b 1 2
3 │ b c 2 3
4 │ c b 3 2
5 │ c c 3 3