Recode multiple columns in DataFrame as integer

Hi all,

I want to recode symbols in a DataFrame as integers so I can use confusmat to create a confusion matrix. Here is what my DataFrame looks like:

using DataFrames 

df = DataFrame(x = [:a,:a,:b,:c,:c], y = [:b,:b,:c,:a,:c])

output

5×2 DataFrame
 Row │ x       y      
     │ Symbol  Symbol 
─────┼────────────────
   1 │ a       b
   2 │ a       b
   3 │ b       c
   4 │ c       a
   5 │ c       c

I want to recode the symbols so that:

5×4 DataFrame
 Row │ x       y       new_x  new_y 
     │ Symbol  Symbol  Int64  Int64 
─────┼──────────────────────────────
   1 │ a       b           1      2
   2 │ a       b           1      2
   3 │ b       c           2      3
   4 │ c       a           3      1
   5 │ c       c           3      3

Notice that the mapping must be the same in new_x and new_y. How can I do that with DataFrames?

Thanks!

I don’t have a great solution. Something like

julia> using CategoricalArrays

julia> levelcode.(categorical(string.(df.x)))
5-element Vector{Int64}:
 1
 1
 2
 3
 3

But this might be an X-Y problem. Unlike Stata, columns don’t need to be encoded as integers in order to use fixed effects.

1 Like

Thanks for your response. Unfortunately, I have identified a case in which the mapping is not identical between the columns. If a value is missing in one of the columns, the mapping will be different:

df = DataFrame(x = [:a,:a,:b,:c,:c], y = [:b,:b,:c,:b,:c])

using CategoricalArrays

df.new_x = levelcode.(categorical(string.(df.x)))

df.new_y = levelcode.(categorical(string.(df.y)))

output

5×4 DataFrame
 Row │ x       y       new_x  new_y 
     │ Symbol  Symbol  Int64  Int64 
─────┼──────────────────────────────
   1 │ a       b           1      1
   2 │ a       b           1      1
   3 │ b       c           2      2
   4 │ c       b           3      1
   5 │ c       c           3      2

In column new_x, c is mapped to 3, but it is mapped to 2 in column new_y.

In that case you probably need to extract all unique values from the original columns, create a mapping from these unique values, and then use something like replace to map the columns in the DataFrame to their corresponding new values.

EDIT: for example, if u is the set of all unique values in your initial DataFrame, then

mapping = Dict(zip(u,levelcode.(CategoricalArray(string.(u)))))
Dict{Symbol, Int64} with 3 entries:
  :a => 1
  :b => 2
  :c => 3

will create the mapping for you.

1 Like

A fun little two liner without any additional packages:

julia> uvals = sort(unique([df.x; df.y]))
3-element Vector{Symbol}:
 :a
 :b
 :c

ulia> leftjoin(leftjoin(df, DataFrame(x = uvals, new_x = 1:length(uvals)), on = :x), DataFrame(y = uvals, new_y = 1:length(uvals)), on = :y)
5×4 DataFrame
 Row │ x       y       new_x   new_y  
     │ Symbol  Symbol  Int64?  Int64? 
─────┼────────────────────────────────
   1 │ a       b            1       2
   2 │ a       b            1       2
   3 │ b       c            2       3
   4 │ c       a            3       1
   5 │ c       c            3       3
2 Likes

Thank you all for your help! I will mark this as the solution because it does not require an additional package. For future reference, others may consider tbeason’s solution too.

I was just copying from @pdeffebach lol

1 Like

You can do this to ensure that both columns use the same levels:

df = DataFrame(x = [:a,:a,:b,:c,:c], y = [:b,:b,:c,:b,:c])

using CategoricalArrays

levs = string.(union(unique(df.x), unique(df.y)))

df.new_x = levelcode.(categorical(string.(df.x), levels=levs))

df.new_y = levelcode.(categorical(string.(df.y), levels=levs))


julia> df
5×4 DataFrame
 Row │ x       y       new_x  new_y 
     │ Symbol  Symbol  Int64  Int64 
─────┼──────────────────────────────
   1 │ a       b           1      2
   2 │ a       b           1      2
   3 │ b       c           2      3
   4 │ c       b           3      2
   5 │ c       c           3      3
2 Likes

Another possibility:

hcat(df, Int.(getindex.(string.(df),1)) .- 96, makeunique=true)
1 Like