I have a question about encoding categorical variables when constructing matrices to feed to ML models.
Consider this MWE:
julia> rawdata = DataFrame(make=["honda", "ford", "subaru"], mpg=[33,29,31], price=[30000,25000,27500])
3×3 DataFrame
│ Row │ make   │ mpg   │ price │
│     │ String │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ honda  │ 33    │ 30000 │
│ 2   │ ford   │ 29    │ 25000 │
│ 3   │ subaru │ 31    │ 27500 │
function encodemake(x::Array)
    # One Bool vector per row, with `true` at the position of that row's category
    uniques = unique(x)
    return [convert(Array, uniques .== n) for n in x]
end
julia> rawdata.make = encodemake(rawdata.make)
3-element Array{Array{Bool,1},1}:
[1, 0, 0]
[0, 1, 0]
[0, 0, 1]
julia> features = convert(Matrix, rawdata[!, [1,2]])
3×2 Array{Any,2}:
Bool[1, 0, 0] 33
Bool[0, 1, 0] 29
Bool[0, 0, 1] 31
julia> labels = rawdata.price
3-element Array{Int64,1}:
30000
25000
27500
I’m trying to understand whether it’s okay for a column of the feature matrix to hold one-hot vectors, so that each cell is itself a vector. When I feed this into DecisionTree.jl to build a random forest, it doesn’t throw any errors, so I’m wondering whether it actually works as intended. If so, is it generally okay to handle categorical variables this way, or should each category get its own column holding a simple 0/1 value? In this case, that would look like this:
rawdata = DataFrame(make=["honda", "ford", "subaru"], mpg=[33,29,31], price=[30000,25000,27500])
for c in unique(rawdata[!, :make])
    # Add one 0/1 indicator column per distinct make
    rawdata[!, Symbol(c)] = ifelse.(rawdata[!, :make] .== c, 1, 0)
end
julia> rawdata
3×6 DataFrame
│ Row │ make   │ mpg   │ price │ honda │ ford  │ subaru │
│     │ String │ Int64 │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼────────┼───────┼───────┼───────┼───────┼────────┤
│ 1   │ honda  │ 33    │ 30000 │ 1     │ 0     │ 0      │
│ 2   │ ford   │ 29    │ 25000 │ 0     │ 1     │ 0      │
│ 3   │ subaru │ 31    │ 27500 │ 0     │ 0     │ 1      │
julia> features = convert(Matrix, rawdata[!, [2,4,5,6]])
3×4 Array{Int64,2}:
33 1 0 0
29 0 1 0
31 0 0 1
julia> labels = rawdata.price
3-element Array{Int64,1}:
30000
25000
27500
Again, DecisionTree.jl builds a random forest from this without complaint, but the results are slightly different from those of the first approach.
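To make the difference between the two layouts concrete, here is a minimal sketch that builds both feature matrices from plain arrays (no DataFrames) and compares their element types. The helper name `onehot_columns` is hypothetical, made up for this illustration. The sketch only shows what the matrices contain; it assumes nothing about how DecisionTree.jl treats either one internally.

```julia
# Hypothetical helper (not from any package): returns the distinct levels
# and a 0/1 indicator matrix with one column per level.
function onehot_columns(x::AbstractVector)
    levels = unique(x)
    indicators = Int[xi == l for xi in x, l in levels]
    return levels, indicators
end

make = ["honda", "ford", "subaru"]
mpg  = [33, 29, 31]

# Layout 2: one numeric column per category
levels, indicators = onehot_columns(make)
features = hcat(mpg, indicators)

# Layout 1: a single column whose cells are whole one-hot vectors
nested = [levels .== m for m in make]
nested_features = hcat(nested, mpg)

println(eltype(features))         # concrete integer element type
println(eltype(nested_features))  # Any, because the cells mix vectors and integers
```

In the second layout every cell is a plain number, so each category is a separate feature. In the first layout the matrix has element type `Any` and the `make` column is a single feature whose values are vectors, which may be one reason the two forests do not give identical results.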