Encoding categorical variables within a matrix

I have a question about encoding categorical variables when constructing matrices to feed to ML models.
Consider this MWE:

julia> rawdata = DataFrame(make=["honda", "ford", "subaru"], mpg=[33,29,31], price=[30000,25000,27500])
3×3 DataFrame
│ Row │ make   │ mpg   │ price │
│     │ String │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ honda  │ 33    │ 30000 │
│ 2   │ ford   │ 29    │ 25000 │
│ 3   │ subaru │ 31    │ 27500 │

function encodemake(x::Array)
    uniques = unique(x)
    return [convert(Array, uniques .== n) for n in x]
end

julia> rawdata.make = encodemake(rawdata.make)
3-element Array{Array{Bool,1},1}:
 [1, 0, 0]
 [0, 1, 0]
 [0, 0, 1]

julia> features = convert(Matrix, rawdata[!, [1,2]])
3×2 Array{Any,2}:
 Bool[1, 0, 0]  33
 Bool[0, 1, 0]  29
 Bool[0, 0, 1]  31

julia> labels = rawdata.price
3-element Array{Int64,1}:
 30000
 25000
 27500

I’m trying to understand whether it’s okay for a column of the matrix to be a bunch of one-hot vectors. When I feed this into DecisionTree.jl to build a random forest, it doesn’t throw any errors, so I’m wondering if this actually works. If so, is it generally okay to deal with categorical variables in this way, or should each level be its own column with a simple boolean value? In this case, that would look like this:

rawdata = DataFrame(make=["honda", "ford", "subaru"], mpg=[33,29,31], price=[30000,25000,27500])

for c in unique(rawdata[!, :make])
    rawdata[!, Symbol(c)] = ifelse.(rawdata[!, :make] .== c, 1, 0)
end

julia> rawdata
3×6 DataFrame
│ Row │ make   │ mpg   │ price │ honda │ ford  │ subaru │
│     │ String │ Int64 │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼────────┼───────┼───────┼───────┼───────┼────────┤
│ 1   │ honda  │ 33    │ 30000 │ 1     │ 0     │ 0      │
│ 2   │ ford   │ 29    │ 25000 │ 0     │ 1     │ 0      │
│ 3   │ subaru │ 31    │ 27500 │ 0     │ 0     │ 1      │

julia> features = convert(Matrix, rawdata[!, [2,4,5,6]])
3×4 Array{Int64,2}:
 33  1  0  0
 29  0  1  0
 31  0  0  1

julia> labels = rawdata.price
3-element Array{Int64,1}:
 30000
 25000
 27500

Again, no complaints when building a random forest with this via DecisionTree.jl, but the results are slightly different.

I’m really surprised no one has replied to this…maybe everyone is too busy with the holidays so I’ll try re-posting after the new year :clinking_glasses: :stuck_out_tongue_winking_eye:

The second approach is the standard one, so I’d just use that. The simplest reason is that while DecisionTree.jl may accept a column of one-hot vectors, other algorithms may not (e.g. I’m not sure a GLM regression will work).
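
For what it’s worth, here is a minimal sketch of the second approach in plain Julia, with no packages. `onehot_columns` is a hypothetical helper written for this example, not a function from DataFrames or DecisionTree.jl, and it assumes the levels should be ordered by first appearance:

```julia
# Hypothetical helper (not from any package): expand a categorical vector
# into one 0/1 column per level, with levels in order of first appearance.
function onehot_columns(x::AbstractVector)
    levels = unique(x)
    # Rows follow the observations in x, columns follow the levels.
    onehot = Int[xi == l for xi in x, l in levels]
    return onehot, levels
end

make = ["honda", "ford", "subaru"]
mpg  = [33, 29, 31]

onehot, levels = onehot_columns(make)

# Numeric feature matrix: mpg first, then one column per make.
features = hcat(mpg, onehot)
```

The point is that `features` ends up with a concrete numeric element type (`Int64`) rather than `Any`, which is what most model-fitting code expects from a feature matrix.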

Also, (with apologies for the product placement):

using DataFrames, MLJ
rawdata = DataFrame(make=["honda", "ford", "subaru"], mpg=[33,29,31], price=[30000,25000,27500])
@load DecisionTreeRegressor
pp = @pipeline OneHotDTR(
                X -> coerce(X, :make=>Multiclass),
                hot = OneHotEncoder(),
                dtc = DecisionTreeRegressor())
X = select(rawdata, Not(:price))
y = rawdata.price
dtm = machine(pp, X, y)
fit!(dtm)

Explanation:

  • X -> coerce ... indicates that the :make column should be interpreted as Multiclass (this is not automatic for a column of strings)
  • OneHotEncoder will be applied to all such categorical variables (here just :make)

The rest should be straightforward.

Thanks for the info! :smiley:
