Encoding categorical variables within a matrix

mthelm85 · December 23, 2019, 6:56pm

I have a question about encoding categorical variables when constructing matrices to feed to ML models.
Consider this MWE:

julia> rawdata = DataFrame(make=["honda", "ford", "subaru"], mpg=[33,29,31], price=[30000,25000,27500])
3×3 DataFrame
│ Row │ make   │ mpg   │ price │
│     │ String │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ honda  │ 33    │ 30000 │
│ 2   │ ford   │ 29    │ 25000 │
│ 3   │ subaru │ 31    │ 27500 │

function encodemake(x::Array)
    uniques = unique(x)
    return [convert(Array, uniques .== n) for n in x]
end

julia> rawdata.make = encodemake(rawdata.make)
3-element Array{Array{Bool,1},1}:
 [1, 0, 0]
 [0, 1, 0]
 [0, 0, 1]

julia> features = convert(Matrix, rawdata[!, [1,2]])
3×2 Array{Any,2}:
 Bool[1, 0, 0]  33
 Bool[0, 1, 0]  29
 Bool[0, 0, 1]  31

julia> labels = rawdata.price
3-element Array{Int64,1}:
 30000
 25000
 27500

I’m trying to understand whether it’s okay for a column of the matrix to hold one-hot vectors. When I feed this into DecisionTree.jl to build a random forest, it doesn’t throw any errors, so I’m wondering if this actually works. If so, is it generally okay to deal with categorical variables this way, or should each category get its own column holding a simple boolean value? In this case, that would look like this:

rawdata = DataFrame(make=["honda", "ford", "subaru"], mpg=[33,29,31], price=[30000,25000,27500])

for c in unique(rawdata[!, :make])
    rawdata[!, Symbol(c)] = ifelse.(rawdata[!, :make] .== c, 1, 0)
end

julia> rawdata
3×6 DataFrame
│ Row │ make   │ mpg   │ price │ honda │ ford  │ subaru │
│     │ String │ Int64 │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼────────┼───────┼───────┼───────┼───────┼────────┤
│ 1   │ honda  │ 33    │ 30000 │ 1     │ 0     │ 0      │
│ 2   │ ford   │ 29    │ 25000 │ 0     │ 1     │ 0      │
│ 3   │ subaru │ 31    │ 27500 │ 0     │ 0     │ 1      │

julia> features = convert(Matrix, rawdata[!, [2,4,5,6]])
3×4 Array{Int64,2}:
 33  1  0  0
 29  0  1  0
 31  0  0  1

julia> labels = rawdata.price
3-element Array{Int64,1}:
 30000
 25000
 27500

Again, no complaints when building a random forest with this via DecisionTree.jl, but the results are slightly different from those of the first approach.
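For what it's worth, the two layouts differ in more than shape. The sketch below rebuilds both feature matrices in plain Julia (no DataFrames; the names make_levels, features_nested and features_flat are made up for illustration and do not appear in the code above) and checks their element types.

```julia
makes = ["honda", "ford", "subaru"]
mpg   = [33, 29, 31]
make_levels = unique(makes)          # hypothetical name for the distinct makes

# Layout 1: each row stores a whole one-hot vector in a single cell.
onehot_cells    = [make_levels .== m for m in makes]
features_nested = hcat(onehot_cells, mpg)

# Layout 2: one 0/1 column per make (rows follow makes, columns follow make_levels).
onehot_cols   = [Int(m == l) for m in makes, l in make_levels]
features_flat = hcat(mpg, onehot_cols)

println(size(features_nested), " ", eltype(features_nested))
println(size(features_flat), " ", eltype(features_flat))
```

In the nested layout the matrix has only two columns and an element type of Any, so a model sees each one-hot vector as one opaque feature value. The flat layout is an ordinary integer matrix with one column per make, matching the 3×4 matrix shown above.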

mthelm85 · December 26, 2019, 12:28am

I’m really surprised no one has replied to this… Maybe everyone is too busy with the holidays, so I’ll try re-posting after the new year.

tlienart · December 27, 2019, 4:16pm

The second approach is the standard one, so maybe just use that? The simplest reason is that while DecisionTree.jl may accept a column of one-hot vectors, other algorithms may not (e.g. I'm not sure that a GLM regression will work).

Also (with apologies for the product placement):
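A possible reason the two approaches give different trees: I have not checked how DecisionTree.jl handles vector-valued cells internally, so this is an assumption, but if an algorithm simply compares feature values with <, Base Julia will order whole vectors lexicographically. A column of one-hot vectors then behaves like a single ordered feature rather than three independent 0/1 features. A minimal sketch of what Base does:

```julia
honda  = Bool[1, 0, 0]
ford   = Bool[0, 1, 0]
subaru = Bool[0, 0, 1]

# Base Julia compares vectors element by element (lexicographic order).
ordered = subaru < ford < honda

# Sorting the cells therefore imposes a ranking on the makes.
ranking = sortperm([honda, ford, subaru])

println(ordered)
println(ranking)
```

Under that assumption, a threshold split on the nested column can only separate makes along this one ordering, whereas separate columns let a split isolate any single make.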

using DataFrames, MLJ
rawdata = DataFrame(make=["honda", "ford", "subaru"], mpg=[33,29,31], price=[30000,25000,27500])
@load DecisionTreeRegressor
# Pipeline: coerce :make to Multiclass, one-hot encode it, then fit the tree
pp = @pipeline OneHotDTR(
                X -> coerce(X, :make=>Multiclass),
                hot = OneHotEncoder(),
                dtc = DecisionTreeRegressor())
# Features are every column except the target
X = select(rawdata, Not(:price))
y = rawdata.price
# Bind the pipeline to the data and train it
dtm = machine(pp, X, y)
fit!(dtm)

Explanation:

X -> coerce ... indicates that the column :make should be interpreted as Multiclass (this is not automatic for a column of strings).
OneHotEncoder is then applied to all such categorical columns (here just :make).

The rest should be straightforward.

mthelm85 · December 28, 2019, 1:13am

Thanks for the info!
