Generating dummy variables from a vector of strings (one-hot encoding)

This is probably closer to what you want:

julia> using DataFrames, StatsModels

julia> df = DataFrame(x = rand(["a", "b"], 10), y = rand(10));

julia> X = ModelFrame(@formula(y ~ 0 + x), df, contrasts = Dict(:x => DummyCoding())) |> modelmatrix
10×2 Matrix{Float64}:
 0.0  1.0
 0.0  1.0
 1.0  0.0
 1.0  0.0
 0.0  1.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 1.0  0.0
 1.0  0.0

But it’s still clunky. You have to construct a formula with a left-hand side, build a ModelFrame from it, and then extract the model matrix from that object.

Hopefully someone can come along with something more elegant.

Of course, there is always the hand-written version:

julia> function make_dummy_matrix(x::AbstractVector)
           u = unique(x)  # levels, in order of first appearance
           m = Matrix{Bool}(undef, length(x), length(u))
           for i in eachindex(u)
               # column i flags the rows equal to level u[i]
               m[:, i] .= x .== u[i]
           end
           return m
       end;

julia> make_dummy_matrix(["a", "b", "a", "a"])
4×2 Matrix{Bool}:
 1  0
 0  1
 1  0
 1  0

though maybe this won’t scale that well to more complicated requirements.
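
For what it's worth, if the only extra requirement is a fixed set of levels (say, so the columns line up across several vectors), broadcasting gets you most of the way. This is a minimal sketch in plain Base Julia; `onehot` and its `levels` argument are names I made up for illustration, not part of any package:

```julia
# Hypothetical helper (not from any package): one-hot encode `x` against an
# explicit list of levels, so the caller controls the column order.
function onehot(x::AbstractVector, levels::AbstractVector = unique(x))
    # Comparing an n-vector against a 1×k row broadcasts to an n×k BitMatrix.
    return x .== permutedims(levels)
end

x = ["a", "b", "a", "a"]
onehot(x)                    # columns follow order of first appearance: "a", "b"
onehot(x, ["a", "b", "c"])   # third column is all false for the unused level "c"
```

It still knows nothing about contrast codings, and any `missing` entries in `x` come out as `missing` rather than `false`, so it is not a replacement for the StatsModels machinery.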
