Generating dummy variables from a vector of strings (one-hot encoding)

Hello everyone,

New Julia user here.

Suppose I have a vector of strings, and I’d like to generate a dummy variable for each unique value that appears in that vector.

For instance:

data = ["a", "b", "a", "a", "b"]

I want to create a matrix of dummy variables along the following lines:

mat_col1    mat_col2
1           0
0           1
1           0
1           0
0           1

where mat_col1 is a dummy for the level “a” and mat_col2 for level “b”.

I was wondering how this could be done. I’ve played around with StatsModels.ContrastsMatrix, but that creates a dummy for every observation × level combination rather than one per level.

For instance,

using StatsModels
StatsModels.ContrastsMatrix(StatsModels.DummyCoding(), ["a", "b", "a", "a", "b"]).matrix

gives

0.0  0.0  0.0  0.0
1.0  0.0  0.0  0.0
0.0  1.0  0.0  0.0
0.0  0.0  1.0  0.0
0.0  0.0  0.0  1.0

So every column here is a row-specific dummy variable rather than a level-specific one (the “a” in row 1 gets a different dummy than the “a” in row 3). This therefore doesn’t give me what I need.

I’m also looking for full dummy coding, i.e. a dummy variable for both a and b rather than just for b. StatsModels.ContrastsMatrix seems to choose a base level and leave it out, although FullDummyCoding seems to get around that.

The example I’ve given is stylized; my actual problem has many observations and levels, so manually creating a dummy variable for each unique value would be tedious.

That’s very weird behavior from StatsModels. It’s not what I would have expected… Maybe @dave.f.kleinschmidt can pop in and let us know what’s going on.

One thing though: taking a look at the docs of StatsModels.ContrastsMatrix with ?, the second argument is the levels, not the values themselves. So I think it’s confused because the elements of the vector are not unique.

1 Like

This is probably closer to what you want:

julia> df = DataFrame(x = rand(["a", "b"], 10), y = rand(10));

julia> mf = ModelFrame(@formula(y ~ 0 + x), df, contrasts = Dict(:x => DummyCoding())) |> modelmatrix
10×2 Matrix{Float64}:
 0.0  1.0
 0.0  1.0
 1.0  0.0
 1.0  0.0
 0.0  1.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 1.0  0.0
 1.0  0.0

But it’s still clunky. You have to actually construct a formula with a LHS and then get the X term from that object.

Hopefully someone can come along with something more elegant.

Of course, there is always

julia> function make_dummy_matrix(x::AbstractVector)
           u = unique(x)
           m = Matrix{Bool}(undef, length(x), length(u))
           for i in eachindex(u)
               m[:, i] .= x .== u[i]  # indicator column for level u[i]
           end
           return m
       end;

julia> make_dummy_matrix(["a", "b", "a", "a"])
4×2 Matrix{Bool}:
 1  0
 0  1
 1  0
 1  0

though maybe this won’t scale that well to more complicated requirements.
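If you also need to know which level each column encodes, a small variation on the same idea (just a sketch; `make_dummy_matrix_labeled` is a hypothetical name, not from any package) returns the levels alongside the matrix:

```julia
# Sketch: build one Bool column per level and return the levels too,
# so the column -> level mapping isn't lost.
function make_dummy_matrix_labeled(x::AbstractVector)
    u = unique(x)
    m = reduce(hcat, [x .== lvl for lvl in u])
    return m, u
end

m, levels = make_dummy_matrix_labeled(["a", "b", "a", "a", "b"])
# levels tells you that column 1 is the dummy for "a", column 2 for "b"
```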

2 Likes

Well, it’s not as good as the previous solution, but it’s fun in its own way. Inspired by Flux’s one-hot encoding:

using SparseArrays
function make_dummy(data::AbstractVector)
    udata = unique(data)
    inds = map(x -> findfirst(==(x), udata), data)
    sparse(1:length(data), inds, ones(Int, length(data)))
end

and it works like this

julia> data = ["a", "b", "a", "a", "b"]
5-element Vector{String}:
 "a"
 "b"
 "a"
 "a"
 "b"

julia> make_dummy(data)
5×2 SparseMatrixCSC{Int64, Int64} with 5 stored entries:
 1  ⋅
 ⋅  1
 1  ⋅
 1  ⋅
 ⋅  1

And this looks even better

julia> data = rand('a':'z', 100);

julia> make_dummy(data)
100×26 SparseMatrixCSC{Int64, Int64} with 100 stored entries:
⠑⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠑⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠁⠀⠠⢂⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⡀⠀⠑⠄⠀⠀⠀⠀⠀⠀
⠀⠀⡁⠀⠀⠀⠐⠄⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠐⠈⢄⠀⠀⠀⠀
⠀⠀⠀⡈⠀⠀⠀⠀⠀⠆⠀⠀⠀
⠀⠀⠀⠀⠀⠄⠀⡀⠀⠃⠀⠀⠀
⠀⠰⠀⠀⠈⠀⠀⠀⢀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠑⠀⠠⡀⠀⠀
⠀⠀⠀⠄⠀⠀⠀⠀⠀⠈⡐⠀⠀
⠀⠀⠀⠀⠐⡀⠀⠀⠀⠀⠠⠁⠀
⠀⠀⠀⢀⠀⠀⠐⠄⠀⠁⠀⠀⠀
⠂⠀⠀⠀⠀⠄⠀⢀⠀⠀⠀⠈⠀
⠀⠀⠀⠀⠠⠀⠀⠀⠀⡀⠁⠐⠀
⠈⢀⠂⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀
⠄⠀⠀⠀⠀⠀⠀⢀⠀⠁⠀⠐⠀
⠄⠀⠀⡀⠈⠀⠀⠀⠀⠐⠀⠀⠀
⠀⢀⠀⠆⠀⠀⠀⠀⠀⠈⠀⠀⠀
⠀⠈⠀⠀⠀⢀⠀⠀⠀⠂⠀⠠⠀
⠈⠀⠀⠀⠀⠀⠀⠀⠂⡀⠀⠀⠄
⠀⠀⠀⠀⠀⠁⠀⠀⡄⠀⠀⠀⠐
⠀⠀⠈⠀⠀⠠⠀⠐⠀⠀⠀⢀⠀
⠀⠀⠃⠀⡀⠀⠀⠀⠀⠀⠀⠠⠀
⡀⠀⠀⠀⠀⠂⠀⠀⠁⠀⠀⠄⠀
3 Likes

Hi @rubaiyat and welcome! There are lots of tools for doing this in Julia…I think MLJ has some as well, as does Flux: One-Hot Encoding · Flux (as I think someone else mentioned).

If you’re just looking to do a one-off conversion, StatsModels might be a bit of overkill. I’d recommend using it when you’re trying to (1) work with data that’s coming from a table and (2) combine the one-hot encoded features with some others into a single “model matrix” (like for fitting a regression model), which is really the problem StatsModels was designed to solve.

If you do want to use StatsModels for this, I would do something like:

julia> using StatsModels, DataFrames

julia> data = DataFrame(x = string.(rand('a':'c', 10)))
10×1 DataFrame
 Row │ x
     │ String
─────┼────────
   1 │ c
   2 │ b
   3 │ b
   4 │ b
   5 │ c
   6 │ c
   7 │ a
   8 │ b
   9 │ a
  10 │ c

julia> t = concrete_term(Term(:x), data.x, StatsModels.FullDummyCoding())
x(StatsModels.FullDummyCoding:3→3)

julia> modelcols(t, data)
10×3 Matrix{Float64}:
 0.0  0.0  1.0
 0.0  1.0  0.0
 0.0  1.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 1.0  0.0  0.0
 0.0  1.0  0.0
 1.0  0.0  0.0
 0.0  0.0  1.0

Or:

julia> f = @formula(0 ~ x)
FormulaTerm
Response:
  0
Predictors:
  x(unknown)

julia> f = apply_schema(f, schema(data, Dict(:x => StatsModels.FullDummyCoding())))
FormulaTerm
Response:
  0
Predictors:
  x(StatsModels.FullDummyCoding:3→3)

julia> last(modelcols(f, data))
10×3 Matrix{Float64}:
 0.0  0.0  1.0
 0.0  1.0  0.0
 0.0  1.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 1.0  0.0  0.0
 0.0  1.0  0.0
 1.0  0.0  0.0
 0.0  0.0  1.0

(schema calls concrete_term, so these are basically equivalent; the first is just a bit more direct. last is necessary in the second version because modelcols(::FormulaTerm) gives you a tuple of the model cols for the left- and right-hand sides. You could also replace f with the Term(:x) from the first method, and then last wouldn’t be necessary.)

As you can see, this is pretty awkward if all you want to do is get a one-hot encoding matrix :slight_smile: Also note that you have to use StatsModels.FullDummyCoding because the default dummy coding drops one level, since this is what people expect most of the time for “statistical models” (like regression), except in certain special circumstances that StatsModels tries to detect (lots of info on this in the docs).

4 Likes

Yeah, that’s what’s happening… we should probably throw an error if the levels are not unique! (Edit: StatsModels v0.6.24 will error: https://github.com/JuliaStats/StatsModels.jl/pull/231.) I think this is the first time I know of this coming up in practice though :slight_smile:

4 Likes

Also, to further clarify what ContrastsMatrix is doing/means: the contrasts matrix is essentially the embedding matrix, where each row corresponds to the model matrix row that will be generated for the corresponding level of the data. It tells you HOW to generate the one-hot (etc.) embedded representation but doesn’t actually generate it for you! That’s what modelcols does.
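To see this concretely, here’s a small sketch (assuming StatsModels is loaded) passing the unique levels as the second argument, so the contrasts matrix has exactly one row per level:

```julia
using StatsModels

# Passing the *levels* (unique values), each row of the contrasts
# matrix is the model-matrix row that will be generated for that level.
cm = StatsModels.ContrastsMatrix(StatsModels.DummyCoding(), ["a", "b"])
cm.matrix  # 2×1: the base level "a" maps to 0.0, "b" maps to 1.0
```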

3 Likes

Here is a pretty complete list of ways to do one-hot encoding.

It looks like the easiest might be StatsBase.indicatormat
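For the original example, that might look like the following (a sketch; note that indicatormat returns a levels × observations matrix, so it needs a permutedims to match the layout the OP asked for):

```julia
using StatsBase

data = ["a", "b", "a", "a", "b"]
# indicatormat gives one *row* per unique level;
# transpose to get observations in rows and levels in columns.
onehot = permutedims(indicatormat(data))
```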

3 Likes

My favorite: All the ways to do one-hot encoding - #4 by Mattriks

3 Likes

@danielw2904, which for the OP’s case could be slightly adjusted to:

julia> permutedims(unique(data)) .== data   # by @Mattriks
5×2 BitMatrix:
 1  0
 0  1
 1  0
 1  0
 0  1
2 Likes