Generating dummy variables from a vector of strings (one-hot encoding)

rubaiyat · July 29, 2021, 4:50pm

Hello everyone,

New Julia user here.

Suppose I have a vector of strings, and I’d like to generate a dummy variable for each unique value that appears in that vector.

For instance:

data = ["a", "b", "a", "a", "b"]

I want to create a matrix of dummy variables along the following lines:

mat_col1    mat_col2
1           0
0           1
1           0
1           0
0           1

where mat_col1 is a dummy for the level “a” and mat_col2 for level “b”.

I was wondering how this could be done. I’ve messed around with StatsModels.ContrastsMatrix, but that creates a dummy for every observation-level combination rather than one per level.

For instance,

using StatsModels
StatsModels.ContrastsMatrix(StatsModels.DummyCoding(), ["a", "b", "a", "a", "b"]).matrix

gives

0.0  0.0  0.0  0.0
1.0  0.0  0.0  0.0
0.0  1.0  0.0  0.0
0.0  0.0  1.0  0.0
0.0  0.0  0.0  1.0

So each column here is a row-specific dummy variable (the “a” in row 1 gets a different dummy variable from the “a” in row 3), which isn’t what I need.

I’m also looking for full dummy coding, i.e. a dummy variable for both a and b, rather than for just b. StatsModels.ContrastsMatrix seems to choose a base level and leave that out, although FullDummyCoding seems to get around that.

The example I’ve given is stylized; my actual problem has many observations and levels, so manually creating a dummy variable for each unique value isn’t practical.

pdeffebach · July 29, 2021, 6:42pm

That’s very weird behavior from StatsModels. It’s not what I would have expected… Maybe @dave.f.kleinschmidt can pop in and let us know what’s going on.

One thing though: looking at the docs of StatsModels.ContrastsMatrix with ?, the 2nd argument is the levels, not the values themselves. So I think it’s confused because the elements of the vector are not unique.

pdeffebach · July 29, 2021, 6:54pm

This is probably closer to what you want

julia> using DataFrames, StatsModels

julia> df = DataFrame(x = rand(["a", "b"], 10), y = rand(10));

julia> mf = ModelFrame(@formula(y ~ 0 + x), df, contrasts = Dict(:x => DummyCoding())) |> modelmatrix
10×2 Matrix{Float64}:
 0.0  1.0
 0.0  1.0
 1.0  0.0
 1.0  0.0
 0.0  1.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 1.0  0.0
 1.0  0.0

But it’s still clunky. You have to actually construct a formula with a LHS and then get the X term from that object.

Hopefully someone can come along with something more elegant.

Of course, there is always

julia> function make_dummy_matrix(x::AbstractVector)
           u = unique(x)
           m = Matrix{Bool}(undef, length(x), length(u))
           for i in eachindex(u)
               m[:, i] .= x .== u[i]
           end
           return m
       end;

julia> make_dummy_matrix(["a", "b", "a", "a"])
4×2 Matrix{Bool}:
 1  0
 0  1
 1  0
 1  0

though maybe this won’t scale that well to more complicated requirements.
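
For many observations and levels, one possible variation (a sketch, not code from anyone in this thread) is to look up each value's column in a Dict, so the data is traversed once rather than once per level. The name `make_dummy_matrix_dict` and the choice to return the levels alongside the matrix are illustrative assumptions.

```julia
# Hypothetical helper: one pass over the data, using a Dict from level to column.
function make_dummy_matrix_dict(x::AbstractVector)
    levels = unique(x)                                  # columns follow order of first appearance
    col = Dict(l => j for (j, l) in enumerate(levels))  # level => column index
    m = falses(length(x), length(levels))               # BitMatrix filled with false
    for (i, v) in enumerate(x)
        m[i, col[v]] = true                             # mark the column of this row's level
    end
    return m, levels
end

m, levels = make_dummy_matrix_dict(["a", "b", "a", "a", "b"])
```

Returning `levels` keeps track of which column corresponds to which value, which the matrix alone does not record.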

Skoffer · July 29, 2021, 7:48pm

Well, it’s not as good as the previous solution, but it’s fun in its own way. Inspired by Flux’s one-hot encoding:

using SparseArrays
function make_dummy(data::AbstractVector)
    udata = unique(data)
    inds = map(x -> findfirst(==(x), udata), data)
    sparse(1:length(data), inds, ones(Int, length(data)))
end

and it works like this

julia> data = ["a", "b", "a", "a", "b"]
5-element Vector{String}:
 "a"
 "b"
 "a"
 "a"
 "b"

julia> make_dummy(data)
5×2 SparseMatrixCSC{Int64, Int64} with 5 stored entries:
 1  ⋅
 ⋅  1
 1  ⋅
 1  ⋅
 ⋅  1

And this looks even better

julia> data = rand('a':'z', 100);

julia> make_dummy(data)
100×26 SparseMatrixCSC{Int64, Int64} with 100 stored entries:
⠑⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠑⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠁⠀⠠⢂⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⡀⠀⠑⠄⠀⠀⠀⠀⠀⠀
⠀⠀⡁⠀⠀⠀⠐⠄⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠐⠈⢄⠀⠀⠀⠀
⠀⠀⠀⡈⠀⠀⠀⠀⠀⠆⠀⠀⠀
⠀⠀⠀⠀⠀⠄⠀⡀⠀⠃⠀⠀⠀
⠀⠰⠀⠀⠈⠀⠀⠀⢀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠑⠀⠠⡀⠀⠀
⠀⠀⠀⠄⠀⠀⠀⠀⠀⠈⡐⠀⠀
⠀⠀⠀⠀⠐⡀⠀⠀⠀⠀⠠⠁⠀
⠀⠀⠀⢀⠀⠀⠐⠄⠀⠁⠀⠀⠀
⠂⠀⠀⠀⠀⠄⠀⢀⠀⠀⠀⠈⠀
⠀⠀⠀⠀⠠⠀⠀⠀⠀⡀⠁⠐⠀
⠈⢀⠂⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀
⠄⠀⠀⠀⠀⠀⠀⢀⠀⠁⠀⠐⠀
⠄⠀⠀⡀⠈⠀⠀⠀⠀⠐⠀⠀⠀
⠀⢀⠀⠆⠀⠀⠀⠀⠀⠈⠀⠀⠀
⠀⠈⠀⠀⠀⢀⠀⠀⠀⠂⠀⠠⠀
⠈⠀⠀⠀⠀⠀⠀⠀⠂⡀⠀⠀⠄
⠀⠀⠀⠀⠀⠁⠀⠀⡄⠀⠀⠀⠐
⠀⠀⠈⠀⠀⠠⠀⠐⠀⠀⠀⢀⠀
⠀⠀⠃⠀⡀⠀⠀⠀⠀⠀⠀⠠⠀
⡀⠀⠀⠀⠀⠂⠀⠀⠁⠀⠀⠄⠀

dave.f.kleinschmidt · July 30, 2021, 2:15pm

Hi @rubaiyat and welcome! There are lots of tools for doing this in Julia. I think MLJ has some as well, as does Flux (see the “One-Hot Encoding” page of the Flux docs, as I think someone else mentioned).

If you’re just looking to do a one-off conversion, StatsModels might be a bit overkill for that. I’d recommend using it when you’re trying to (1) work with data that’s coming from a table and (2) combine the one-hot encoded features with some others into a single “model matrix” (like for fitting a regression model), which is really the problem that StatsModels was designed to solve.

If you do want to use StatsModels to do this, I would do something like this:

julia> using StatsModels, DataFrames

julia> data = DataFrame(x = string.(rand('a':'c', 10)))
10×1 DataFrame
 Row │ x
     │ String
─────┼────────
   1 │ c
   2 │ b
   3 │ b
   4 │ b
   5 │ c
   6 │ c
   7 │ a
   8 │ b
   9 │ a
  10 │ c

julia> t = concrete_term(Term(:x), data.x, StatsModels.FullDummyCoding())
x(StatsModels.FullDummyCoding:3→3)

julia> modelcols(t, data)
10×3 Matrix{Float64}:
 0.0  0.0  1.0
 0.0  1.0  0.0
 0.0  1.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 1.0  0.0  0.0
 0.0  1.0  0.0
 1.0  0.0  0.0
 0.0  0.0  1.0

Or:

julia> f = @formula(0 ~ x)
FormulaTerm
Response:
  0
Predictors:
  x(unknown)

julia> f = apply_schema(f, schema(data, Dict(:x => StatsModels.FullDummyCoding())))
FormulaTerm
Response:
  0
Predictors:
  x(StatsModels.FullDummyCoding:3→3)

julia> last(modelcols(f, data))
10×3 Matrix{Float64}:
 0.0  0.0  1.0
 0.0  1.0  0.0
 0.0  1.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 1.0  0.0  0.0
 0.0  1.0  0.0
 1.0  0.0  0.0
 0.0  0.0  1.0

(schema calls concrete_term so these are basically equivalent, just a bit more direct for the first one; last is necessary in the second version because modelcols(::FormulaTerm) gives you a tuple of the model cols for the left- and right-hand sides. You could replace f with the Term(:x) from the first method, too, and then last wouldn’t be necessary).

As you can see, this is pretty awkward if all you want to do is get a one-hot encoding matrix. Also note that you have to use StatsModels.FullDummyCoding because the default dummy coding drops one level, since this is what people expect most of the time for “statistical models” (like regression), except in certain special circumstances that StatsModels tries to detect (lots of info on this in the docs).
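
The reason for dropping a level can be checked without StatsModels. A minimal sketch in plain Julia, assuming the model includes an intercept column (all variable names here are illustrative): the full one-hot columns sum to the intercept column, so the combined matrix is not of full column rank, while dropping the first level gives linearly independent columns.

```julia
using LinearAlgebra

data = ["a", "b", "a", "a", "b"]
levels = sort(unique(data))

X_full = Float64.(data .== permutedims(levels))   # one column per level
X_drop = X_full[:, 2:end]                          # first level treated as the omitted base
intercept = ones(length(data))

rank_full = rank(hcat(intercept, X_full))   # 3 columns, but linearly dependent
rank_drop = rank(hcat(intercept, X_drop))   # 2 columns, linearly independent
```

Here `rank_full` is smaller than the number of columns in the first matrix, whereas `rank_drop` equals the number of columns in the second.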

dave.f.kleinschmidt · July 30, 2021, 2:17pm

Yeah that’s what’s happening…~~we should probably throw an error~~ StatsModels v0.6.24 will error (https://github.com/JuliaStats/StatsModels.jl/pull/231) if the levels are not unique! I think this is the first time I know of this coming up in practice though.

dave.f.kleinschmidt · July 30, 2021, 2:20pm

Also, to further clarify what ContrastsMatrix is doing/means: the contrasts matrix is essentially the embedding matrix, where each row corresponds to the model matrix row that will be generated for the corresponding level of the data. It tells you HOW to generate the one-hot (etc.) embedded representation but doesn’t actually generate it for you! That’s what modelcols does.
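
To make the distinction concrete, here is a minimal sketch in plain Julia that does not use StatsModels; the names `levels`, `contrasts`, `idx`, and `rows` are illustrative. The identity matrix stands in for a full dummy contrasts matrix (one row per level), and indexing into it by each observation's level produces the per-observation rows, which is roughly the step that modelcols performs.

```julia
using LinearAlgebra

data = ["a", "b", "a", "a", "b"]
levels = sort(unique(data))                       # unique levels, one per contrasts row

# One row per level: the "recipe" for encoding that level.
contrasts = Matrix{Float64}(I, length(levels), length(levels))

# Position of each observation's value within `levels`.
idx = Int[findfirst(==(v), levels) for v in data]

# One row per observation: pick the contrasts row for that observation's level.
rows = contrasts[idx, :]
```

Passing the raw data as the levels, as in the original question, makes the "recipe" matrix as large as the data itself, which is why the output there had one row per observation.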

pdeffebach · July 30, 2021, 10:35pm

Here is a pretty complete list for how to do one-hot encoding.

It looks like the easiest might be StatsBase.indicatormat

danielw2904 · July 30, 2021, 11:12pm

My favorite: “All the ways to do one-hot encoding”, post #4 by Mattriks

rafael.guerra · July 31, 2021, 4:53pm

@danielw2904, which for the OP case could be slightly adjusted to:

julia> permutedims(unique(data)) .== data   # by @Mattriks
5×2 BitMatrix:
 1  0
 0  1
 1  0
 1  0
 0  1
