All the ways to do one-hot encoding

xiaodai · July 17, 2021, 10:42am

There are many ways to do one-hot encoding in Julia. I want to list some of them:

| Method | Comment |
|---|---|
| Roll your own | It’s not too hard |
| `Flux.onehot` & `Flux.onehotbatch` | Pretty heavy to depend on Flux, so normally avoided unless you are using Flux anyway |
| `DataConvenience.onehot` | Can be applied directly to dataframes |
| `FeatureTransform.OneHotEncoder` | The syntax for FeatureTransform is still on the verbose side for my liking |
| `MLJ.OneHotEncoder()` | I really want to like MLJ, but it’s quite heavy and makes me want to avoid it. What’s with this `machine`? Why do I need a machine to do one-hot encoding? |
| `ScikitLearn.jl` | Hmm, need to install a bunch of Python stuff. Thank you, no thank you, for such a simple thing |
| Keep your data in categorical format | You need to find libraries that accept that |
| `unique(x) .== permutedims(x)` | Due to @Mattriks. Great if you don’t need a sparse representation |

That’s it. Anything I’ve missed?

bkamins · July 17, 2021, 10:52am

julia> using StatsBase

julia> x = ["a", "b", "c", "a", "b", "d"]
6-element Vector{String}:
 "a"
 "b"
 "c"
 "a"
 "b"
 "d"

julia> indicatormat(x)
4×6 Matrix{Bool}:
 1  0  0  1  0  0
 0  1  0  0  1  0
 0  0  1  0  0  0
 0  0  0  0  0  1

But normally I do your option 1 (roll your own):

julia> using DataFrames

julia> df = DataFrame(x=x)
6×1 DataFrame
 Row │ x
     │ String
─────┼────────
   1 │ a
   2 │ b
   3 │ c
   4 │ a
   5 │ b
   6 │ d

julia> select(df, [:x => ByRow(isequal(v))=> Symbol(v) for v in unique(df.x)])
6×4 DataFrame
 Row │ a      b      c      d
     │ Bool   Bool   Bool   Bool
─────┼────────────────────────────
   1 │  true  false  false  false
   2 │ false   true  false  false
   3 │ false  false   true  false
   4 │  true  false  false  false
   5 │ false   true  false  false
   6 │ false  false  false   true

(DataConvenience.jl is nice.)

xiaodai · July 17, 2021, 11:01am

Yeah. Nice. I think the issue with DataConvenience.jl is that it’s not designed for production use, so categories that are missing at production time are harder to handle.
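
One way to make the roll-your-own route behave at production time is to fix the list of levels once, at training time, and reuse it afterwards. The following is only a minimal sketch under that assumption; `onehot_fixed` is a hypothetical helper name, not part of DataConvenience.jl or any other package:

```julia
# Hypothetical helper: encode `x` against a fixed list of `levels`.
# Rows follow `levels`, columns follow the observations in `x`.
onehot_fixed(x, levels) = levels .== permutedims(x)

levels = ["a", "b", "c", "d"]      # recorded once from the training data
newdata = ["b", "a", "z"]          # "c" and "d" are absent, "z" was never seen

M = onehot_fixed(newdata, levels)  # 4×3 BitMatrix

# Columns with no `true` entry correspond to values outside `levels`.
unseen = vec(.!any(M, dims=1))
```

Levels that do not occur in the new data still get a row (all zeros), so the output shape stays stable, and an unseen value such as "z" shows up as an all-zero column that can be flagged via `unseen`.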

Mattriks · July 17, 2021, 11:38am

unique(x) .== permutedims(x)

oxinabox · July 17, 2021, 12:04pm

StatsModels.jl DummyCoding
https://juliastats.org/StatsModels.jl/stable/contrasts/#StatsModels.DummyCoding

oxinabox · July 17, 2021, 1:12pm

Mind Blown.

julia> x = [1, 2, 1, 3, 2];

julia> unique(x) .== permutedims(x)
3×5 BitMatrix:
 1  0  1  0  0
 0  1  0  0  1
 0  0  0  1  0

That is so elegant.
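
For anyone wondering why this works: `unique(x)` is a length-k vector (treated as a column) and `permutedims(x)` is a 1×n row, so broadcasting `==` expands both to a k×n grid of comparisons. A small sketch that checks this against an explicit comprehension and decodes the result back to the labels (the variable names are just for illustration):

```julia
x = [1, 2, 1, 3, 2]
levels = unique(x)                  # [1, 2, 3], in order of first appearance

A = levels .== permutedims(x)       # k×n BitMatrix via broadcasting

# The same thing spelled out: entry (i, j) is true when level i equals x[j].
B = [levels[i] == x[j] for i in eachindex(levels), j in eachindex(x)]

# Decoding: each column has exactly one `true`, whose row index picks the level.
decoded = [levels[findfirst(col)] for col in eachcol(A)]
```

Note that the row order follows the order in which values first appear in `x`, so keep `levels` around if you need to map rows back to labels.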

gbaraldi · July 17, 2021, 1:23pm

lostella · July 17, 2021, 3:30pm

Brilliant! Not sure what permutedims does in case x is multidimensional, so maybe the following is slightly more general:

unique(x) .== reshape(x, (1, size(x)...))
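
As a rough illustration of what the reshape version gives for a matrix input (a sketch on toy data, not a recommendation of a particular layout): the levels end up along the first dimension, and the remaining dimensions mirror the shape of `x`.

```julia
x = [1 2; 3 1; 2 2]                 # 3×2 matrix of labels
levels = unique(x)                  # unique values, collected in column-major order

# Put the levels on dimension 1 and keep the shape of `x` on dimensions 2 and 3.
A = levels .== reshape(x, (1, size(x)...))

size(A)                             # (length(levels), size(x)...)
A[:, 2, 1]                          # one-hot vector for x[2, 1]
```

Each slice `A[:, i, j]` is then the one-hot vector for `x[i, j]`.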

jling · July 17, 2021, 3:44pm

how do you one-hot encode a multidimensional x?

It’s so nice that our broadcast does the outer product automatically.

DoktorMike · July 17, 2021, 5:59pm

Not in a million years would I have thought of that. Awesome

ToucheSir · July 17, 2021, 6:37pm

Flux’s OneHotArray has gotten a lot of work recently and is pretty robust. It’s also self-contained, so we could theoretically pull it into a separate package if there’s enough interest.

e3c6 · November 6, 2021, 12:46pm

https://github.com/cossio/OneHot.jl

xiaodai · November 8, 2021, 2:52am

a doc or simple instructions in the readme would be helpful.

jamaas · November 11, 2021, 1:43pm

Ok, here is your challenge. I’m new to this and struggling to get any one-hot conversion to work. This is an MWE: here are some very small DNA sequences. What is the best way to one-hot encode this object, which I think would normally be in a dataframe, for ML? Note that there are the standard 4 bases and a couple of unknown “N”s in the reads. Thx. J

TGTCCGGCTCACCCACATAACCATATATATATATATAG
TATATAATAACCATTAACCAATATATATGGTTATGTGG
CAACATCATTAATTTANAGAGATTTTACTATGGAATAA
TTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATT
TTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT

nilshg · November 11, 2021, 2:05pm

Not entirely sure how this fits into a DataFrame or what output you’re expecting, but one way:

julia> s = "TGTCCGGCTCACCCACATAACCATATATATATATATAGTATATAATAACCATTAACCAATATATATGGTTATGTGGCAACATCATTAATTTANAGAGATTTTACTATGGAATAATTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATTTTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"
"TGTCCGGCTCACCCACATAACCATATATATATATATAGTATATAATAACCATTAACCAATATATATGGTTATGTGGCAACATCATTAATTTANAGAGATTTTACTATGGAATAATTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATTTTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"

julia> [collect(s) .== x for x ∈ ['A', 'C', 'G', 'T', 'N']]
5-element Vector{BitVector}:
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 1, 0, 1, 0, 0, 0, 0, 1, 0]
 [0, 0, 0, 1, 1, 0, 0, 1, 0, 1  …  0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
 [0, 1, 0, 0, 0, 1, 1, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
 [1, 0, 1, 0, 0, 0, 0, 0, 1, 0  …  1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

(where the first vector is 1 when character is A, the second vector is 1 when the character is C etc)
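
If a single matrix is more convenient than a vector of vectors, the same comparison can be written as one broadcast. A small sketch on a shortened stand-in string (the names `alphabet` and `M` are only for illustration):

```julia
alphabet = ['A', 'C', 'G', 'T', 'N']
s = "TGTCNA"                               # shortened stand-in for the full read

# One row per character of `s`, one column per letter of `alphabet`.
M = collect(s) .== permutedims(alphabet)   # 6×5 BitMatrix
```

Here row `i` of `M` is the one-hot code of the `i`-th character, and the columns correspond to the five vectors above.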

bkamins · November 11, 2021, 2:07pm

I assume there are 5 sequences, so to also keep track of the sequence ID you can do, e.g.:

julia> using DataFrames

julia> seq = reduce(hcat, collect.(["TGTCCGGCTCACCCACATAACCATATATATATATATAG"
              "TATATAATAACCATTAACCAATATATATGGTTATGTGG"
              "CAACATCATTAATTTANAGAGATTTTACTATGGAATAA"
              "TTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATT"
              "TTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"]))
38×5 Matrix{Char}:
 'T'  'T'  'C'  'T'  'T'
 'G'  'A'  'A'  'T'  'T'
 'T'  'T'  'A'  'G'  'T'
 'C'  'A'  'C'  'T'  'T'
 'C'  'T'  'A'  'G'  'G'
 'G'  'A'  'T'  'T'  'T'
 'G'  'A'  'C'  'G'  'T'
 'C'  'T'  'A'  'A'  'G'
 'T'  'A'  'T'  'A'  'A'
 'C'  'A'  'T'  'T'  'C'
 'A'  'C'  'A'  'N'  'A'
 'C'  'C'  'A'  'C'  'G'
 'C'  'A'  'T'  'A'  'A'
 ⋮
 'T'  'T'  'T'  'T'  'C'
 'A'  'A'  'A'  'C'  'A'
 'T'  'T'  'C'  'A'  'A'
 'A'  'G'  'T'  'A'  'T'
 'T'  'G'  'A'  'A'  'A'
 'A'  'T'  'T'  'A'  'T'
 'T'  'T'  'G'  'G'  'A'
 'A'  'A'  'G'  'A'  'C'
 'T'  'T'  'A'  'A'  'T'
 'A'  'G'  'A'  'T'  'G'
 'T'  'T'  'T'  'A'  'T'
 'A'  'G'  'A'  'T'  'A'
 'G'  'G'  'A'  'T'  'T'

julia> df = DataFrame(seq, string.("seq", 1:5))
38×5 DataFrame
 Row │ seq1  seq2  seq3  seq4  seq5
     │ Char  Char  Char  Char  Char
─────┼──────────────────────────────
   1 │ T     T     C     T     T
   2 │ G     A     A     T     T
   3 │ T     T     A     G     T
   4 │ C     A     C     T     T
   5 │ C     T     A     G     G
   6 │ G     A     T     T     T
   7 │ G     A     C     G     T
   8 │ C     T     A     A     G
   9 │ T     A     T     A     A
  10 │ C     A     T     T     C
  11 │ A     C     A     N     A
  ⋮  │  ⋮     ⋮     ⋮     ⋮     ⋮
  28 │ T     T     C     A     A
  29 │ A     G     T     A     T
  30 │ T     G     A     A     A
  31 │ A     T     T     A     T
  32 │ T     T     G     G     A
  33 │ A     A     G     A     C
  34 │ T     T     A     A     T
  35 │ A     G     A     T     G
  36 │ T     T     T     A     T
  37 │ A     G     A     T     A
  38 │ G     G     A     T     T
                     16 rows omitted

julia> long = stack(df, :)
190×2 DataFrame
 Row │ variable  value
     │ String    Char
─────┼─────────────────
   1 │ seq1      T
   2 │ seq1      G
   3 │ seq1      T
   4 │ seq1      C
   5 │ seq1      C
   6 │ seq1      G
   7 │ seq1      G
   8 │ seq1      C
   9 │ seq1      T
  10 │ seq1      C
  11 │ seq1      A
  ⋮  │    ⋮        ⋮
 180 │ seq5      A
 181 │ seq5      T
 182 │ seq5      A
 183 │ seq5      T
 184 │ seq5      A
 185 │ seq5      C
 186 │ seq5      T
 187 │ seq5      G
 188 │ seq5      T
 189 │ seq5      A
 190 │ seq5      T
       168 rows omitted

julia> transform(long, :value => (x -> x .== ['T' 'G' 'C' 'A' 'N']) => [:T, :G, :C, :A, :N])
190×7 DataFrame
 Row │ variable  value  T      G      C      A      N
     │ String    Char   Bool   Bool   Bool   Bool   Bool
─────┼────────────────────────────────────────────────────
   1 │ seq1      T       true  false  false  false  false
   2 │ seq1      G      false   true  false  false  false
   3 │ seq1      T       true  false  false  false  false
   4 │ seq1      C      false  false   true  false  false
   5 │ seq1      C      false  false   true  false  false
   6 │ seq1      G      false   true  false  false  false
   7 │ seq1      G      false   true  false  false  false
   8 │ seq1      C      false  false   true  false  false
   9 │ seq1      T       true  false  false  false  false
  10 │ seq1      C      false  false   true  false  false
  11 │ seq1      A      false  false  false   true  false
  ⋮  │    ⋮        ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
 180 │ seq5      A      false  false  false   true  false
 181 │ seq5      T       true  false  false  false  false
 182 │ seq5      A      false  false  false   true  false
 183 │ seq5      T       true  false  false  false  false
 184 │ seq5      A      false  false  false   true  false
 185 │ seq5      C      false  false   true  false  false
 186 │ seq5      T       true  false  false  false  false
 187 │ seq5      G      false   true  false  false  false
 188 │ seq5      T       true  false  false  false  false
 189 │ seq5      A      false  false  false   true  false
 190 │ seq5      T       true  false  false  false  false
                                          168 rows omitted

jamaas · November 11, 2021, 2:36pm

Thanks for that. I’ve missed something here. I thought that the one hot coding would convert each individual letter, A, T, C, G into individual four-digit codes such as 1 0 0 0, 0 1 0 0, 0 0 1 0, and 0 0 0 1.

There may be much better ways in Julia, but I was looking for something equivalent to the tensorflow.keras function that is supposed to work like:

preprocessing.text.one_hot(
        input_text = input_object,  # dataframe?
        n = 4, # number of individual objects to code for, A, T, C, G
        filters = 'N', # filter out the N values
        lower = False,
        split = ' ')

BTW, I’ve tried this and not got it to work either, but I can see the logic. I’m happy to be told there are more precise and/or elegant ways to get the final result?

jamaas · November 11, 2021, 2:41pm

Thanks a bunch, very elegant. As I just wrote in the other note I was expecting the end product to be four individual four-digit codes but my expectation might be incorrect? Also the hard coding 1:5, is that 1:5 bases or 1:5 sequences? In the actual code I’d be running it on thousands of sequences.

nilshg · November 11, 2021, 2:47pm

One-hot encoding generally turns a categorical variable into a group of indicator (0/1) vectors, one per category. Your “four-digit code” essentially works row-wise:

| Sequence | A | C | G | T | N |
|---|---|---|---|---|---|
| 'A' | 1 | 0 | 0 | 0 | 0 |
| 'T' | 0 | 0 | 0 | 1 | 0 |

etc. - which is exactly what Bogumil shows (in my solution, the “A”, “C”, “G”, “T”, and “N” vectors are the columns in the table above).
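
For many sequences at once, one possible layout (a sketch only, assuming all reads have the same length) is a 3-dimensional array with the four-digit code along the first dimension. Leaving 'N' out of the level list makes unknown bases come out as all zeros, which is one way to "filter" them; the toy reads below are stand-ins, not the data from the question:

```julia
bases = ['A', 'T', 'C', 'G']               # 'N' deliberately left out
seqs = ["TGTC", "TANA", "CCGA"]            # toy reads, all of equal length

# position × sequence matrix of characters, as in the DataFrame example above
chars = reduce(hcat, collect.(seqs))       # 4×3 Matrix{Char}

# base × position × sequence; X[:, p, s] is the four-digit code of read s at position p
X = bases .== reshape(chars, (1, size(chars)...))
```

Nothing here hard-codes the number of sequences, so the same lines apply to thousands of reads as long as they fit in memory.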

xiaodai · November 11, 2021, 10:37pm

Shouldn’t there be a specialised structure to store these? Since there are only 4 possible values, 2 bits can be used to represent each value.

I am pretty sure I’ve read about specialised data structures that store these efficiently. Just not sure if they are in Julia just yet.
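
To make the 2-bit idea concrete, here is a hand-rolled sketch in plain Julia; it is not the API of any existing package, and `CODE`, `BASE`, `pack2bit` and `unpack2bit` are hypothetical names. It assumes only A, C, G and T occur, so an 'N' would need a wider code or separate handling:

```julia
# Hypothetical 2-bit code table (assumption: no 'N' in the input).
CODE = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)
BASE = ('A', 'C', 'G', 'T')

# Pack four bases per byte, lowest two bits first.
function pack2bit(s::AbstractString)
    n = length(s)
    out = zeros(UInt8, cld(n, 4))
    for (i, c) in enumerate(s)
        byte, slot = divrem(i - 1, 4)
        out[byte + 1] |= CODE[c] << (2 * slot)   # throws KeyError on unknown bases
    end
    return out, n
end

# Reverse of pack2bit; `n` is needed because the last byte may be partly unused.
function unpack2bit(packed::Vector{UInt8}, n::Integer)
    chars = [BASE[((packed[(i - 1) ÷ 4 + 1] >> (2 * ((i - 1) % 4))) & 0x03) + 1] for i in 1:n]
    return String(chars)
end

packed, n = pack2bit("ACGTTG")
```

This stores each base in 2 bits instead of the 4 or 5 Bool entries of a one-hot column, at the cost of having to unpack before feeding the data to a model.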
