All the ways to do one-hot encoding

There are many ways to do one-hot encoding in Julia. I want to list some of them:

| Method | Comment |
|---|---|
| Roll your own | It’s not too hard |
| Flux.onehot & Flux.onehotbatch | Flux is a pretty heavy dependency, so normally avoided unless you are using Flux anyway |
| DataConvenience.onehot | Can be applied directly to dataframes |
| FeatureTransform.OneHotEncoder | The syntax for FeatureTransform is still on the verbose side for my liking |
| MLJ.OneHotEncoder() | I really want to like MLJ but it’s quite heavy and makes me want to avoid it. What’s with this machine? Why do I need a machine to do one-hot encoding? |
| ScikitLearn.jl | Hmm, need to install a bunch of Python stuff. Thank you, no thank you, for such a simple thing |
| Keep your data in categorical format | You need to find libraries that accept that |
| unique(x) .== permutedims(x) | Due to @Mattriks. Great if you don’t need a sparse representation |
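For the "roll your own" row, here is a minimal sketch of a sparse variant, for cases where the dense broadcast result would be too big. The helper name `onehot_sparse` is made up for illustration and does not come from any package; it only assumes the SparseArrays standard library.

```julia
using SparseArrays

# Hypothetical helper (not from any package): one-hot encode `x` against a
# fixed list of `lvls`, returning a sparse (levels × observations) matrix.
function onehot_sparse(x, lvls)
    rows = indexin(x, lvls)                 # row of each observation, or `nothing`
    any(isnothing, rows) && throw(ArgumentError("value not found in lvls"))
    cols = 1:length(x)                      # one column per observation
    vals = fill(true, length(x))
    return sparse(Int.(rows), cols, vals, length(lvls), length(x))
end

x = ["a", "b", "c", "a", "b", "d"]
m = onehot_sparse(x, unique(x))             # 4×6 sparse Bool matrix
```

Only one entry per observation is stored, so memory grows with `length(x)` rather than with `length(x) * length(lvls)`.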

That’s it. Anything I’ve missed?

julia> using StatsBase

julia> x = ["a", "b", "c", "a", "b", "d"]
6-element Vector{String}:
 "a"
 "b"
 "c"
 "a"
 "b"
 "d"

julia> indicatormat(x)
4×6 Matrix{Bool}:
 1  0  0  1  0  0
 0  1  0  0  1  0
 0  0  1  0  0  0
 0  0  0  0  0  1

but normally I do your option 1 :smiley: :

julia> using DataFrames

julia> df = DataFrame(x=x)
6×1 DataFrame
 Row │ x
     │ String
─────┼────────
   1 │ a
   2 │ b
   3 │ c
   4 │ a
   5 │ b
   6 │ d

julia> select(df, [:x => ByRow(isequal(v)) => Symbol(v) for v in unique(df.x)])
6×4 DataFrame
 Row │ a      b      c      d
     │ Bool   Bool   Bool   Bool
─────┼────────────────────────────
   1 │  true  false  false  false
   2 │ false   true  false  false
   3 │ false  false   true  false
   4 │  true  false  false  false
   5 │ false   true  false  false
   6 │ false  false  false   true

(DataConvenience.jl is nice :+1:)


Yeah. Nice. I think the issue with DataConvenience.jl is that it’s not designed for production use, so categories that are missing at production time are harder to handle.
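One way to guard against that, sketched here as an assumption rather than as what any of the packages above do: record the levels once from the training data and always encode against that stored list instead of `unique(x)`. The names `train_levels` and `encode` are hypothetical.

```julia
# Levels recorded once from the training data (illustrative values).
train_levels = ["a", "b", "c", "d"]

# Hypothetical helper: encode against the stored levels, not unique(x),
# so the number and order of rows never change between training and production.
encode(x, lvls) = lvls .== permutedims(x)

new_x = ["b", "d", "z"]            # "z" was never seen in training
m = encode(new_x, train_levels)    # 4×3 BitMatrix; the "z" column is all zeros
```

Whether an unseen value should become an all-zero column, an error, or a dedicated "other" row is a design choice; the sketch just makes the first option explicit.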

unique(x) .== permutedims(x)

StatsModels.jl DummyCoding
https://juliastats.org/StatsModels.jl/stable/contrasts/#StatsModels.DummyCoding


Mind Blown.

julia> x = [1, 2, 1, 3, 2];

julia> unique(x) .== permutedims(x)
3×5 BitMatrix:
 1  0  1  0  0
 0  1  0  0  1
 0  0  0  1  0

That is so elegant.


:astonished:

Brilliant! Not sure what permutedims does in case x is multidimensional, so maybe the following is slightly more general:

unique(x) .== reshape(x, (1, size(x)...))
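A small sketch of what the reshape version gives for a matrix input (toy values, chosen only for illustration): the result gains one leading "category" axis, followed by the original shape of `x`.

```julia
x = [1 2; 3 1]                                    # 2×2 matrix of labels
onehot = unique(x) .== reshape(x, (1, size(x)...))

size(onehot)        # (3, 2, 2): 3 unique labels, then the 2×2 shape of x
onehot[:, 2, 1]     # indicator vector for x[2, 1]
```

Each slice `onehot[:, i, j]` contains exactly one `true`, at the position of `x[i, j]` within `unique(x)`.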

How do you one-hot encode a multidimensional x?


It’s so nice that our broadcast does the outer comparison automatically :wink:
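To spell out what the broadcast is doing: `unique(x)` is a length-k column vector and `permutedims(x)` is a 1×n row, so `.==` compares every pair and yields a k×n matrix. A minimal sketch of the same thing as an explicit comprehension (variable names are just for illustration):

```julia
x = [1, 2, 1, 3, 2]
u = unique(x)

via_broadcast = u .== permutedims(x)           # k×n BitMatrix
via_loop = [ui == xj for ui in u, xj in x]     # k×n Matrix{Bool}, same entries

via_broadcast == via_loop   # true
```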

Not in a million years would I have thought of that. :joy: Awesome :+1:t2:

Flux’s OneHotArray has gotten a lot of work recently and is pretty robust. It’s also self-contained, so we could theoretically pull it into a separate package if there’s enough interest.


https://github.com/cossio/OneHot.jl

A doc or simple instructions in the README would be helpful.


OK, here is your challenge. I’m new to this and struggling to get any one-hot conversion to work. Here is an MWE with some very short DNA sequences. What is the best way to one-hot encode this object, which I think would normally be in a dataframe, for ML? Note there are the standard 4 bases and a couple of unknown “N”s in the reads. Thx. J

TGTCCGGCTCACCCACATAACCATATATATATATATAG
TATATAATAACCATTAACCAATATATATGGTTATGTGG
CAACATCATTAATTTANAGAGATTTTACTATGGAATAA
TTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATT
TTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT

Not entirely sure how this fits into a DataFrame or what output you’re expecting, but one way:

julia> s = "TGTCCGGCTCACCCACATAACCATATATATATATATAGTATATAATAACCATTAACCAATATATATGGTTATGTGGCAACATCATTAATTTANAGAGATTTTACTATGGAATAATTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATTTTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"
"TGTCCGGCTCACCCACATAACCATATATATATATATAGTATATAATAACCATTAACCAATATATATGGTTATGTGGCAACATCATTAATTTANAGAGATTTTACTATGGAATAATTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATTTTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"

julia> [collect(s) .== x for x ∈ ['A', 'C', 'G', 'T', 'N']]
5-element Vector{BitVector}:
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 1, 0, 1, 0, 0, 0, 0, 1, 0]
 [0, 0, 0, 1, 1, 0, 0, 1, 0, 1  …  0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
 [0, 1, 0, 0, 0, 1, 1, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
 [1, 0, 1, 0, 0, 0, 0, 0, 1, 0  …  1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

(where the first vector is 1 when the character is A, the second vector is 1 when the character is C, etc.)
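If a single matrix is more convenient than a vector of BitVectors, the broadcast trick from earlier in the thread applies here too. A minimal sketch on a shortened read (the short string and the name `alphabet` are just for illustration):

```julia
s = "TGTCCGGCTCAC"                          # shortened read, for illustration only
alphabet = ['A', 'C', 'G', 'T', 'N']

m = alphabet .== permutedims(collect(s))    # 5×12 BitMatrix, one column per base
```

Row `i` of `m` marks the positions where the read equals `alphabet[i]`, so the rows correspond to the five vectors above.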


I assume there are 5 sequences, so you can, for example, do the following to also keep track of the sequence ID:

julia> using DataFrames

julia> seq = reduce(hcat, collect.(["TGTCCGGCTCACCCACATAACCATATATATATATATAG"
              "TATATAATAACCATTAACCAATATATATGGTTATGTGG"
              "CAACATCATTAATTTANAGAGATTTTACTATGGAATAA"
              "TTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATT"
              "TTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"]))
38×5 Matrix{Char}:
 'T'  'T'  'C'  'T'  'T'
 'G'  'A'  'A'  'T'  'T'
 'T'  'T'  'A'  'G'  'T'
 'C'  'A'  'C'  'T'  'T'
 'C'  'T'  'A'  'G'  'G'
 'G'  'A'  'T'  'T'  'T'
 'G'  'A'  'C'  'G'  'T'
 'C'  'T'  'A'  'A'  'G'
 'T'  'A'  'T'  'A'  'A'
 'C'  'A'  'T'  'T'  'C'
 'A'  'C'  'A'  'N'  'A'
 'C'  'C'  'A'  'C'  'G'
 'C'  'A'  'T'  'A'  'A'
 ⋮
 'T'  'T'  'T'  'T'  'C'
 'A'  'A'  'A'  'C'  'A'
 'T'  'T'  'C'  'A'  'A'
 'A'  'G'  'T'  'A'  'T'
 'T'  'G'  'A'  'A'  'A'
 'A'  'T'  'T'  'A'  'T'
 'T'  'T'  'G'  'G'  'A'
 'A'  'A'  'G'  'A'  'C'
 'T'  'T'  'A'  'A'  'T'
 'A'  'G'  'A'  'T'  'G'
 'T'  'T'  'T'  'A'  'T'
 'A'  'G'  'A'  'T'  'A'
 'G'  'G'  'A'  'T'  'T'

julia> df = DataFrame(seq, string.("seq", 1:5))
38×5 DataFrame
 Row │ seq1  seq2  seq3  seq4  seq5
     │ Char  Char  Char  Char  Char
─────┼──────────────────────────────
   1 │ T     T     C     T     T
   2 │ G     A     A     T     T
   3 │ T     T     A     G     T
   4 │ C     A     C     T     T
   5 │ C     T     A     G     G
   6 │ G     A     T     T     T
   7 │ G     A     C     G     T
   8 │ C     T     A     A     G
   9 │ T     A     T     A     A
  10 │ C     A     T     T     C
  11 │ A     C     A     N     A
  ⋮  │  ⋮     ⋮     ⋮     ⋮     ⋮
  28 │ T     T     C     A     A
  29 │ A     G     T     A     T
  30 │ T     G     A     A     A
  31 │ A     T     T     A     T
  32 │ T     T     G     G     A
  33 │ A     A     G     A     C
  34 │ T     T     A     A     T
  35 │ A     G     A     T     G
  36 │ T     T     T     A     T
  37 │ A     G     A     T     A
  38 │ G     G     A     T     T
                     16 rows omitted

julia> long = stack(df, :)
190×2 DataFrame
 Row │ variable  value
     │ String    Char
─────┼─────────────────
   1 │ seq1      T
   2 │ seq1      G
   3 │ seq1      T
   4 │ seq1      C
   5 │ seq1      C
   6 │ seq1      G
   7 │ seq1      G
   8 │ seq1      C
   9 │ seq1      T
  10 │ seq1      C
  11 │ seq1      A
  ⋮  │    ⋮        ⋮
 180 │ seq5      A
 181 │ seq5      T
 182 │ seq5      A
 183 │ seq5      T
 184 │ seq5      A
 185 │ seq5      C
 186 │ seq5      T
 187 │ seq5      G
 188 │ seq5      T
 189 │ seq5      A
 190 │ seq5      T
       168 rows omitted

julia> transform(long, :value => (x -> x .== ['T' 'G' 'C' 'A' 'N']) => [:T, :G, :C, :A, :N])
190×7 DataFrame
 Row │ variable  value  T      G      C      A      N
     │ String    Char   Bool   Bool   Bool   Bool   Bool
─────┼────────────────────────────────────────────────────
   1 │ seq1      T       true  false  false  false  false
   2 │ seq1      G      false   true  false  false  false
   3 │ seq1      T       true  false  false  false  false
   4 │ seq1      C      false  false   true  false  false
   5 │ seq1      C      false  false   true  false  false
   6 │ seq1      G      false   true  false  false  false
   7 │ seq1      G      false   true  false  false  false
   8 │ seq1      C      false  false   true  false  false
   9 │ seq1      T       true  false  false  false  false
  10 │ seq1      C      false  false   true  false  false
  11 │ seq1      A      false  false  false   true  false
  ⋮  │    ⋮        ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
 180 │ seq5      A      false  false  false   true  false
 181 │ seq5      T       true  false  false  false  false
 182 │ seq5      A      false  false  false   true  false
 183 │ seq5      T       true  false  false  false  false
 184 │ seq5      A      false  false  false   true  false
 185 │ seq5      C      false  false   true  false  false
 186 │ seq5      T       true  false  false  false  false
 187 │ seq5      G      false   true  false  false  false
 188 │ seq5      T       true  false  false  false  false
 189 │ seq5      A      false  false  false   true  false
 190 │ seq5      T       true  false  false  false  false
                                          168 rows omitted

Thanks for that. I’ve missed something here. I thought that the one hot coding would convert each individual letter, A, T, C, G into individual four-digit codes such as 1 0 0 0, 0 1 0 0, 0 0 1 0, and 0 0 0 1.

There may be much better ways in Julia, but I was looking for something equivalent to the Keras (TensorFlow) function that is supposed to work like:

preprocessing.text.one_hot(
        input_text = input_object,  # dataframe?
        n = 4, # number of individual objects to code for, A, T, C, G
        filters = 'N', # filter out the N values
        lower = False,
        split = ' ')

BTW, I’ve tried this and not got it to work either, but I can see the logic. I’m happy to be told there are more precise and/or elegant ways to get the final result.

Thanks a bunch, very elegant. As I just wrote in the other note, I was expecting the end product to be four individual four-digit codes, but my expectation might be incorrect? Also, the hard-coded 1:5: is that 1:5 bases or 1:5 sequences? In the actual code I’d be running it on thousands of sequences.
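On the `1:5` question: in the snippet above, `string.("seq", 1:5)` only builds the column names seq1 to seq5, so it counts sequences, not bases. A rough sketch that avoids hard-coding the count, assuming all reads have the same length (the helper `onehot_reads` and the toy reads are made up for illustration):

```julia
# Hypothetical helper: one-hot encode equal-length reads into a 3-D array with
# axes (base, position, read). Nothing is hard-coded to 5 reads.
function onehot_reads(reads::AbstractVector{<:AbstractString};
                      alphabet = collect("ACGTN"))
    chars = reduce(hcat, collect.(reads))        # position × read Matrix{Char}
    return alphabet .== reshape(chars, (1, size(chars)...))
end

reads = ["TGTCC", "TATAN", "CAACA"]              # toy data
enc = onehot_reads(reads)

size(enc)        # (5, 5, 3): 5 symbols, 5 positions, 3 reads
enc[:, 1, 2]     # indicator vector for the first base of the second read ('T')
```

The same call works for thousands of reads; only the last dimension grows.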

One-hot encoding generally turns a categorical variable into a group of indicator (0/1) vectors, one per category. Your “four-digit code” essentially reads the same thing row-wise:

| Sequence | A | C | G | T | N |
|---|---|---|---|---|---|
| 'A' | 1 | 0 | 0 | 0 | 0 |
| 'T' | 0 | 0 | 0 | 1 | 0 |

etc., which is exactly what Bogumil shows (in my solution, the “A”, “C”, “G”, “T”, and “N” vectors are the columns in the table above).
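If you want the per-base four-digit codes directly, a minimal sketch could look like this. It assumes that 'N' should simply map to all zeros, which is one possible reading of "filter out the N values", not something established in this thread; `bases` and `code` are illustrative names.

```julia
bases = ['A', 'C', 'G', 'T']

# Hypothetical helper: four-digit code for a single base.
# Assumption: 'N' (or any other character) gives all zeros.
code(c::Char) = Int.(bases .== c)

code('A')   # [1, 0, 0, 0]
code('T')   # [0, 0, 0, 1]
code('N')   # [0, 0, 0, 0]
```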

Shouldn’t there be a specialised structure to store these? There are only 4 possible values, so 2 bits are enough to represent each one.

I am pretty sure I’ve read about specialised data structures that store these efficiently. Just not sure if they are in Julia just yet.
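To make the 2-bit idea concrete, here is a hand-rolled sketch in plain Julia. It is not an existing package's API; the function names and the A=0, C=1, G=2, T=3 mapping are assumptions chosen for illustration. Note that 'N' does not fit in 2 bits, so this sketch rejects it; a real structure would need a separate mask or a wider code.

```julia
# Hypothetical 2-bit code for a base: A=0, C=1, G=2, T=3.
function base_code(c::Char)
    c == 'A' && return 0x00
    c == 'C' && return 0x01
    c == 'G' && return 0x02
    c == 'T' && return 0x03
    throw(ArgumentError("unsupported base $c (2 bits only cover A, C, G, T)"))
end

# Pack four bases into each byte, lowest bits first.
function pack2bit(seq::AbstractString)
    n = length(seq)
    packed = zeros(UInt8, cld(n, 4))
    for (i, c) in enumerate(seq)
        byte, slot = divrem(i - 1, 4)
        packed[byte + 1] |= base_code(c) << (2 * slot)
    end
    return packed, n
end

# Recover the original string from the packed bytes and the base count.
function unpack2bit(packed::Vector{UInt8}, n::Integer)
    letters = ('A', 'C', 'G', 'T')
    chars = [letters[((packed[(i - 1) ÷ 4 + 1] >> (2 * ((i - 1) % 4))) & 0x03) + 1]
             for i in 1:n]
    return String(chars)
end

packed, n = pack2bit("TGTCCGGCTC")      # 10 bases stored in 3 bytes
unpack2bit(packed, n)                   # "TGTCCGGCTC"
```

This stores the raw sequence compactly; the one-hot matrix can still be produced on the fly from the unpacked characters when a model needs it.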