All the ways to do one-hot encoding

Maybe something like this is what you want?

julia> struct OneHotCode{T} <: AbstractVector{T}
           i::T
           length::T
       end

julia> Base.getindex(o::OneHotCode,i) = i == o.i ? 1 : 0

julia> Base.size(o::OneHotCode) = (o.length,)

julia> Base.length(o::OneHotCode) = o.length

julia> function conv_seq(seq::String)
           seq_out = OneHotCode{Int8}[]
           for char in seq
               char == 'A' ? hot = 1 :
               char == 'T' ? hot = 2 :
               char == 'C' ? hot = 3 :
               char == 'G' ? hot = 4 :
               char == 'N' ? hot = 5 :
               error("Illegal base")
               push!(seq_out,OneHotCode(Int8(hot),Int8(5)))
           end
           return seq_out
       end
conv_seq (generic function with 1 method)

julia> s = "TGTCCGGCTCACCCACATAACCATATATATATATATAGTATATAATAACCATTAACCAATATATATGGTTATGTGGCAACATCATTAATTTANAGAGATTTTACTATGGAATAATTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATTTTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"
"TGTCCGGCTCACCCACATAACCATATATATATATATAGTATATAATAACCATTAACCAATATATATGGTTATGTGGCAACATCATTAATTTANAGAGATTTTACTATGGAATAATTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATTTTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"

julia> conv_seq(s)
190-element Vector{OneHotCode{Int8}}:
 [0, 1, 0, 0, 0]
 [0, 0, 0, 1, 0]
 ⋮
 [1, 0, 0, 0, 0]
 [0, 1, 0, 0, 0]
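
The chain of comparisons can also be written as a table lookup. The following is a minimal sketch of that variant, not a replacement for the code above: the names `LazyOneHot`, `BASE_INDEX`, and `encode_bases` are made up for illustration, and it assumes a `Bool` element type is acceptable.

```julia
# Lazy one-hot vector: stores only the hot position and the length.
struct LazyOneHot <: AbstractVector{Bool}
    hot::Int
    len::Int
end

Base.size(o::LazyOneHot) = (o.len,)

function Base.getindex(o::LazyOneHot, i::Int)
    @boundscheck checkbounds(o, i)
    return i == o.hot
end

# Lookup table replacing the chain of comparisons (hypothetical name).
const BASE_INDEX = Dict('A' => 1, 'T' => 2, 'C' => 3, 'G' => 4, 'N' => 5)

function encode_bases(seq::AbstractString)
    n = length(BASE_INDEX)
    # get(f, dict, key) calls f only when the key is missing.
    return [LazyOneHot(get(() -> error("Illegal base: $c"), BASE_INDEX, c), n) for c in seq]
end

encoded = encode_bases("TGN")
```

Because the type subtypes `AbstractVector`, generic functions such as `collect`, `sum`, and `findfirst` work on each element without materializing the zeros up front.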

A student of mine played with something like that to represent amino acids in an ML code.

Looking at all this, maybe one-hot should be added to Base?

No, these things belong in packages.

Thanks for this, yes, it is exactly what I need. My only question is whether it is possible to get the same result using one of the canned functions already built into a package such as Flux, MLBase, or MLJ?

In MLJ, by specifying that some feature or some target of your data is either Multiclass or OrderedFactor (*), you don’t have to bother with OHE: MLJ passes this information down to the model interface, which applies the most appropriate encoding for the model you intend to use, without requiring you to roll your own.

(*) Depending on your case: for the ATGC case it would be Multiclass{4}, and for some scoring it would be OrderedFactor. See also ScientificTypes.jl.

This is nice.

One-Hot Encoding · Flux?

Thanks! I think I would only add a sort, so that the mapping is more intuitive.

sort(unique(x)) .== reshape(x, (1, size(x)...))
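
As a worked example of the sorted variant, here is a minimal sketch using only Base. It assumes the input is a plain vector of labels; `classes` and `onehot` are names chosen for illustration. Rows follow the sorted classes and columns follow the positions in `x`.

```julia
x = ["a", "b", "a", "c", "c", "b"]

# Sorted class labels, so row 1 is "a", row 2 is "b", row 3 is "c".
classes = sort(unique(x))

# Broadcasting a 3-element column against a 1×6 row gives a 3×6 BitMatrix.
onehot = classes .== permutedims(x)
```

If the samples-by-classes layout is preferred, `permutedims(onehot)` gives the transposed 6×3 matrix.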

julia> x = ["a","b","a","c","c","b"]
julia> map(x->isequal(x...), Base.product(x, unique(x)))
6×3 Matrix{Bool}:
 1  0  0
 0  1  0
 1  0  0
 0  0  1
 0  0  1
 0  1  0
julia> Base.splat(isequal).(Base.product(x, unique(x)))
6×3 BitMatrix:
 1  0  0
 0  1  0
 1  0  0
 0  0  1
 0  0  1
 0  1  0
julia> mapreduce(u->tuple.(u[1],findall(==(u[2]),x)),vcat, enumerate(unique(x)))
6-element Vector{Tuple{Int64, Int64}}:
 (1, 1)
 (1, 3)
 (2, 2)
 (2, 6)
 (3, 4)
 (3, 5)
julia> mapreduce(((i,u),)->tuple.(i,findall(==(u),x)).=>1,vcat, enumerate(unique(x)))
6-element Vector{Pair{Tuple{Int64, Int64}, Int64}}:
 (1, 1) => 1
 (1, 3) => 1
 (2, 2) => 1
 (2, 6) => 1
 (3, 4) => 1
 (3, 5) => 1
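
The (class, position) pairs above are the coordinates of the nonzero entries, which is the form a sparse matrix constructor takes. A minimal sketch, assuming the SparseArrays standard library and the same `x`; the names `rows`, `cols`, and `onehot` are illustrative.

```julia
using SparseArrays

x = ["a", "b", "a", "c", "c", "b"]
classes = unique(x)  # first-appearance order, as in the pairs above

# Row index = class index, column index = position in x.
rows = [findfirst(==(v), classes) for v in x]
cols = collect(eachindex(x))

# One stored `true` per sample; everything else is an implicit zero.
onehot = sparse(rows, cols, fill(true, length(x)), length(classes), length(x))
```

Only one entry per sample is stored, which matters when the number of classes is large.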

FYI: https://github.com/JuliaRegistries/General/pull/58049

I created a simple package to perform this on BioSequence types. It seems to be one of the most efficient ways to do it:

PS: Sorry for necroposting…
