lmiq
November 12, 2021, 1:15am
22
Maybe something like this is what you want?
julia> struct OneHotCode{T} <: AbstractVector{T}
           i::T
           length::T
       end
julia> Base.getindex(o::OneHotCode{T}, i) where {T} = i == o.i ? one(T) : zero(T)
julia> Base.size(o::OneHotCode) = (o.length,)
julia> Base.length(o::OneHotCode) = o.length
julia> function conv_seq(seq::String)
           seq_out = OneHotCode{Int8}[]
           for char in seq
               char == 'A' ? hot = 1 :
               char == 'T' ? hot = 2 :
               char == 'C' ? hot = 3 :
               char == 'G' ? hot = 4 :
               char == 'N' ? hot = 5 :
               error("Illegal base")
               push!(seq_out, OneHotCode(Int8(hot), Int8(5)))
           end
           return seq_out
       end
conv_seq (generic function with 1 method)
julia> s = "TGTCCGGCTCACCCACATAACCATATATATATATATAGTATATAATAACCATTAACCAATATATATGGTTATGTGGCAACATCATTAATTTANAGAGATTTTACTATGGAATAATTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATTTTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"
"TGTCCGGCTCACCCACATAACCATATATATATATATAGTATATAATAACCATTAACCAATATATATGGTTATGTGGCAACATCATTAATTTANAGAGATTTTACTATGGAATAATTGTGTGAATNCAGATTTTCAAGGCTCAAAAGAATATTTTTTGTTGACAGAAAAAAGAATAATCAATATACTGTAT"
julia> conv_seq(s)
190-element Vector{OneHotCode{Int8}}:
[0, 1, 0, 0, 0]
[0, 0, 0, 1, 0]
⋮
[1, 0, 0, 0, 0]
[0, 1, 0, 0, 0]
A student of mine played with something like that to represent amino acids in ML code.
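For reference, here is a minimal sketch of the same idea with a lookup table instead of the chained ternaries. The names OneHot, BASE_INDEX and encode are hypothetical (they are not from the post above or from any package); the base order A, T, C, G, N is the one used above. Each element stores only the hot index and the length, so nothing is allocated per base beyond two Int8 fields.

```julia
# Lazy one-hot vector: only the hot position and the length are stored.
# (Hypothetical type name; a variation on the OneHotCode idea above.)
struct OneHot{T<:Integer} <: AbstractVector{Bool}
    hot::T
    len::T
end

Base.size(o::OneHot) = (Int(o.len),)
Base.IndexStyle(::Type{<:OneHot}) = IndexLinear()

function Base.getindex(o::OneHot, i::Integer)
    @boundscheck checkbounds(o, i)
    return i == o.hot   # true only at the hot position
end

# Assumed mapping from base to position, same order as in the post above.
const BASE_INDEX = Dict('A' => 1, 'T' => 2, 'C' => 3, 'G' => 4, 'N' => 5)

function encode(seq::AbstractString)
    n = Int8(length(BASE_INDEX))
    out = Vector{OneHot{Int8}}(undef, length(seq))
    for (k, c) in enumerate(seq)
        haskey(BASE_INDEX, c) || throw(ArgumentError("illegal base: $c"))
        out[k] = OneHot(Int8(BASE_INDEX[c]), n)
    end
    return out
end

codes = encode("TGA")
```

Calling collect(codes[1]) materializes an ordinary Vector{Bool} when a dense vector is needed; otherwise the elements behave like read-only vectors.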
Looking at all this, maybe one-hot should be added to base?
No, these things belong in packages.
7 Likes
jamaas
November 15, 2021, 2:56pm
25
Thanks for this, yes, it is exactly what I need. My only question is whether it is possible to get the same result using one of the canned functions already built into a package such as Flux, MLBase, or MLJ?
In MLJ, by specifying that some feature or some target of your data is either Multiclass or OrderedFactor (*), you don’t have to bother with OHE… MLJ passes this information down to the model interface, which does the most appropriate encoding there for the model you intend to use without requiring you to roll your own.
(*) depending on your case, so for the ATGC case it would be Multiclass{4} and for some scoring it would be OrderedFactor. See also ScientificTypes.jl.
6 Likes
e3c6
December 4, 2021, 4:58pm
29
Thanks! I think I would only add a sort, so that the mapping is more intuitive.
sort(unique(x)) .== reshape(x, (1, size(x)...))
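To make the orientation of that expression concrete, here is a small sketch with made-up data (the vector x and the names levels and onehot are only for illustration). Broadcasting a column of sorted levels against a 1×n row gives one row per class and one column per observation.

```julia
x = ["b", "a", "c", "a"]          # made-up example data

levels = sort(unique(x))          # ["a", "b", "c"]: row k corresponds to levels[k]
onehot = levels .== reshape(x, (1, size(x)...))   # 3×4 BitMatrix

# Each column has exactly one true entry, at the row of that observation's level.
hot_rows = [findfirst(onehot[:, j]) for j in axes(onehot, 2)]
```

Here permutedims(x) would give the same 1×n row as the reshape call. If observations should run along the rows instead, transpose the result or swap the two operands.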
x=["a","b","a","c","c","b"]
julia> map(x->isequal(x...), Base.product(x, unique(x)))
6×3 Matrix{Bool}:
1 0 0
0 1 0
1 0 0
0 0 1
0 0 1
0 1 0
julia> Base.splat(isequal).(Base.product(x, unique(x)))
6×3 BitMatrix:
1 0 0
0 1 0
1 0 0
0 0 1
0 0 1
0 1 0
mapreduce(u->tuple.(u[1],findall(==(u[2]),x)),vcat, enumerate(unique(x)))
6-element Vector{Tuple{Int64, Int64}}:
(1, 1)
(1, 3)
(2, 2)
(2, 6)
(3, 4)
(3, 5)
mapreduce(((i,u),)->tuple.(i,findall(==(u),x)).=>1,vcat, enumerate(unique(x)))
6-element Vector{Pair{Tuple{Int64, Int64}, Int64}}:
(1, 1) => 1
(1, 3) => 1
(2, 2) => 1
(2, 6) => 1
(3, 4) => 1
(3, 5) => 1
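The (class, position) pairs above are essentially the coordinate form of a sparse one-hot matrix. As a sketch, assuming the SparseArrays standard library is acceptable, the same indices can be passed straight to sparse; the variable names here are only illustrative.

```julia
using SparseArrays

x = ["a", "b", "a", "c", "c", "b"]
levels = unique(x)

# Row index = position of each value in `levels`, column index = position in `x`.
rows = [findfirst(==(v), levels) for v in x]
cols = collect(eachindex(x))

# One stored `true` per observation; all other entries are implicit zeros.
S = sparse(rows, cols, fill(true, length(x)), length(levels), length(x))

dense = Matrix(S)   # 3×6 Matrix{Bool}, only needed if a dense copy is wanted
```

For many classes and long inputs this stores one entry per observation rather than a full classes × observations array.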
2 Likes
I created a simple package to perform this on BioSequence types. It seems to be one of the most efficient ways to do it:
A small Julia package to represent BioSequences as a Voss matrix
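For readers unfamiliar with the term, a Voss matrix is a binary indicator matrix with one row per nucleotide and one column per sequence position. The following is only a plain-Julia sketch on a String, not the package's API; the function name voss_matrix and the row order A, C, G, T are assumptions made for illustration.

```julia
# Sketch of a Voss (binary indicator) matrix for a DNA string.
# Row order follows `alphabet`; characters outside it give all-false columns.
function voss_matrix(seq::AbstractString; alphabet = ('A', 'C', 'G', 'T'))
    return collect(alphabet) .== permutedims(collect(seq))
end

V = voss_matrix("ACGTA")   # 4×5 BitMatrix
```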
PS: Sorry for necroposting…
4 Likes