Coming from Python, this took a while to figure out.

# throws a non-descriptive error: iterating the String yields Chars,
# which never equal any of the String labels
prot_alphabet = ["R", "H", "K",
                 "D", "E",
                 "S", "T", "N", "Q",
                 "C", "U", "G", "P",
                 "A", "V", "I", "L", "M", "F", "Y", "W"]

using Flux: onehotbatch
tokenise(s) = onehotbatch(s, prot_alphabet)

prot_seq = "GAQLLNYASYFAKMAIKLDRKG"
tokenise(prot_seq)

vs

# gives a nice OneHotMatrix (21 labels × 22 residues)
prot_alphabet = ['R', 'H', 'K',
                 'D', 'E',
                 'S', 'T', 'N', 'Q',
                 'C', 'U', 'G', 'P',
                 'A', 'V', 'I', 'L', 'M', 'F', 'Y', 'W']

using Flux: onehotbatch
tokenise(s) = onehotbatch(s, prot_alphabet)

prot_seq = "GAQLLNYASYFAKMAIKLDRKG"
tokenise(prot_seq)

It's the String vs Char thing, of course.
But what do I do if I want to one-hot encode something where the labels are not single Chars but longer Strings?

Here your first input is a String, which Julia iterates as a sequence of Chars, which is why onehotbatch wants a Vector of Chars as its second input. If you were using a vector of Strings (or SubStrings) instead, then you would need a vector of Strings on both sides.
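For example (a minimal sketch; the three-letter tokens here are just hypothetical labels):

using Flux: onehotbatch

vocab  = ["ALA", "GLY", "LYS"]         # labels: a Vector of String
tokens = ["GLY", "ALA", "ALA", "LYS"]  # input: also a Vector of String
onehotbatch(tokens, vocab)             # 3×4 OneHotMatrix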

In the case where you have one big String and you want to tokenize it, you would first split it and then embed it, for example for word tokens:
s = "the cat sat on the mat"  # hypothetical example sentence
s_split = split(s, " ")       # word tokens, a Vector of SubString
oh_s = onehotbatch(s_split, unique(s_split))
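If you later need to map the one-hot columns back to tokens, Flux.onecold inverts the encoding:

using Flux: onecold
onecold(oh_s, unique(s_split))  # recovers s_split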

For a special vocabulary I would still go with a loop that splits the input and recognises parts of the vocab, as in the sketch below.
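A minimal sketch of that idea, assuming a greedy longest-match rule (tokenise_vocab and the example vocab are hypothetical, not part of Flux):

using Flux: onehotbatch

# Greedy longest-match tokeniser: at each position take the longest
# vocab entry that matches; unknown characters become single-Char tokens.
function tokenise_vocab(s::AbstractString, vocab::Vector{String})
    sorted = sort(vocab, by = length, rev = true)  # try longer entries first
    tokens = String[]
    i = firstindex(s)
    while i <= lastindex(s)
        matched = false
        for v in sorted
            if startswith(SubString(s, i), v)
                push!(tokens, v)
                i = nextind(s, i, length(v))  # skip past the matched entry
                matched = true
                break
            end
        end
        if !matched
            push!(tokens, string(s[i]))  # unknown: emit a single character
            i = nextind(s, i)
        end
    end
    return tokens
end

vocab = ["His", "Lys", "Arg"]                # hypothetical multi-char labels
toks  = tokenise_vocab("HisLysXArg", vocab)  # ["His", "Lys", "X", "Arg"]
onehotbatch(toks, unique(toks))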

Thanks, super clear now. It's a steep learning curve, but I'll get there.
