Coming from Python, this took me a while to figure out.

# throws a non-descriptive error

using Flux: onehotbatch
tokenise(s) = onehotbatch(s, prot_alphabet)



# gives a nice OneHotMatrix

using Flux: onehotbatch
tokenise(s) = onehotbatch(s, prot_alphabet)


It's the String vs. Char distinction, of course.
But what should I do if I want to one-hot encode something where the labels are not single characters but longer strings?

Here your first input is a String, meaning a sequence of Char, which is why onehotbatch expects a vector of Char as the second input. If you were using a vector of String (or SubString) instead, then you would need a vector of String for both arguments.
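To make the Char vs. String point concrete, here is a minimal sketch. The thread does not show how prot_alphabet was defined, so char_alphabet and string_alphabet below are hypothetical stand-ins, assuming the 20 standard amino-acid letters.

```julia
using Flux: onehotbatch

# Hypothetical alphabets for illustration; the thread does not show prot_alphabet.
char_alphabet   = collect("ACDEFGHIKLMNPQRSTVWY")   # Vector{Char}
string_alphabet = string.(char_alphabet)            # Vector{String}

# A String iterates as Chars, so the labels must be Chars.
oh_chars = onehotbatch("ACDA", char_alphabet)

# A Vector{String} iterates as Strings, so the labels must be Strings.
oh_strings = onehotbatch(["A", "C", "D", "A"], string_alphabet)

# Mixing the two fails, because a Char never compares equal to a String:
# onehotbatch("ACDA", string_alphabet)   # throws an error
```

Both calls should give a matrix with one row per label and one column per input element.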

If you have one big String and want to tokenize it, you would first split it and then encode it, for example for word tokens:
s_split = split(s," ")
oh_s = onehotbatch(s_split, unique(s_split))

For a special vocabulary, I would still use a loop to split the string and recognise the vocabulary entries.
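One possible shape for such a loop is a greedy longest-match split. This is only a sketch: split_by_vocab and the example vocab are hypothetical names, not part of Flux, and it assumes every vocabulary entry is non-empty and that the input is fully covered by the vocabulary.

```julia
using Flux: onehotbatch

# Hypothetical vocabulary with multi-character tokens (not from the thread).
vocab = ["<pad>", "Ala", "Gly", "Ser"]

# Greedy longest-match split: at each position, take the longest vocab entry
# that the remaining text starts with. Assumes all entries are non-empty.
function split_by_vocab(s::AbstractString, vocab)
    sorted = sort(vocab; by = length, rev = true)   # try longest tokens first
    tokens = String[]
    i = firstindex(s)
    while i <= lastindex(s)
        rest = SubString(s, i)
        k = findfirst(t -> startswith(rest, t), sorted)
        k === nothing && error("no vocab entry matches at index $i of \"$s\"")
        push!(tokens, sorted[k])
        i += ncodeunits(sorted[k])   # jump past the matched token
    end
    return tokens
end

tokens = split_by_vocab("AlaGlySerAla", vocab)
oh = onehotbatch(tokens, vocab)   # one column per token, one row per vocab entry
```

Sorting by length first means a longer entry wins over a shorter one that shares its prefix. Whether greedy matching is the right rule depends on the vocabulary.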

Thanks, super clear now. It’s a steep learning curve, but I’ll get there.
