Coming from Python this took a while to figure out

Fenrir_Sivar · March 17, 2024, 6:38am

# throws a non-descriptive error
prot_alphabet=["R","H","K",
               "D","E",
               "S","T","N","C","Q",
               "C","U","G","P",
               "A","V","I","L","M","F","Y","W"]

using Flux: onehotbatch
tokenise(s) = onehotbatch(s, prot_alphabet)

prot_seq="GAQLLNYASYFAKMAIKLDRKG"
tokenise(prot_seq)

vs

# gives a nice OneHotMatrix
prot_alphabet=['R','H','K',
               'D','E',
               'S','T','N','C','Q',
               'C','U','G','P',
               'A','V','I','L','M','F','Y','W']

using Flux: onehotbatch
tokenise(s) = onehotbatch(s, prot_alphabet)

prot_seq="GAQLLNYASYFAKMAIKLDRKG"
tokenise(prot_seq)

It's the String vs Char difference, of course.
But what should I do if I want to one-hot encode something where the labels are not single chars but longer strings?

yolhan_mannes · March 17, 2024, 8:18am

Here your first input is a String, which iterates as a sequence of Char, which is why onehotbatch expects a vector of Char as the second input. If you passed a vector of String (or SubString) instead, then the labels would need to be a vector of String as well.
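
A minimal sketch of that second case, in case it helps. The three-letter labels and the sequence here are made up for illustration and are not from the original post:

```julia
using Flux: onehotbatch

# A Char never compares equal to a String, which is why the first version fails:
# iterating "GAQ..." yields 'G', and 'G' is not found among the labels "G", "A", ...
'G' == "G"  # false

# Hypothetical multi-character labels (three-letter codes), for illustration only.
labels = ["Gly", "Ala", "Gln", "Leu"]

# The input is a Vector{String}, so each element is compared against the String labels.
seq = ["Gly", "Ala", "Gln", "Leu", "Leu"]
oh = onehotbatch(seq, labels)

size(oh)  # (4, 5): one row per label, one column per token
```

If you would rather keep String labels for a single-character alphabet, converting the input with something like `string.(collect(prot_seq))` should also work, since that gives a vector of one-character Strings.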

yolhan_mannes · March 17, 2024, 8:25am

If you have one big String and want to tokenize it, you would first split it and then encode the pieces. For example, for word tokens:
s_split = split(s," ")
oh_s = onehotbatch(s_split, unique(s_split))

For a special vocabulary I would still go with a loop that splits the string and recognises the parts that belong to the vocab.
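
One way such a loop could look is a greedy longest-match scan. This is only a sketch: the `vocab` contents and the function name `tokenise_vocab` are hypothetical, and greedy matching is just one possible strategy for deciding where tokens end.

```julia
using Flux: onehotbatch

# Hypothetical vocabulary with labels of different lengths (illustration only).
vocab = ["Gly", "Ala", "G", "A", "Q"]

# Greedy longest-match tokeniser: at each position, take the longest vocab
# entry that the remaining text starts with, then jump past it.
function tokenise_vocab(s::AbstractString, vocab)
    sorted = sort(vocab; by = length, rev = true)  # try longer labels first
    tokens = String[]
    i = 1
    while i <= ncodeunits(s)
        rest = SubString(s, i)
        j = findfirst(v -> startswith(rest, v), sorted)
        j === nothing && error("no vocab entry matches at position $i")
        push!(tokens, sorted[j])
        i += ncodeunits(sorted[j])  # advance past the matched label
    end
    return tokens
end

tokens = tokenise_vocab("GlyAlaQGA", vocab)  # ["Gly", "Ala", "Q", "G", "A"]
oh = onehotbatch(tokens, vocab)              # 5×5 OneHotMatrix
```

Trying longer labels first matters here because "G" is a prefix of "Gly"; without the sort, the scan could match "G" and then fail on "ly".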

Fenrir_Sivar · March 17, 2024, 8:28am

Thanks, super clear now. It’s a steep learning curve, but I’ll get there.
