```julia
model = Chain(Embed(200, 37), Dense(200, 150), Dense(150, 100))
```

The output of Embed (from Transformers.jl) for a length-100 input vector is an embedding matrix of size (200, 100), so at the end of this model the result is a 100×100 matrix.
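The shape flow can be checked with Flux alone, substituting Flux's own `Embedding` layer for Transformers.jl's `Embed` (an assumption here: both map a vector of token indices to an `(embedding_dim, sequence_length)` matrix):

```julia
using Flux

# Stand-in for Embed(200, 37): maps indices in 1:37 to 200-dim vectors.
# (Assumes a recent Flux with the `in => out` layer syntax.)
model = Chain(Flux.Embedding(37 => 200), Dense(200 => 150), Dense(150 => 100))

x = rand(1:37, 100)    # a length-100 sequence of token indices
size(model(x))         # (100, 100): 100 features per character, 100 characters
```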

I want each character (each slice along dimension 2 of the matrix) to get one of 14 possibilities (perhaps through a softmax or similar). That means I need to run Dense(100, 14) followed by softmax on each column of the output matrix. How do I do that? Preferably the same Dense for all columns; the model is already a bit too big.
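Conveniently, Flux's `Dense` already applies itself to every column of a matrix (trailing dimensions are treated as a batch), so a single shared `Dense(100, 14)` plus `softmax` can simply be appended to the Chain. A minimal sketch on a dummy 100×100 output:

```julia
using Flux

# One shared classification head applied column-wise; softmax normalizes
# over dims=1 by default, i.e. per column.
head = Chain(Dense(100 => 14), softmax)

out = randn(Float32, 100, 100)  # stand-in for the model's (100, 100) output
probs = head(out)               # (14, 100): one 14-way distribution per column
```

Each column of `probs` sums to 1, and the same weights are reused for every position, so this adds only 100×14 + 14 parameters.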

Does 100 represent the number of characters? If so, you’re missing a batch dimension. I’d have a look at your input and make sure it has the correct shape (sequence_length x batch).

Each input array is of length 100, and each output array (from the network) has dimensions (23×100). I simply turned my problem into a classification problem for each character (one of the 100), so the data labels are now one-hot arrays of size (23×100) each.

My input is now a tensor of dimensions (100×N) and my output a tensor of dimensions (23×100×N), where N is the number of lines in my document. I then use DataLoader to cut it into batches and feed it to the network. My issue now is low memory (I only have 8 GB, 1.8 of which is used just to load 20% of the training dataset), but I'm going to use another computer for it.
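That batching step might look like the following sketch (shapes and sizes are hypothetical, and it assumes a recent Flux where `onehotbatch` accepts a matrix and `DataLoader` accepts a tuple):

```julia
using Flux

N = 8                                  # hypothetical number of lines
X = rand(1:37, 100, N)                 # (100, N) token indices
labels = rand(1:23, 100, N)            # a class in 1:23 for each character
Y = Flux.onehotbatch(labels, 1:23)     # (23, 100, N) one-hot labels

# DataLoader slices along the last dimension of each array in the tuple,
# so batches come out as (100, batchsize) and (23, 100, batchsize).
loader = Flux.DataLoader((X, Y); batchsize=4, shuffle=true)
```

One-hot arrays are stored compactly (as indices, not dense floats), which should also help with the memory pressure compared to materializing full Float32 label tensors.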

Also, since in my case I'm interested in the classification of 100 elements, I suppose it's fine that the loss is around ~2.1 per line (input array).
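As a sanity baseline (my own check, not from the thread): uniform predictions over 23 classes score log(23) ≈ 3.14 mean crossentropy per character, so if the ~2.1 is a per-character mean, the model is already doing better than random guessing:

```julia
using Flux

# Uniform predictions over 23 classes for 100 characters.
ŷ = fill(1f0 / 23, 23, 100)
y = Flux.onehotbatch(rand(1:23, 100), 1:23)

# crossentropy averages the per-column -sum(y .* log.(ŷ)) terms,
# which for a uniform guess is exactly log(23) per character.
Flux.crossentropy(ŷ, y)   # ≈ log(23) ≈ 3.135
```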