Found Bug in Flux

I did indeed read the original thread, but after this clarification I’m afraid I’m more confused about the use case. This is generally why we ask for MWEs, but in this case I think it’s more a question of having a description of everything else:

  1. What is the data modality? I find that the modality should determine what format the data is in, rather than the other way around. For example, lower-frequency, ragged data from a medical record might be more appropriate for a time-distributed representation, whereas higher-frequency sensor data would be better in a dense, non-ragged one.
  2. What is the task? Here I mean not how it should be done, but what the objective is. I ask this specifically because nesting sequence models on higher-dimensional data, as you describe in the original thread, is highly unusual and unlikely to perform well in any framework if implemented exactly as described, so perhaps there is an alternative formulation that is more appropriate for the task at hand.
  3. Are there existing implementations of the techniques you’re trying on similar datasets? Translating from someone else’s existing code and/or algorithm is far easier than starting from scratch, as the hard work of validating said algorithm has already been done!
  4. If not, can you write out pseudo-code / an algorithm with special focus on exemplary dummy data? Then it’s clearer what the inputs look like (not just structurally, but content-wise too), how they flow through the model, what the targets look like, how the two are compared, etc.
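
To illustrate the level of detail I mean in point 4, here is a rough sketch in plain Julia (no Flux calls). Everything in it is made up for illustration: the sizes, the variable names, the scalar regression target, and `toy_model`, which is only a stand-in for whatever the real model is. It is not a suggestion for how your problem should actually be modelled.

```julia
# Hypothetical dummy data; all sizes and names are assumptions for illustration.
n_features, n_steps, n_samples = 3, 5, 2

# Dense, non-ragged layout: a single (features x time x batch) array.
x_dense = rand(Float32, n_features, n_steps, n_samples)

# Ragged layout: one (features x time) matrix per sample, with differing lengths.
x_ragged = [rand(Float32, n_features, t) for t in (4, 7)]

# Targets: one scalar per sample (assumed regression task).
y = rand(Float32, n_samples)

# Stand-in for the real model: the mean over all features and time steps.
toy_model(x) = sum(x) / length(x)

# How inputs flow through the model: one prediction per sample.
y_pred = [toy_model(x) for x in x_ragged]

# How predictions and targets are compared: mean squared error.
loss = sum(abs2, y_pred .- y) / length(y)
```

Even something this small answers most of the questions above: what a single sample looks like, whether sequences are ragged, what the targets are, and what is being minimised.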