Flux - LSTM - Issue with input format for multiple features

Hi community,

I’m diving into machine learning in Julia and currently having some issues with the input format for an LSTM model. The goal is to predict the current activity of a human based on motion data from a smartphone:


The data has 9 features and a sequence length of 128. With batch loading via DataLoader, the input currently has the shape 9x128xbatch_size as an Array{Float32,3}. No matter how I try to apply the model (currently with broadcasting), the training quits with either a DimensionMismatch or MethodError: no method matching *(::Array{Float32,2}, ::Array{Float32,3}).

I searched the documentation and examples for hints about the input shape for multiple features, but did not find anything helpful.

The code of the project can be accessed under

Thanks for help!

Hi @thomaszub!

Flux RNNs and broadcasting work differently than the Python alternatives. In short, RNNs in Julia don’t work on 3d arrays (except when using CUDA, but this is handled at compile time, I believe). Instead they work on arrays of matrices.

What this means for your example can be summarized in the following snippet:

julia> using Flux
julia> rnn = RNN(9, 1);
julia> lstm = LSTM(9, 1);
julia> x = rand(9, 128, 32);
julia> map(rnn, [view(x, :, t, :) for t ∈ 1:128])
julia> map(lstm, [view(x, :, t, :) for t ∈ 1:128])

I’m using map instead of the normal broadcast because of a bug when moving to Zygote. I believe it is fixed on master, but I don’t know which release it is a part of. Effectively, the gradients were being truncated to only one step back in time.
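If you only need a prediction once the whole sequence has been seen (as in activity classification), a common pattern is to reset the hidden state, run over all timesteps, and keep only the last output. A minimal sketch (the 6-class output size here is a made-up example):

```julia
using Flux

lstm = LSTM(9, 6)                 # 9 features in, 6 classes out (hypothetical)
x = rand(Float32, 9, 128, 32)     # features × seq_len × batch

Flux.reset!(lstm)                 # clear hidden state between sequences
outputs = map(t -> lstm(view(x, :, t, :)), 1:128)
y_hat = outputs[end]              # 6×32 matrix: prediction after the full sequence
```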


Thanks! This solved the issue.

Is there any reason for Flux not to use 3d arrays? I think this would make the API more convenient.

Good question. I wasn’t around when Flux was originally written, but I assume it has to do with how matrix multiplication works in Julia and Flux’s overall goal of being just normal-looking Julia code.

Imagine you have a chain

julia> model = Chain(Dense(2, 10), RNN(10, 1))

The Chain is just a convenient operator for chaining information forward. The Dense layer is very simple under the hood, resulting in this operation for an input x (you can find this for 0.11.6 here):

function (a::Dense)(x::AbstractArray)
  W, b, σ = a.W, a.b, a.σ
  σ.(W*x .+ b)
end

Now if we use 3d arrays, this doesn’t work because W (a matrix) and x (now a 3d array) have no default behavior for multiplication; in fact, there are so many reasonable ways to define it that it wouldn’t make sense for * to pick one. To make this work with 3d arrays in the most “Julian” way (at least by my reckoning), we would dispatch on the number of dimensions of the array and specialize. Every layer (Dense, recurrent, conv, etc.) would then need a special method for 3d arrays, adding to the code complexity of Flux. What such a method would look like is almost exactly (I think) what I wrote above, just repeated for each layer individually. In fact, what I wrote above should be just as fast as anything you would write here, because for-loops are fast in Julia, so there isn’t much optimization to be had, and as far as I know there aren’t any specialized BLAS operations for this.
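To make that concrete, a specialized 3d method for Dense might look something like this. This is a hypothetical sketch, not Flux code, and every other layer type would need its own analogue:

```julia
# Hypothetical sketch — NOT part of Flux. Dispatch on 3d input and
# reuse the matrix rule for every time slice (dim 2 = time).
function (a::Dense)(x::AbstractArray{<:Real,3})
    W, b, σ = a.W, a.b, a.σ
    slices = [σ.(W * view(x, :, t, :) .+ b) for t in 1:size(x, 2)]
    # re-stack the out_dim × batch slices along the time dimension
    cat([reshape(s, size(s, 1), 1, size(s, 2)) for s in slices]...; dims = 2)
end
```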

Instead, it makes more sense to use the broadcast or map functionality. This cuts down on specialized code in Flux and makes running a model feel very Julian/intuitive. There is no confusion about what the broadcast is doing, whereas in the 3d-array version I have to check which dimension is the time index.

It may seem more convenient to have support for 3d arrays because this is what TensorFlow/PyTorch do. But they have to do this for efficiency: Python for-loops can be slow, so they run the loop in C instead. This isn’t a bad thing; it’s actually quite clever. It’s just that Julia doesn’t have that constraint, so why design around it?

Of course, this doesn’t apply to GPU versions, where cuDNN specializes for 3d arrays. But I’m 99% sure Julia handles this at compilation time / when the data gets moved to the GPU (which needs to happen anyway), so the user-facing code stays the same (I’m pretty sure) and dispatch does all the onerous stuff.

Thanks for the explanation.

You gave good arguments about the design. From the mathematical point of view it now makes even more sense to me, as the temporal nature is handled by reapplying the network sequentially to the time series. So Julia maps more “naturally” to this perspective.

Hello, I had a look at the code you posted after using map. Do you mind if I ask why you only took the last row in the function apply? Isn’t that basically calculating only the last value and disregarding all the other values?
I thought a more sensible approach would be to take all the values produced by map and create your own loss function as below. Let me know what you think:

function lossfn(x, y)
  y_model = eval_model(x)
  # adjusting the loss function for mean squared error: Flux.mse(y_model, y)
  diff = [mean(abs2, y[i] .- [x[i] for x in y_model]) for i in 1:length(y)]
  return mean(diff)
end
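For comparison, a last-step-only loss along the lines discussed earlier might look like this. This is a sketch; it assumes eval_model returns the vector of per-timestep outputs produced by map:

```julia
# Sketch: score only the final prediction, ignoring intermediate outputs.
# Assumes eval_model(x) returns a vector of per-timestep outputs, as above.
function lossfn_last(x, y)
    y_model = eval_model(x)
    return Flux.mse(y_model[end], y)
end
```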