Found Bug in Flux

My Flux model chain works perfectly when given an input or when calculating the loss on my training data. However, when I try to calculate the gradient, my model chain exhibits unexpected behaviour.

My input data is a 3D tensor with shape (n x m x p). My model chain begins with c = Chain(DataLoader, rnns, ...). The DataLoader splits the 3D tensor into p 2D slices, and the rnns function loops a Flux RNN over those slices, roughly [rnn(slice) for slice in slices][end]. The result is m column vectors that have each passed through the RNN, i.e. an (n x m) matrix. This works for calculating the output and the loss.

However, this does not work when calculating the gradient. For some reason, when calculating the gradient the DataLoader simply converts the 3D CuArray to a 3D (CPU) Array. It does not batch the data into slices as it does when running input through the model chain normally or when calculating the loss.

I have attached an MWE below. As you can see, everything works as expected until the last line of code, the gradient calculation, which errors because the for loop over the DataLoader iterates over the wrong type (Float32 elements instead of the CuArray slices the DataLoader normally yields).

Thank you,
Jack

using Flux
using MLUtils
using CUDA

struct RNNModel
    rnn_model::Flux.Recur
end

function (m::RNNModel)(data)
    cur = nothing

    for i in data # dataloader
        println("hi!")
        println(typeof(i))
        println("typed")
        println(i)
        println("printed")
        cur = m.rnn_model(i::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
    end

    return cur
end

Flux.@functor RNNModel


n = 5
m = 4
p = 3

function printdata(data)
    println(data)
    println(typeof(data))
    println("data printed")
    return data
end

rnn = RNN(n=>n) |> gpu
rnns = RNNModel(rnn) |> gpu
c = Chain(DataLoader, printdata, rnns, softmax)
loss(x, y_target) = Flux.Losses.crossentropy(c(x), y_target)

x_train = randn(n, m, p) |> gpu
y_train = randn(n, m, 1) |> softmax |> gpu

c(x_train)

loss(x_train, y_train)

gradient(()->loss(x_train, y_train), Flux.params(c))

I think the confusion here comes from trying to use DataLoader as part of a model. Much as you wouldn’t write nn.Sequential(DataLoader(...), ...) in PyTorch, data loading in Flux lives outside of your model’s forward and backward passes. Instead, write your training loop much as you would for any other framework:

# Pseudo-Julia
# ... initialize model, grab parameters, etc.
dl = DataLoader(...)
for (x, y) in dl
  # Notice how dl does not show up here!
  grads = gradient(() -> loss(model, x, y), ...)
end
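
To make that concrete, here is a minimal, untested sketch of the pattern with a toy model (the Dense layer, shapes, and names are made up for illustration, not taken from your code):

using Flux
using MLUtils

model = Chain(Dense(5 => 5), softmax) |> gpu
ps = Flux.params(model)
opt = Descent(0.001)

x = randn(Float32, 5, 100)
y = softmax(randn(Float32, 5, 100))

# The DataLoader lives outside the model and only yields (x, y) batches.
dl = DataLoader((x, y), batchsize = 10)

for (xb, yb) in dl
    xb, yb = gpu(xb), gpu(yb)
    grads = gradient(() -> Flux.Losses.crossentropy(model(xb), yb), ps)
    Flux.Optimise.update!(opt, ps, grads)
end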

We like to think we’ve covered this in the data loading and training loop sections of the docs, but if you feel things could be improved please let us know. In general though, Flux is no different from other libraries when it comes to the separation between data and model.


Thank you for your response. My MWE did not include my full model chain, which may provide some context for why I require the DataLoader (or some other way of slicing within the model chain); I have included the full chain at the bottom of this post.

My initial post on the topic is here: How to take full advantage of GPU Parallelism on Nested Sequential Data in Flux - #4 by jonathan-laurent. In the MWE I gave above in this question, (m, n, p) translates to (num_features, max_inner_seq_length, max_outer_seq_length * batch_size) in @jonathan-laurent’s excellent answer.

@jonathan-laurent gave me very helpful advice on how to process this data while taking advantage of the fast, parallel nature of the GPU (which worked with my DataLoader-based implementation up until I tried to compute the gradients).

As you can see by reading his answer, not only do I need to loop over slices at the beginning of the chain (which I agree could be done outside the chain), but in the middle of the model chain I also need to reshape the data from 2D to 3D and again loop over slices.

Since this second slicing and reshaping step must be applied between my two recurrent models, there is no way to avoid having some method of slicing 3D data inside the model chain. You have made it clear to me that DataLoader is likely not the right choice.

I am very much open to suggestions, and I would love to know what you would recommend for creating slices of a 3D tensor and reshaping a 2D tensor into a 3D tensor within the model chain, while ensuring the gradient can still be calculated.
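
To be concrete, something along these lines (an untested sketch; the helper name is made up) is the kind of slicing I mean, just done without a DataLoader:

# Hypothetical helper (name made up): apply the RNN to each (n x m) slice of a
# 3D array and keep the output of the final slice, using plain indexing that
# Zygote can differentiate through.
apply_rnn_over_slices(rnn, x) = [rnn(x[:, :, t]) for t in 1:size(x, 3)][end]

# e.g. used in a chain as: Chain(x -> apply_rnn_over_slices(rnn, x), softmax)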

Moreover, I am very much open to any better ways of implementing @jonathan-laurent’s excellent advice for computing my model using the GPU (and thus tensors).

Please feel free to jump in as well @jonathan-laurent if you know how I can implement your excellent advice too!

Thank you,

Jack


For reference, my full model chain at the moment is below. Again, it fully works for computing outputs given inputs, and for computing the loss given inputs and targets, but it does not work when it comes to computing the gradient.

Please let me know if you have any questions.

c = Chain(
          DataLoader,
          rnns,
          d->d[:, :, 1],
          d->reshape(d, (output_feature_len, max_outer_sequence_len, :)),
          d->permutedims(d, (1, 3, 2)),
          DataLoader,
          rnns2,
          d->d[:, :, 1],
          softmax
) |> gpu

I did indeed read the original thread, but after this clarification I’m afraid I’m more confused about the use case. This is generally why we ask for MWEs, but in this case I think it’s more a question of having a description of everything else:

  1. What is the data modality? I find that this helps decide what format the data should be in, rather than the other way around. For example, lower-frequency, ragged data from a medical record might be more appropriate for a time-distributed representation, whereas higher-frequency sensor data would be better in a dense, non-ragged one.
  2. What is the task? Here I mean not how it should be done, but what the objective is. I ask this specifically because nesting sequence models on higher-dimensional data as you describe in the original thread is highly unusual and unlikely to perform well in any framework if implemented exactly as described, so perhaps there is some alternative formulation which is more appropriate for the task at hand.
  3. Are there existing implementations of the techniques you’re trying on similar datasets? Translating from someone else’s existing code and/or algorithm is far easier than starting from whole cloth, as the hard work of validating said algorithm has already been done!
  4. If not, can you write out pseudo-code / an algorithm with special focus on exemplary dummy data? Then it’s more clear what the inputs look like (not just structurally, but content-wise too), how they flow through the model, what the targets look like, how the two are compared, etc.

Thank you for your response. Unfortunately I cannot go into the details of the data, but I can try to give you some more clarity. As mentioned in my original post here, How to take full advantage of GPU Parallelism on Nested Sequential Data in Flux - #4 by jonathan-laurent, I have data in the form of sequences of sequences which I would like to analyze. While I could probably just concatenate the sub-sequences and process them as one sequence with a single recurrent model, I believe that the model structure I have chosen will be much better able to follow the structure of my data.

I appreciate your advice to use a more common approach, but my main question now is simply: how can I process the columns of 2D slices of 3D data, one slice at a time, in a Flux model chain that still allows the gradient to be computed and takes advantage of processing each column of a slice in parallel on the GPU?

I believe that my initial MWE in this post clearly shows the data manipulation I am trying to do with 3D data. I would greatly appreciate it if you could give me some direction on how to implement it in a way that allows the gradient to be computed and takes advantage of the GPU. Then I should be able to apply that to my full model chain!

Thank you for your help,

Jack

I suppose where I’m confused is in the discrepancy between the problem description and the code shown here. Per the docs, Flux RNN layers operate on inputs of shape ((features, batch), timesteps) (i.e. AbstractVector{AbstractMatrix}) or (features, batch, timesteps) (i.e. AbstractArray{..., 3}). Here it seems like you’re alternating between inner and outer sequence dimensions as features and timesteps, but are those sequences always guaranteed to be the same length across samples in the same batch? If so, then the easiest way to go is to remove the DataLoaders from your Chain entirely and rely on the RNN interface’s native support for RNN(array of (features, batch, timesteps)) -> array of (features, batch, timesteps).
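
To illustrate the two input formats, here is a rough, untested sketch with made-up sizes, assuming the Recur-based RNN from your MWE:

using Flux

feat, batch, timesteps = 5, 4, 3
m = RNN(feat => feat)

# Format 1: a vector of (features, batch) matrices, one per timestep.
xs = [rand(Float32, feat, batch) for _ in 1:timesteps]
Flux.reset!(m)
ys = [m(x) for x in xs]            # vector of (features, batch) outputs

# Format 2: a single (features, batch, timesteps) array.
x3 = rand(Float32, feat, batch, timesteps)
Flux.reset!(m)
y3 = m(x3)                         # (features, batch, timesteps) output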


Thank you very much for your response. I tried to implement it but am now running into another issue.

I am now processing the data as suggested, as a vector of matrices, i.e. with shape ((features, batch), timesteps). My chain simply recurs through each matrix and returns a matrix. The loss is then computed column-wise against a target matrix.

I tried to compute the gradient of this model in order to train it, but I encountered an error that I do not understand and do not know how to solve. It occurs when calculating the gradient over all the data together, as seen in the last two lines of my MWE below. Any help solving the error would be very much appreciated. Also, please let me know if there is another way I should implement your suggestion.

Thank you,

Jack

using Flux
using MLUtils
using CUDA

struct RNNModel
    rnn_model::Flux.Recur
end

function (m::RNNModel)(data)
    cur = nothing

    for i in data # dataloader
        println("hi!")
        @info typeof(data)
        println(typeof(i))
        println("typed")
        println(i)
        println("printed")
        cur = m.rnn_model(i::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    end

    return cur
end

Flux.@functor RNNModel


n = 5
m = 4
p = 3

function printdata(data)
    println(data)
    println(typeof(data))
    println("data printed")
    return data
end

rnn = RNN(n, n) |> gpu
rnns = RNNModel(rnn) |> gpu
c = Chain(printdata, rnns, softmax) |> gpu
loss(x::Vector{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, y_target::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}) = Flux.Losses.crossentropy(c(x), y_target)
x_train = [randn(n, p) for i in 1:m] |> gpu
y_train = randn(n, p) |> softmax |> gpu

y_predict = c(x_train)

loss(x_train, y_train)

Flux.Optimise.train!(loss, Flux.params(c), [(x_train, y_train)], Descent(0.001))
gradient(()->loss(x_train, y_train), Flux.params(c))
ERROR: Compiling Tuple{RNNModel, Vector{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}: try/catch is not supported.
Refer to the Zygote documentation for fixes.
https://fluxml.ai/Zygote.jl/dev/limitations.html#Try-catch-statements-1

Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:33
  [2] instrument(ir::IRTools.Inner.IR)
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\reverse.jl:121
  [3] #Primal#23
    @ C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\reverse.jl:205 [inlined]
  [4] Zygote.Adjoint(ir::IRTools.Inner.IR; varargs::Nothing, normalise::Bool)
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\reverse.jl:322
  [5] _generate_pullback_via_decomposition(T::Type)
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\emit.jl:101
  [6] #s3106#1162
    @ C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\interface2.jl:28 [inlined]
  [7] var"#s3106#1162"(::Any, ctx::Any, f::Any, args::Any)
    @ Zygote .\none:0
  [8] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core .\boot.jl:580
  [9] _pullback
    @ C:\Users\jackn\.julia\packages\Flux\BPPNj\src\layers\basic.jl:47 [inlined]
 [10] _pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{RNNModel, typeof(softmax)}, ::Vector{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}})
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\interface2.jl:0
 [11] _pullback
    @ C:\Users\jackn\.julia\packages\Flux\BPPNj\src\layers\basic.jl:47 [inlined]
 [12] _pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{typeof(printdata), RNNModel, typeof(softmax)}, ::Vector{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}})
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\interface2.jl:0
 [13] _pullback
    @ C:\Users\jackn\.julia\packages\Flux\BPPNj\src\layers\basic.jl:49 [inlined]
 [14] _pullback(ctx::Zygote.Context, f::Chain{Tuple{typeof(printdata), RNNModel, typeof(softmax)}}, args::Vector{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}})
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\interface2.jl:0
 [15] _pullback
    @ c:\Users\jackn\Desktop\Julia Learning Code\FluxQuestion.jl:46 [inlined]
 [16] _pullback(::Zygote.Context, ::typeof(loss), ::Vector{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\interface2.jl:0
 [17] _pullback
    @ c:\Users\jackn\Desktop\Julia Learning Code\FluxQuestion.jl:55 [inlined]
 [18] _pullback(::Zygote.Context, ::var"#9#10")
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\interface2.jl:0
 [19] pullback(f::Function, ps::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\interface.jl:352
 [20] gradient(f::Function, args::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote C:\Users\jackn\.julia\packages\Zygote\IoW2g\src\compiler\interface.jl:75
 [21] top-level scope
    @ c:\Users\jackn\Desktop\Julia Learning Code\FluxQuestion.jl:55

I also realized that using a vector of matrices will require me to write a custom reshape function after the first RNN in the chain, in order to convert the 2D output of the first RNN into a 3D input (vector of matrices) for the next one. I presume that this matrix reshaper will also cause trouble for the gradient calculation (much as the DataLoader did). How should I design this function so that it does not interfere with the gradient calculation/training of the model chain?

You should not use vectors of matrices if you care about accelerating your code on the GPU, only 3D tensors. The standard way to leverage GPUs in machine learning is to perform every operation on large batches of data. Your proposal serializes operations that should be run in parallel, which would kill GPU performance (possibly by a factor of 10 to 100, depending on the data).

I suggest that you write your model in such a way that it only uses n-dimensional tensors.


Thank you very much for your response @jonathan-laurent. I completely understand. However, I have read the Flux RNN/recurrent-model documentation very thoroughly, and I have not figured out how to process 3D tensors with recurrent models in Flux. The only way I have found to process the data so far is to use a for loop to iterate over the matrices in a vector of matrices, where each matrix represents one timestep of m sequences.

I would greatly appreciate some guidance or a simple example on how to code a recurrent model such as an RNN in Flux to process a 3D tensor.

To clarify the shape of the tensor once again: it is (n x m x p), where each (n x m) slice is one of p timesteps and has shape (num_features, num_sequences).

Thank you,

Jack

I have never used RNNs in Flux so I cannot really help you here. Isn’t this covered in the documentation? Feeding 3D tensors to RNNs should be the default and not some weird edge case (standard implementations in PyTorch or TensorFlow do not even support passing arrays of matrices).

If this isn’t properly documented, I suggest that you open a separate post and maybe contribute some better documentation after you’ve received some help.


Thank you for your help, I will do that!

It is documented at the bottom of Recurrence · Flux (which Discourse tells me I’ve posted above :stuck_out_tongue:). The only reason we haven’t made it more prominent is that it’s unclear whether this should be the default API going forward. The main concern is how one would distinguish between a 3D array that represents a whole (batched) sequence and one that represents a single timestep of a higher-dimensional batch (e.g. for a conv-RNN).

Using the built-in 3D array support is the easiest way to do this if your model is sequence-to-sequence. What confuses me is that your last example now seems to be sequence → one? Is that intentional, or are you still intending for RNNModel to be a sequence-to-sequence layer?
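
For reference, a rough, untested sketch of that distinction with made-up sizes, assuming the built-in 3D support described above:

using Flux

m = RNN(5 => 5)
x = rand(Float32, 5, 4, 3)         # (features, batch, timesteps)

Flux.reset!(m)
y_seq = m(x)                       # sequence-to-sequence: (features, batch, timesteps)
y_one = y_seq[:, :, end]           # sequence-to-one: keep only the final timestep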