When you create `data_train`, you create it as a tuple `(X_batched, Y_batched)`. Hence, when you do

```
for d in data_train
# ...
end
```

The loop runs twice: once with `d = X_batched` and once with `d = Y_batched`. This is not what you are trying to achieve.

Instead, to iterate over each of the 10 batches, you could use:

```
for d in zip(data_train...)
# ...
end
```
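To see the difference without running any model, here is a small sketch in plain Julia. The arrays are made-up stand-ins for `X_batched` and `Y_batched` (10 batches each, with arbitrary toy sizes), not your actual data.

```julia
# Made-up stand-ins for X_batched and Y_batched: 10 batches each.
X_batched = [fill(Float32(k), 4, 3) for k in 1:10]
Y_batched = [fill(Float32(k), 1, 3) for k in 1:10]
data_train = (X_batched, Y_batched)

# Iterating the tuple visits its two elements, not the batches.
n_tuple = length(collect(data_train))

# Splatting into zip pairs the k-th X batch with the k-th Y batch.
pairs_xy = collect(zip(data_train...))
n_zip = length(pairs_xy)

x1, y1 = pairs_xy[1]
println((n_tuple, n_zip))      # iterations in each variant
println((size(x1), size(y1)))  # one (x, y) pair per iteration
```

With `zip(data_train...)`, each `d` is an `(x, y)` pair, so `d...` can be splatted directly into a loss that takes `X` and `Y`.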

Now, I believe there are a few other things that are problematic in your MWE.

- In your mini-batch training with 10 batches, you have a `Y_batched[i]` that is `10 × 600`. This means you expect 10 outputs from your model, but it currently has only 1. I believe your batching approach is not what you are trying to achieve.
- In your `loss`, you call `Flux.reset!()` at every step of the sequence. You should reset before/after the sequence, but not during it.
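To illustrate why the reset placement matters, here is a toy sketch in plain Julia. `ToyCell` and its `reset!` are hypothetical and only mimic the idea of a hidden state; they are not Flux's implementation.

```julia
# Hypothetical toy cell, NOT Flux's LSTM: the state is a decayed running sum.
mutable struct ToyCell
    h::Float64
end
(c::ToyCell)(x) = (c.h = 0.5 * c.h + x)   # update the state and return it
reset!(c::ToyCell) = (c.h = 0.0)

xs = [1.0, 1.0, 1.0]

# Reset once before the sequence: the state carries across steps.
cell = ToyCell(0.0)
reset!(cell)
carried = [cell(x) for x in xs]

# Reset at every step: each output only ever sees the current input.
cell = ToyCell(0.0)
forgotten = [(reset!(cell); cell(x)) for x in xs]

println(carried)
println(forgotten)
```

In the second variant the cell never accumulates any history, which is what happens to a recurrent model that is reset inside the sequence loop.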

Here is a modification of your MWE to make it work.

```
using Flux
using CUDA, Zygote
X_train = rand(Float32, 60000, 4)
Y_train = rand(Float32, 60000, 1)
seq_len = 600
batch_size = 10
no_features = 4
N = size(X_train,1)
num_batches = Int(floor(N/(batch_size*seq_len)))
# Create batches of a time series `X` by splitting the series into
# sequences of length `s`. Each new sequence is shifted by `r` steps.
# When s == r, the series is split into non-overlapping batches.
function batch_timeseries(X, s::Int, r::Int)
    @assert r > 0 "r must be positive"
    # If X is passed in format T×1, reshape it
    if isa(X, AbstractVector)
        X = permutedims(X)
    end
    T = size(X, 2)
    @assert s ≤ T "s cannot be longer than the total series"
    # Ensure uniform sequence lengths by dropping the first observations until
    # the total sequence length matches a multiple of the batchsize
    X = X[:, ((T - s) % r)+1:end]
    [X[:, t:r:end-s+t] for t ∈ 1:s] # Output
end
# mini batching
X_batched = [batch_timeseries(permutedims(X_train[(1 + (k - 1) * seq_len * batch_size):(k * seq_len * batch_size), :]), seq_len, seq_len) for k ∈ 1:num_batches]
Y_batched = [batch_timeseries(permutedims(Y_train[(1 + (k - 1) * seq_len * batch_size):(k * seq_len * batch_size), :]), seq_len, seq_len) for k ∈ 1:num_batches]
gpu_or_cpu = cpu
if gpu_or_cpu == gpu
    CUDA.allowscalar(false)
end
# convert to cpu or gpu (apply element wise)
X_batched = gpu_or_cpu.(X_batched)
Y_batched = gpu_or_cpu.(Y_batched)
data_train = (X_batched, Y_batched)
# select optimizer
opt = ADAM(0.001, (0.9, 0.999))
# definition of the loss function
function loss(m, X, Y)
    [m(X[1])] # Warm-up the model on the first observation
    sum(sum(abs2, m(xi) - yi) for (xi, yi) in zip(X[2:end], Y[2:end]))
end
# ini of the model
model = Chain(LSTM(4, 70), LSTM(70, 70), LSTM(70, 70), Dense(70, 1, relu)) |> gpu_or_cpu
ps = Flux.params(model)
Flux.reset!(model)
# Use Zygote.pullback to access the training loss and the gradient,
# iterating over (X, Y) pairs, one per mini-batch
for d in zip(data_train...)
    Flux.reset!(model) # Reset the model before each minibatch
    train_loss, back = Zygote.pullback(() -> loss(model, d...), ps)
    gs = back(one(train_loss))
    Flux.update!(opt, ps, gs)
end
```
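If the output layout of `batch_timeseries` is unclear, the following sketch checks it on toy sizes without Flux. The helper is repeated from the MWE above so the snippet runs on its own; the sizes (4 features, 5 sequences of length 6) are assumptions chosen only to keep the example small.

```julia
# Same batching helper as in the MWE, repeated so the snippet is self-contained.
function batch_timeseries(X, s::Int, r::Int)
    @assert r > 0 "r must be positive"
    if isa(X, AbstractVector)
        X = permutedims(X)
    end
    T = size(X, 2)
    @assert s <= T "s cannot be longer than the total series"
    X = X[:, ((T - s) % r)+1:end]
    [X[:, t:r:end-s+t] for t in 1:s]
end

# Toy sizes (assumed for illustration only).
no_features, seq_len, batch_size = 4, 6, 5
X = rand(Float32, no_features, seq_len * batch_size)
seq = batch_timeseries(X, seq_len, seq_len)

println(length(seq))   # number of time steps in the sequence
println(size(seq[1]))  # (features, sequences in the mini-batch)

# Column j at step t is the t-th observation of the j-th sequence.
@assert seq[2][:, 3] == X[:, 2 + 2 * seq_len]
```

Each mini-batch is thus a vector with one `features × batch_size` matrix per time step, which is the layout the `loss` above iterates over.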