Alright so I got there eventually and I’ll explain what I did that got good results for me. Please chime in if I misrepresent anything. I’ve gone into a good bit of detail so others can follow the logic, but the logic may not be correct!
My first problem was using a GRU instead of an LSTM. From what I observed, the GRU just couldn't capture the longer-term patterns, so maybe try an LSTM first. Trying to force the GRU took me on a circuitous but educational path.
Packaging the Data
My data was ~2 months of observations at 30-minute intervals, with 1 column being DateTime, the next 4 being weather (temp, cloud cover, windspeed, winddir), and the last 4 being energy (solar production, wind production, day ahead price, and imbalance price). Some things that helped me:
- Normalize the data. I went with standardizing at first because I wanted the possibility of predicting extreme values for the price data, but for otherwise good models this produced negative energy production, which is nonsensical. Since I wanted to keep this a multi-target LSTM regression task, every target goes through the same final activation, so every target had to be scaled the same way. Once I normalized everything, the results were good enough that I didn't explore whether there's an elegant way to do this and capture more nuance. (There's a small end-to-end sketch of the scaling and batching right after this list.)
- Timestep size and batch size. The Flux docs for recurrence do make this clear, but it took me a minute to really set it up correctly.
- Let X be a vector of your input data, where each element of X is the collection of features at one time step. Each sample is one column with as many rows as there are input features. The reason you may want many columns is so that your gradients are taken over a larger batch and the optimizer makes more general improvements rather than looking at a single time. The order of the columns makes no difference as long as you are consistent across time steps: X[1][:, 1] should be the observation immediately preceding X[2][:, 1], and in general X[t][:, k] should be the observation immediately preceding X[t+1][:, k]. So, looking at an abridged set:
Time | Wind | Day Ahead Price (DAP)
1    | W1   | D1
2    | W2   | D2
3    | W3   | D3
4    | W4   | D4
5    | W5   | D5
6    | W6   | D6
7    | W7   | D7
batched data would look like this, where each element in batched_data is a Matrix of size features × batchsize. The columns are in this order just to show that the order of the columns doesn't matter as long as it is consistent across each Matrix in batched_data.
batched_data = [
[
W1 W3 W4 W2
D1 D3 D4 D2
],
[
W2 W4 W5 W3
D2 D4 D5 D3
],
[
W3 W5 W6 W4
D3 D5 D6 D4
],
]
- Y is the vector representing what you want to predict. Y should have the same overall length as X, and the number of columns in each element should match, but the number of rows is however many targets you want to predict. If you want to make predictions for every feature, then you just take the next time step of X. For me, I only wanted to predict the energy data, or DAP in this abridged example. So Y looks like:
batched_targets = [
[
D2 D4 D5 D3
],
[
D3 D5 D6 D4
],
[
D4 D6 D7 D5
],
]
Finally, I zipped these into a Vector of NamedTuples for my own convenience:
train_data = [(past=x, next=y) for (x,y) in zip(batched_data, batched_targets)]
- One-step ahead. Maybe this is obvious, but it was the crux of my original questions. The most sensible thing to do is to predict one step, then use that prediction plus any other information available for that time to predict the next step, and so on. I got hung up on how to provide the model with all the information I have available right now. For this problem I wanted to make 24 hours of predictions beginning at 8 AM, with the assumption that I would have all weather and energy data up to 8 AM plus weather forecasts for the next 24 hours. The short version of the story is that this was a bad time. (The forecast function at the end shows how the one-step-ahead loop works in practice.)
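Since the scaling and batching above tripped me up the most, here is a minimal end-to-end sketch of that packaging. It uses toy random data standing in for the Wind/DAP columns, and the min-max scaling, sequence length, and batch size here are illustrative assumptions rather than my exact setup:

# toy data: 2 feature rows (Wind, DAP) × 100 time steps, standing in for the real series
raw = rand(2, 100)

# min-max scale each feature row into [0, 1] so a sigmoid output layer can reproduce it
lo = minimum(raw, dims=2)
hi = maximum(raw, dims=2)
scaled = (raw .- lo) ./ (hi .- lo)
# to undo later: raw ≈ scaled .* (hi .- lo) .+ lo

# batching: one features × batchsize Matrix per time step,
# where column k of step t immediately precedes column k of step t+1
seq_len   = 3      # time steps per sequence
batchsize = 4      # columns per Matrix
T = size(scaled, 2)
starts = rand(1:T-seq_len, batchsize)   # starting time index for each column (order doesn't matter)

batched_data    = [scaled[:, starts .+ (t - 1)] for t in 1:seq_len]   # all features at step t
batched_targets = [scaled[2:2, starts .+ t]     for t in 1:seq_len]   # DAP (row 2) one step ahead

train_data = [(past=x, next=y) for (x, y) in zip(batched_data, batched_targets)]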
Building and Training the Model
Model
This is a model that worked for me.
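The dimension variables need to exist before building it; the values below are placeholders (not necessarily what I actually used), so set them to your own feature and target counts:

input_dims = 9    # number of feature rows in each input Matrix
hidden_dim = 32   # size of the LSTM hidden state, picked by trial and error
output_dim = 4    # number of targets (the 4 energy columns for me)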
using Flux

model = Chain(
    LSTM_in = LSTM(input_dims => hidden_dim),
    LSTM_hidden = LSTM(hidden_dim => hidden_dim),
    Dense_hidden1 = Dense(hidden_dim => hidden_dim),
    Dense_out = Dense(hidden_dim => output_dim, σ),
)

# Initialize the loss logs whenever the model is (re)built.
Flux.reset!(model)
train_log = [loss(model, first(Train))]
Flux.reset!(model)
test_log = [loss(model, first(Test))]
Loss
I stored my data as a Vector of NamedTuples so I could access it easily. I used MSE for my loss.
loss(m, X, Y) = Flux.mse(m(X), Y)
loss(m::Chain, traindata::NamedTuple) = loss(m, traindata.past, traindata.next)
Optimizer
Used Adam to optimize:
opt_state = Flux.setup(Adam(), model)
train! Function
function train!(opt_state, model, loss, traindata; train_log=[])
    L, ∇ = Flux.withgradient(loss, model, traindata)
    # Detect a loss of Inf or NaN: print a warning and skip the update
    if !isfinite(L)
        @warn "Loss value, \"$L\", is invalid."
    else
        Flux.update!(opt_state, model, ∇[1])
    end
    push!(train_log, L)
end
Training Loop
using Statistics: mean   # for averaging the per-epoch losses

for e in 1:train_epochs
    # gather the losses so I can take an average for each epoch; Train and Test are not
    # the same length, so per-epoch averages are what I compare between the two
    temp_log = []
    # recurrent networks need to have their internal state reset
    Flux.reset!(model)
    # before training, I condition the recurrent model by just calling it on the first batch in my train set
    model(first(Train).past)
    # now I train using my `train!` function for the rest of the data
    for T in Train[2:end]
        train!(opt_state, model, loss, T, train_log=temp_log)
    end
    # get the mean of my losses for this epoch
    push!(train_log, mean(temp_log))
    # reset the model, condition it on the first Test batch, then average the loss over the rest of the Test set and log it
    Flux.reset!(model)
    model(first(Test).past)
    push!(test_log, mean(loss(model, T) for T in Test[2:end]))
end
using Plots

# plot after each run of epochs to watch for progress and overfitting
plot(
    [train_log test_log],
    title="Loss Logging",
    xlabel="Total Training Epochs", ylabel="Loss (MSE)",
    label=["Train" "Test"],
    xticks=2 .^ (0:8),
    xscale=:log2)
Making Predictions
Below is the function I made to use the LSTM as intended. The point was to condition the model on all data up to the time immediately before the first forecast (t+1) and then:
- make the first forecast, at t+1. Store the result.
- combine that with the available weather data to make a forecast for t+2. Store the result.
- repeat until forecasting period is complete.
- handle all data transformations and outputs.
function forecast(data, firstforecast, lastforecast)
    # convert the forecast timestamps into row indices (reusing the argument names)
    firstforecast = findlast(row -> row ≤ firstforecast, data.timestamp_utc)
    lastforecast = findlast(row -> row < lastforecast, data.timestamp_utc)
    xfrm = standardize_df(data, transforms)
    M = xfrm |> Matrix |> transpose
    # condition the model on every time step before the first forecast
    Flux.reset!(model)
    for i in 1:firstforecast-1
        model(M[:, i])
    end
    # the first forecast uses only observed data
    results = [model(M[:, firstforecast])]
    # every later forecast feeds the known inputs for that step plus the previous prediction
    for j in firstforecast+1:lastforecast
        new_result = model([M[1:5, j]; last(results)])
        push!(results, new_result)
    end
    out_df = DataFrame(hcat(results...) |> transpose, energy_cols)
    out_df.timestamp_utc = data[firstforecast:lastforecast, :timestamp_utc]
    select!(out_df, :timestamp_utc, :)
    out_df = reconstruct_df(out_df, transforms)
end
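Calling it looks roughly like this; the timestamps here are made up for illustration, and data is the full history DataFrame described above with its :timestamp_utc column:

using Dates

firstforecast = DateTime(2023, 5, 1, 8, 0)   # 8 AM on some day near the end of the data
lastforecast  = firstforecast + Day(1)       # 24 hours of 30-minute forecasts
predictions = forecast(data, firstforecast, lastforecast)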
That’s it for my lessons learned for now. Would this be worth making a full tutorial for the model zoo?