If I have an electricity demand series, it is reasonable to assume that it will be similar to the day before, i.e. 24 steps back with hourly data. There are also additional influences:
using StatsModels, DataFrames, GLM
N = 30 * 24
df = DataFrame(y=rand(N), x=randn(N))
f = @formula(y ~ x + lag(y, 24))
f = apply_schema(f, schema(f, df))
How can I do a multi-step prediction in a programmatic way? The problem is the autoregressive term. I guess I could call predict on a dataframe that has the training data plus one row where y is missing, use the result to substitute it, append a new row, and repeat in a loop. But is there a more efficient way?
The best you can do is 24 rows at a time; without forward substituting and solving for later periods yourself, you have to simulate. (I'm not a time series person, so I would pick simulating over forward substitution myself.)
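A sketch of that "24 rows at a time" idea (hypothetical names, not tested code from the thread): with a pure 24-step lag, each of the next 24 targets depends only on y values that are already filled in, so a whole day can be materialized and predicted in one call before any substitution is needed. Here `applied_f` is assumed to be the formula after `apply_schema`, and `ols_fit` a model fitted on the materialized design matrix:

```julia
using DataFrames, GLM, StatsModels, Tables

function daily_forecast(ols_fit, applied_f, df_train, df_future; lagsteps = 24)
    df_ = copy(df_train)
    n_future = size(df_future, 1)
    for start in 1:lagsteps:n_future
        stop = min(start + lagsteps - 1, n_future)
        append!(df_, df_future[start:stop, :])      # placeholder y values (e.g. NaN)
        # one full materialization per day, not per hour
        _, X = modelcols(applied_f, Tables.columntable(df_))
        y_hat = predict(ols_fit, X[end-(stop-start):end, :])
        df_.y[end-(stop-start):end] = y_hat         # fill in the whole day
    end
    return df_[end-n_future+1:end, :]
end
```

This still re-materializes the whole design matrix each day, but it cuts the number of `modelcols`/`predict` calls by a factor of 24 compared to the hour-by-hour loop.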
I came up with this:
# assumes `using DataFrames, GLM, StatsModels, Tables, ProgressMeter`;
# `ols_fit` is a fitted model in the enclosing scope, and `schema` is the
# formula after `apply_schema`
function simulateModel(schema, df_train, df_test)
    N_train = size(df_train, 1)
    N_test = size(df_test, 1)
    _, X_train = GLM.modelcols(schema, df_train)
    n_cols = mapreduce(GLM.width, +, schema.rhs.terms)
    X = Matrix{Union{Float64,Missing}}(undef, N_train + N_test, n_cols)
    X[1:N_train, :] = X_train
    # map each term to its column range in X
    model_col_idx_in_X = Dict{Int,UnitRange{Int}}()
    j = 0
    for (i, t) in enumerate(schema.rhs.terms)
        model_col_idx_in_X[i] = j+1:j+GLM.width(t)
        j += GLM.width(t)
    end
    df_ = copy(df_train)
    ProgressMeter.@showprogress for i in 1:N_test
        DataFrames.append!(df_, df_test[[i], :])  # here the `y` column is `NaN`
        col_tabl = Tables.columntable(df_)
        for (j, tt) in enumerate(schema.rhs.terms)
            x = GLM.modelcols(tt, col_tabl)
            X[N_train+i, model_col_idx_in_X[j]] = x[end, :]
        end
        y_hat = GLM.predict(ols_fit, X[[N_train+i], :])
        df_.y[N_train+i] = y_hat[end]  # substitute the prediction
    end
    df_pred = df_[end-N_test+1:end, :]
    return df_pred
end
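For context, a hypothetical end-to-end usage of `simulateModel` under the setup from the first post (a sketch, not from the thread): `ols_fit` is fitted on the materialized design matrix, with the first 24 rows dropped because the 24-step lag is undefined there.

```julia
using StatsModels, DataFrames, GLM, Tables, ProgressMeter

N = 30 * 24
df_train = DataFrame(y = rand(N), x = randn(N))
df_test  = DataFrame(y = fill(NaN, 24), x = randn(24))  # y is a placeholder

f = @formula(y ~ x + lag(y, 24))
f = apply_schema(f, schema(f, df_train))

y_col, X_train = modelcols(f, df_train)
keep = 25:N                                # rows where the lag is defined
ols_fit = lm(disallowmissing(X_train[keep, :]), y_col[keep])

df_pred = simulateModel(f, df_train, df_test)   # 24-step-ahead forecast
```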
It is supposed to not allocate too much; however, @profview still shows that I spend a lot of time in StatsModels\src\terms.jl. While the implementation seems very general, it is not very efficient.
Ah, that’s interesting to know! I’d be curious if you can come up with a more efficient way of computing interaction terms like that (the tricky thing is handling multi-column terms correctly).
Is it time being spent there or something due to allocations? At some point I played around a bit with doing that in a non-allocating way but never got very far…
It’s time spent: about 66% of the time is on that line.