If I have some electricity demand, it is reasonable to assume that it will be similar to the day before. So with hourly data, 24 steps before. Additional influences are also present:

```
using StatsModels, DataFrames, GLM
N = 30 * 24
df = DataFrame(y=rand(N), x=randn(N))
f = @formula(y ~ x + lag(y, 24))
f = apply_schema(f, schema(f, df))
```

How can i do a multi-step prediction in a programmatic way? The problem is with the autoregressive term.

I guess, I could use `predict`

on a dataframe that has the training data plus one row where `y`

is missing, and use the result to substitute it. Append a new row, and do this in a loop. But is there a more efficient way?

The best you can do is 24 rows at a time without forward substituting and solving for later periods yourself, you have to simulate. (I’m not a time series person, so I would pick the simulating over forward substitution myself.)

I came up with this:

```
function simulateModel(schema, df_train, df_test)
N_train =size(df_train,1)
N_test = size(df_test,1)
_, X_train = GLM.modelcols(schema, df_train)
n_cols = mapreduce(GLM.width, +, schema.rhs.terms)
X = Matrix{Union{Float64,Missing}}(undef, N_train + N_test, n_cols)
X[1:N_train,:] = X_train
model_col_idx_in_X = Dict()
j = 0
for (i, t) in enumerate(schema.rhs.terms)
model_col_idx_in_X[i] = j+1:j+GLM.width(t)
j += GLM.width(t)
end
df_ = copy(df_train)
ProgressMeter.@showprogress for i in 1:N_test
DataFrames.append!(df_, df_test[[i],:]) # here the `y` column is `NaN`
col_tabl = Tables.columntable(df_)
for (j, tt) in enumerate(schema.rhs.terms)
x = GLM.modelcols(tt, col_tabl)
X[N_train+i, model_col_idx_in_X[j]] = x[end,:]
end
y_hat = GLM.predict(ols_fit, X[[N_train+i],:])
df_.y[N_train+i] = y_hat[end]
end
df_pred = df_[end-N_test+1:1:end,:]
return df_pred
end
```

It is supposed to not allocate too much, however, `@profview`

still shows that I spend a lot of time here `StatsModels\src\terms.jl`

:

While it seems it is implemented very general, but it is not very efficient.

1 Like

Ah, that’s interesting to know! I’d be curious if you can come up with a more efficient way of computing interaction terms like that (the tricky thing is handling multi-column terms correctly).

Is it *time* being spent there or something due to allocations? At some point I played around a bit with doing that in a non-allocating way but never got very far…

It’s time spent. So 66% of the time on this line.

1 Like