If I have an electricity demand series, it is reasonable to assume that it will be similar to the day before, i.e. 24 steps back with hourly data. There are also additional influences:
using StatsModels, DataFrames, GLM
N = 30 * 24
df = DataFrame(y=rand(N), x=randn(N))
f = @formula(y ~ x + lag(y, 24))
f = apply_schema(f, schema(f, df))
How can I do a multi-step prediction in a programmatic way? The problem is the autoregressive term. I guess I could call predict on a dataframe that has the training data plus one row where y is missing, use the result to substitute it, append a new row, and repeat in a loop. But is there a more efficient way?
The best you can do is 24 rows at a time; without forward substituting and solving for later periods yourself, you have to simulate. (I'm not a time series person, so I would pick simulating over forward substitution myself.)
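A sketch of that "24 rows at a time" idea (hypothetical names, not tested code from the thread): with a pure 24-step lag, each of the next 24 targets depends only on y values that are already filled in, so a whole day can be materialized and predicted in one call before any substitution is needed. Here `applied_f` is assumed to be the formula after `apply_schema`, and `ols_fit` a model fitted on the materialized design matrix:

```julia
using DataFrames, GLM, StatsModels, Tables

function daily_forecast(ols_fit, applied_f, df_train, df_future; lagsteps = 24)
    df_ = copy(df_train)
    n_future = size(df_future, 1)
    for start in 1:lagsteps:n_future
        stop = min(start + lagsteps - 1, n_future)
        append!(df_, df_future[start:stop, :])      # placeholder y values (e.g. NaN)
        # one full materialization per day, not per hour
        _, X = modelcols(applied_f, Tables.columntable(df_))
        y_hat = predict(ols_fit, X[end-(stop-start):end, :])
        df_.y[end-(stop-start):end] = y_hat         # fill in the whole day
    end
    return df_[end-n_future+1:end, :]
end
```

This still re-materializes the whole design matrix each day, but it cuts the number of `modelcols`/`predict` calls by a factor of 24 compared to the hour-by-hour loop.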
I came up with this:
# assumes `using DataFrames, GLM, StatsModels, Tables, ProgressMeter`;
# `ols_fit` is a fitted model in the enclosing scope, and `schema` is the
# formula after `apply_schema`
function simulateModel(schema, df_train, df_test)
    N_train = size(df_train, 1)
    N_test = size(df_test, 1)
    _, X_train = GLM.modelcols(schema, df_train)
    n_cols = mapreduce(GLM.width, +, schema.rhs.terms)
    X = Matrix{Union{Float64,Missing}}(undef, N_train + N_test, n_cols)
    X[1:N_train, :] = X_train
    # map each term to its column range in X
    model_col_idx_in_X = Dict{Int,UnitRange{Int}}()
    j = 0
    for (i, t) in enumerate(schema.rhs.terms)
        model_col_idx_in_X[i] = j+1:j+GLM.width(t)
        j += GLM.width(t)
    end
    df_ = copy(df_train)
    ProgressMeter.@showprogress for i in 1:N_test
        DataFrames.append!(df_, df_test[[i], :])  # here the `y` column is `NaN`
        col_tabl = Tables.columntable(df_)
        for (j, tt) in enumerate(schema.rhs.terms)
            x = GLM.modelcols(tt, col_tabl)
            X[N_train+i, model_col_idx_in_X[j]] = x[end, :]
        end
        y_hat = GLM.predict(ols_fit, X[[N_train+i], :])
        df_.y[N_train+i] = y_hat[end]  # substitute the prediction
    end
    df_pred = df_[end-N_test+1:end, :]
    return df_pred
end
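For context, a hypothetical end-to-end usage of `simulateModel` under the setup from the first post (a sketch, not from the thread): `ols_fit` is fitted on the materialized design matrix, with the first 24 rows dropped because the 24-step lag is undefined there.

```julia
using StatsModels, DataFrames, GLM, Tables, ProgressMeter

N = 30 * 24
df_train = DataFrame(y = rand(N), x = randn(N))
df_test  = DataFrame(y = fill(NaN, 24), x = randn(24))  # y is a placeholder

f = @formula(y ~ x + lag(y, 24))
f = apply_schema(f, schema(f, df_train))

y_col, X_train = modelcols(f, df_train)
keep = 25:N                                # rows where the lag is defined
ols_fit = lm(disallowmissing(X_train[keep, :]), y_col[keep])

df_pred = simulateModel(f, df_train, df_test)   # 24-step-ahead forecast
```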
It is supposed to not allocate too much; however, @profview still shows that I spend a lot of time in StatsModels\src\terms.jl. While the implementation seems very general, it is not very efficient.
Ah, that’s interesting to know! I’d be curious if you can come up with a more efficient way of computing interaction terms like that (the tricky thing is handling multi-column terms correctly).
Is it time being spent there or something due to allocations? At some point I played around a bit with doing that in a non-allocating way but never got very far…
It’s time spent: about 66% of the time is on that line.