What is the equivalent Julia code for statsmodels.api.OLS

I want to write the equivalent Julia code for the following Python code:

import statsmodels.api as sm
reg = sm.OLS(Y,sm.add_constant(X))
res = reg.fit()
res.params[1]/res.bse[1]

The copilot asserts that the equivalent Julia code is:

using GLM
reg = glm(X, Y, Normal(), IdentityLink())
coef(reg)[2]

However, the result of these two is not equal. Example:

# Python
import numpy as np
import statsmodels.api as sm
X = np.array([[1, 2, 3],[4, 5, 6], [7, 8, 9]])
y = np.array([1, 0, 2])
reg = sm.OLS(y,sm.add_constant(X))
res = reg.fit()
res.params[1]/res.bse[1]
# -3.5606350123330477e-16
# Julia
using GLM
X = [
1 2 3
4 5 6
7 8 9
]
y = [1, 0, 2]
reg = glm(X, y, Normal(), IdentityLink())
coef(reg)[2]
# 0.0

Note that I am using statsmodels v0.13.5 in this case.

Your system is singular, the coefficients are not estimable.

>>> res.summary()
...
...

[2] The input rank is higher than the number of observations.
[3] The smallest eigenvalue is 6.54e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

And, also, you are adding a constant to X in the python regression. You should do that in julia too. In your example it can by done by X = hcat([1,1,1], X). If you use the formula interface (with DataFrames or another Table) rather than the matrix interface, it’s simpler.

using DataFrames
df = DataFrame(X, :auto)
df.y = y
fit(LinearModel, @formula(y ~ x1 + x2 + x3 + 1), df)
2 Likes

Just to be complete. Python uses a pseudo inverse for degenerate systems. There is really no canonical way to solve singular problems. Julia typically does it by setting superfluous coefficients to zero, with standard error Inf or NaN. You can mimic python’s behaviour as follows. However, with degenerate systems the interpretation of coefficients and standard errors will anyway be wishy-washy.

using LinearAlgebra

Xc = hcat([1,1,1], X)   # add a constant
cvmat = pinv(Xc' * Xc)  # compute covariance matrix as pseudoinverse
coefs = cvmat * Xc' * y   # perform the OLS
serr = sqrt.(diag(cvmat) .* 3/2)   # find standard errors, small sample adjust by N/(N-dof)
coefs[2]/serr[2]   # compute t-score
1 Like

Thank you so much for your replies. I don’t understand why you’ve used the constant 3/2. Anyway, the result is still different:

Am I missing something here?

Yes. You get something very close to zero. Something like 0.00000000000000002. Both python’s and julia’s numbers are essentially equal to zero. If you look at the other coefficients and standard errors, they are essentially equal.

Btw, the multiplication by 3/2 is a small sample correction of the standard errors. I.e. N/(N-1) where N is the number of observations. It does normally not have any large impact since the number of observations N is typically in the hundreds (or thousands, or millions). But in your example N==3.

Here’s an interesting thing with the python script:

import numpy as np
import statsmodels.api as sm
X = np.array([[1, 2, 3],[4, 5, 6], [7, 8, 9]])
y = np.array([1, 0, 2])
reg = sm.OLS(y,sm.add_constant(X))
res = reg.fit()
res.params[1]/res.bse[1]   # this is some small number
res = reg.fit(method='qr')  # this one fails 
res = reg.fit()   # do it again with default method 'pinv'
res.params[1]/res.bse[1]
# returns nan (because res.bse[1] is nan as it should be)

But the signs are different. The output in the Julia is a positive value while the Python’s output is a negative value. There should be something different for sure. So, I don’t think the proposed Julia code is equivalent to the Python implementation.

Let me check this out. If the coefficients are similar, then there should be the equivalent Julia code for the res.params[1]/res.bse[1]. Otherwise, the problem lies in fitting the OLS in Julia implementation in the same way as sm.OLS does. Actually, this res.params[1]/res.bse[1] is the important part of the code to me and the final result that should be used in my project.

I didn’t get the point. What is the conclusion?

The main point is that OLS on a singular system is not well defined. As I have told you numerous times. The coefficients are not estimable. Entirely arbitrary choices must be made to get some coefficients. I have described julia’s choices, and python’s choices, and described how you can mimic python’s choices in julia. The coefficients you obtain are relative to these choices, and typically are not meaningful without analyzing the nature of the singularity (or multicollinearity which it is often called in this context). Though the model as such can be used for prediction.

And, of course, as always with floating point operations, when comparing results it must be done up to a precision, there are no meaningful exact comparisons of floating point numbers, except in some very simple cases. The problem at hand, computing a pseudo inverse, involves inverting singular values which are not too small, i.e. one must choose what “too small” means, this will have an impact on the pseudoinverse, and consequently on the computed coefficients. Julia and python may have slightly different parameters for computing the pseudoinverse, and even slightly different ways to compute the singular value decomposition. I don’t know.

Also, in the python-mimicry I multiplied the covariance matrix with Xc' * y to find the coefficients. Because I needed the covariance matrix anyway to find the standard errors. It is more common to solve the system Xc' * Xc * x = Xc' * y for x, which is mathematically the same, but typically uses a different method, e.g. a pivoted Cholesky decomposition or qr or whatnot. This will yield slightly different estimates, i.e. in the 17th digit or so. What exactly python does, I don’t know. Fortunately, it doesn’t matter, because of the following.

If your program relies on exact floating point values which varies with more or less arbitrary choices made by some package, you are doing it the wrong way.

Now, what you are computing is known as “t-score” or “t-value”, how many standard errors your estimate is away from zero. This is a sensible thing to compute, and is used to judge whether the results are statistically significant. When the t-score is small (how small depends on your application), the coefficient is indistinguishable from zero with the data you have. The difference between 0.001 and -0.001, with a standard error around 1 is simply completely irrelevant.

4 Likes

Ok, this is going to be a long reply:
The aim is not to interpret the t-score! The value itself matters. I want to translate a Python code that detects bubble prices in historical data of an asset (like Microsoft, AKA MSFT). The code I was asking for is a part of this procedure. I will provide the full Python code and the translated Julia code in the following section to show why the value of res.params[1]/res.bse[1] matters not the interpretation of it. I run both codes on the same data to compare the results. I compared the X and y in both scripts to make sure the same data is being fed to the OLS model. I did it by printing the values of X and y and raising an arbitrary error after the print statements (to stop the run and compare the results by observing them).

Pyhton and Julia implementation in a glance

Pyhton implementation
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
prices = pd.read_csv("../data/GDP.csv", header=0).GDP
r0 = int(len(prices)*0.1)
#specify lags for the augmented Dickey-Fuller (ADF) test
adf_lags = 3
#critical value of the right-tailed ADF-test (95%) from Phillips et al. (2015)
crit = 1.49
#transforming data
log_prices = np.array(np.log(prices))
delta_log_prices = log_prices[1:] - log_prices[:-1]
n = len(delta_log_prices)
BSADF = np.array([])
#calculating ADF stats
for r2 in range(r0,n):
    ADFS = np.array([])
    for r1 in range(0,r2-r0+1):
        X0 = log_prices[r1:r2+1]
        X = pd.DataFrame()
        X[0] = X0
        for j in range(1,adf_lags+1):
            X[j] = np.append(np.zeros(j),delta_log_prices[r1:r2+1-j])
        X = np.array(X)
        Y = delta_log_prices[r1:r2+1]
        reg = sm.OLS(Y,sm.add_constant(X))
        res = reg.fit()
        ADFS = np.append(ADFS, res.params[1]/res.bse[1])
        
    BSADF = np.append(BSADF, max(ADFS))
#visualising the results
plt.rc('xtick',labelsize = 8)
plt.plot(prices.index[r0+1:],BSADF)
plt.plot(prices.index[r0+1:],np.ones(len(BSADF))*crit)
#printing dates when bubbles were detected
print(prices.index[r0+1:][BSADF > crit])
Julia implementation
using LinearAlgebra
using Plots

function bubbledetector(lnprices::AbstractVector, adf_lags::Integer, crit::AbstractFloat)
  Δlnprices = diff(lnprices)
  n         = length(Δlnprices)
  r0        = round(Int, length(lnprices)*0.1)
  BSADF     = similar(lnprices, n-1-r0+1)
  for (idxr2, r2) ∈ enumerate(r0:n-1)
    ADFS      = similar(lnprices, r2-r0+1)
    for (idxr1, r1) ∈ enumerate(0:r2-r0)
      X₀      = lnprices[r1+1:r2]
      X       = similar(lnprices, length(X₀), adf_lags+1)
      X[:, 1] = X₀
      for j in 1:adf_lags
          X[:, j+1] = vcat(zeros(j), Δlnprices[r1+1:r2-j])
      end
      Y = Δlnprices[r1+1:r2]
      Xc = hcat(ones(size(X, 1)), X)   # add a constant
      cvmat = pinv(Xc' * Xc)  # compute covariance matrix as pseudoinverse
      coefs = cvmat * Xc' * Y   # perform the OLS
      serr = sqrt.(diag(cvmat) .* length(Y)/(length(Y)-1))   # find standard errors, small sample adjust by N/(N-dof)
      ADFS[idxr1] = coefs[2]/serr[2]
    end
    BSADF[idxr2] = maximum(ADFS)
  end
  return BSADF, findall(BSADF .> crit)
end

function plotbubles(BSADF::AbstractVector, idxs::AbstractVector, crit::AbstractFloat)
  plot(BSADF, label="BSADF")
  plot!(ones(length(BSADF))*crit, color=:red, linestyle=:dash, label="Critical Value")
  plot!(idxs, BSADF[idxs], seriestype=:scatter, color=:red, label="Bubbles", alpha=0.5)
end

prices = identity.(readdlm("./data/GDP.csv", ',', header=true)[1][:, 2])
BSADF, idxs = bubbledetector(log.(prices), 3, 0.8)
plotbubles(BSADF, idxs, 0.8)
Python's and Julia's output in a glance


The thing is, the values/trend isn’t even similar to the Python’s output! Not even close! Meaning the values are extremely different! As I said earlier, I checked the inputs (X and y) in both scripts to make sure the same data is being fed to the OLS models. In conclusion, the difference is because of the OLS implementation.

I brought up all these details to explain why I need the same output as res.params[1]/res.bse[1].

So, there is no hope to get the same result necessarily due to different implementations. Thank you.

There is no hope of getting meaningful results with a singular system such as in your example. It seems your real system is not degenerate, and then you can of course get the same results, up to the hardware precision. You can use the fit(LinearModel, ...) as I suggested, or glm(...) as you tried, but with the intercept added, i.e. glm(Xc, y, Normal(), IdentityLink()).

I thought your main issue was that you wanted results down to the same bit when the design matrix is singular, as in your example. That’s a tall order.

Now, anyway, I see I made an error in my julia program. Python does not seem to do a small sample correction of the standard errors, so that one can be removed. By a coincidence in the example that factor happened to be equal to the one I forgot, to multiply by the residual variance, so the standard errors can be computed with something like this:

resid = norm(Xc * coefs .- y)
serr = sqrt.(diag(cvmat)) .* resid

And, actually, if you only use a single element, you can compute it as serr2 = sqrt(cvmat[2,2]) * resid.

1 Like

Yes, I got your point.

You’re right. I should have brought up the details in the first place. I am sorry for any confusion or inconvenience that may caused.

Considering reg = glm(Xc, Y, Normal(), IdentityLink()), I don’t know what should be the next line to compute the equivalent res.params[1]/res.bse[1] in this case.

This is great. Although there are some negligible differences as shown in the picture below, the output’s trend is extremely similar to Python’s output. The only thing is the range of output. It seems like in Python, the values are multiplied by a constant value > 1 (it seems that the final BSADF values in Julia should be multiplied by 10):


The following Julia code is used for the plot above
using LinearAlgebra
using Plots

function bubbledetector(lnprices::AbstractVector, adf_lags::Integer, crit::AbstractFloat)
  Δlnprices = diff(lnprices)
  n         = length(Δlnprices)
  r0        = round(Int, length(lnprices)*0.1)
  BSADF     = similar(lnprices, n-1-r0+1)
  for (idxr2, r2) ∈ enumerate(r0:n-1)
    ADFS      = similar(lnprices, r2-r0+1)
    for (idxr1, r1) ∈ enumerate(0:r2-r0)
      X₀      = lnprices[r1+1:r2]
      X       = similar(lnprices, length(X₀), adf_lags+1)
      X[:, 1] = X₀
      for j in 1:adf_lags
          X[:, j+1] = vcat(zeros(j), Δlnprices[r1+1:r2-j])
      end
      Y = Δlnprices[r1+1:r2]
      Xc = hcat(ones(size(X, 1)), X)   # add a constant
      cvmat = pinv(Xc' * Xc)  # compute covariance matrix as pseudoinverse
      coefs = cvmat * Xc' * Y   # perform the OLS
      resid = norm(Xc * coefs .- Y)
      serr = sqrt(cvmat[2,2]) * resid
      ADFS[idxr1] = coefs[2]/serr
    end
    BSADF[idxr2] = maximum(ADFS)
  end
  return BSADF, findall(BSADF .> crit)
end

function plotbubles(BSADF::AbstractVector, idxs::AbstractVector, crit::AbstractFloat)
  plot(BSADF, label="BSADF")
  plot!(ones(length(BSADF))*crit, color=:red, linestyle=:dash, label="Critical Value")
  plot!(legend=:outerright)
end

prices = identity.(readdlm("./data/GDP.csv", ',', header=true)[1][:, 2])
BSADF, idxs = bubbledetector(log.(prices), 3, 1.49)
plotbubles(BSADF, idxs, 1.49)

In summary, I used the following lines as the equivalent Julia code for sm.OLS:

Xc = hcat(ones(size(X, 1)), X)   # add a constant
cvmat = pinv(Xc' * Xc)  # compute covariance matrix as pseudoinverse
coefs = cvmat * Xc' * Y   # perform the OLS
resid = norm(Xc * coefs .- Y)
serr = sqrt(cvmat[2,2]) * resid
coefs[2]/serr

It’s here: Home · GLM

coef(reg)[2] / stderror(reg)[2]

The only thing is the range of output. It seems like in Python, the values are multiplied by a constant value > 1 (it seems that the final BSADF values in Julia should be multiplied by 10):

This I do not understand. However, I took the OLS equations from the top of my head, and they work fine in a toy example:

using LinearAlgebra, GLM
X = [
    1 3 4
    2 8 1
    4 5 2
    6 7 2
    9 4 3
]

y = [5,4,6,2,3]
Xc = hcat(ones(size(X,1)), X)

# GLM:

regglm = glm(Xc, y, Normal(), IdentityLink())

# easy fit

reglm = fit(LinearModel, Xc, y)

#manual
cvmat = inv(Xc' * Xc)
cf = cvmat * Xc' * y
resid = norm(Xc * cf .- y)
se = sqrt.(diag(cvmat)) * resid


println("GLM:\n", regglm)
println("\n\nLM:\n", reglm)
println("\n\nMAN:")
[cf se]

and the same in python:

import numpy as np
import statsmodels.api as sm

X = np.array([[1, 3, 4],[2,8,1], [4,5,2], [6,7,2], [9,4,3]])
y = np.array([5,4,6,2,3])
reg = sm.OLS(y,sm.add_constant(X))
res = reg.fit()
res.params
res.bse

They are all equal.

2 Likes

Thank you so much. This is exactly what I needed and the output exactly mimics the sm.OLS’s output! Thank you so much for your help.