# What is the equivalent Julia code for statsmodels.api.OLS

I want to write the equivalent Julia code for the following Python code:

``````import statsmodels.api as sm
res = reg.fit()
res.params[1]/res.bse[1]
``````

The copilot asserts that the equivalent Julia code is:

``````using GLM
reg = glm(X, Y, Normal(), IdentityLink())
coef(reg)[2]
``````

However, the result of these two is not equal. Example:

``````# Python
import numpy as np
import statsmodels.api as sm
X = np.array([[1, 2, 3],[4, 5, 6], [7, 8, 9]])
y = np.array([1, 0, 2])
res = reg.fit()
res.params[1]/res.bse[1]
# -3.5606350123330477e-16
``````
``````# Julia
using GLM
X = [
1 2 3
4 5 6
7 8 9
]
y = [1, 0, 2]
reg = glm(X, y, Normal(), IdentityLink())
coef(reg)[2]
# 0.0
``````

Note that I am using `statsmodels` v0.13.5 in this case.

Your system is singular, the coefficients are not estimable.

``````>>> res.summary()
...
...

[2] The input rank is higher than the number of observations.
[3] The smallest eigenvalue is 6.54e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
``````

And, also, you are adding a constant to `X` in the python regression. You should do that in julia too. In your example it can by done by `X = hcat([1,1,1], X)`. If you use the formula interface (with `DataFrames` or another `Table`) rather than the matrix interface, itâs simpler.

``````using DataFrames
df = DataFrame(X, :auto)
df.y = y
fit(LinearModel, @formula(y ~ x1 + x2 + x3 + 1), df)
``````
2 Likes

Just to be complete. Python uses a pseudo inverse for degenerate systems. There is really no canonical way to solve singular problems. Julia typically does it by setting superfluous coefficients to zero, with standard error `Inf` or `NaN`. You can mimic pythonâs behaviour as follows. However, with degenerate systems the interpretation of coefficients and standard errors will anyway be wishy-washy.

``````using LinearAlgebra

Xc = hcat([1,1,1], X)   # add a constant
cvmat = pinv(Xc' * Xc)  # compute covariance matrix as pseudoinverse
coefs = cvmat * Xc' * y   # perform the OLS
serr = sqrt.(diag(cvmat) .* 3/2)   # find standard errors, small sample adjust by N/(N-dof)
coefs[2]/serr[2]   # compute t-score
``````
1 Like

Thank you so much for your replies. I donât understand why youâve used the constant `3/2`. Anyway, the result is still different:

Am I missing something here?

Yes. You get something very close to zero. Something like 0.00000000000000002. Both pythonâs and juliaâs numbers are essentially equal to zero. If you look at the other coefficients and standard errors, they are essentially equal.

Btw, the multiplication by 3/2 is a small sample correction of the standard errors. I.e. `N/(N-1)` where `N` is the number of observations. It does normally not have any large impact since the number of observations `N` is typically in the hundreds (or thousands, or millions). But in your example `N==3`.

Hereâs an interesting thing with the python script:

``````import numpy as np
import statsmodels.api as sm
X = np.array([[1, 2, 3],[4, 5, 6], [7, 8, 9]])
y = np.array([1, 0, 2])
res = reg.fit()
res.params[1]/res.bse[1]   # this is some small number
res = reg.fit(method='qr')  # this one fails
res = reg.fit()   # do it again with default method 'pinv'
res.params[1]/res.bse[1]
# returns nan (because res.bse[1] is nan as it should be)
``````

But the signs are different. The output in the Julia is a positive value while the Pythonâs output is a negative value. There should be something different for sure. So, I donât think the proposed Julia code is equivalent to the Python implementation.

Let me check this out. If the coefficients are similar, then there should be the equivalent Julia code for the `res.params[1]/res.bse[1]`. Otherwise, the problem lies in fitting the OLS in Julia implementation in the same way as `sm.OLS` does. Actually, this `res.params[1]/res.bse[1]` is the important part of the code to me and the final result that should be used in my project.

I didnât get the point. What is the conclusion?

The main point is that OLS on a singular system is not well defined. As I have told you numerous times. The coefficients are not estimable. Entirely arbitrary choices must be made to get some coefficients. I have described juliaâs choices, and pythonâs choices, and described how you can mimic pythonâs choices in julia. The coefficients you obtain are relative to these choices, and typically are not meaningful without analyzing the nature of the singularity (or multicollinearity which it is often called in this context). Though the model as such can be used for prediction.

And, of course, as always with floating point operations, when comparing results it must be done up to a precision, there are no meaningful exact comparisons of floating point numbers, except in some very simple cases. The problem at hand, computing a pseudo inverse, involves inverting singular values which are not too small, i.e. one must choose what âtoo smallâ means, this will have an impact on the pseudoinverse, and consequently on the computed coefficients. Julia and python may have slightly different parameters for computing the pseudoinverse, and even slightly different ways to compute the singular value decomposition. I donât know.

Also, in the python-mimicry I multiplied the covariance matrix with `Xc' * y` to find the coefficients. Because I needed the covariance matrix anyway to find the standard errors. It is more common to solve the system `Xc' * Xc * x = Xc' * y` for `x`, which is mathematically the same, but typically uses a different method, e.g. a pivoted Cholesky decomposition or qr or whatnot. This will yield slightly different estimates, i.e. in the 17th digit or so. What exactly python does, I donât know. Fortunately, it doesnât matter, because of the following.

If your program relies on exact floating point values which varies with more or less arbitrary choices made by some package, you are doing it the wrong way.

Now, what you are computing is known as ât-scoreâ or ât-valueâ, how many standard errors your estimate is away from zero. This is a sensible thing to compute, and is used to judge whether the results are statistically significant. When the t-score is small (how small depends on your application), the coefficient is indistinguishable from zero with the data you have. The difference between `0.001` and `-0.001`, with a standard error around `1` is simply completely irrelevant.

4 Likes

Ok, this is going to be a long reply:
The aim is not to interpret the t-score! The value itself matters. I want to translate a Python code that detects bubble prices in historical data of an asset (like Microsoft, AKA MSFT). The code I was asking for is a part of this procedure. I will provide the full Python code and the translated Julia code in the following section to show why the value of `res.params[1]/res.bse[1]` matters not the interpretation of it. I run both codes on the same data to compare the results. I compared the `X` and `y` in both scripts to make sure the same data is being fed to the OLS model. I did it by printing the values of `X` and `y` and raising an arbitrary error after the print statements (to stop the run and compare the results by observing them).

Pyhton and Julia implementation in a glance

Pyhton implementation
``````import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
r0 = int(len(prices)*0.1)
#specify lags for the augmented Dickey-Fuller (ADF) test
#critical value of the right-tailed ADF-test (95%) from Phillips et al. (2015)
crit = 1.49
#transforming data
log_prices = np.array(np.log(prices))
delta_log_prices = log_prices[1:] - log_prices[:-1]
n = len(delta_log_prices)
for r2 in range(r0,n):
for r1 in range(0,r2-r0+1):
X0 = log_prices[r1:r2+1]
X = pd.DataFrame()
X[0] = X0
X[j] = np.append(np.zeros(j),delta_log_prices[r1:r2+1-j])
X = np.array(X)
Y = delta_log_prices[r1:r2+1]
res = reg.fit()

#visualising the results
plt.rc('xtick',labelsize = 8)
#printing dates when bubbles were detected
``````
Julia implementation
``````using LinearAlgebra
using Plots

Îlnprices = diff(lnprices)
n         = length(Îlnprices)
r0        = round(Int, length(lnprices)*0.1)
for (idxr2, r2) â enumerate(r0:n-1)
for (idxr1, r1) â enumerate(0:r2-r0)
Xâ      = lnprices[r1+1:r2]
X[:, 1] = Xâ
X[:, j+1] = vcat(zeros(j), Îlnprices[r1+1:r2-j])
end
Y = Îlnprices[r1+1:r2]
Xc = hcat(ones(size(X, 1)), X)   # add a constant
cvmat = pinv(Xc' * Xc)  # compute covariance matrix as pseudoinverse
coefs = cvmat * Xc' * Y   # perform the OLS
serr = sqrt.(diag(cvmat) .* length(Y)/(length(Y)-1))   # find standard errors, small sample adjust by N/(N-dof)
end
end
end

plot!(idxs, BSADF[idxs], seriestype=:scatter, color=:red, label="Bubbles", alpha=0.5)
end

BSADF, idxs = bubbledetector(log.(prices), 3, 0.8)
``````
Python's and Julia's output in a glance

The thing is, the values/trend isnât even similar to the Pythonâs output! Not even close! Meaning the values are extremely different! As I said earlier, I checked the inputs (`X` and `y`) in both scripts to make sure the same data is being fed to the OLS models. In conclusion, the difference is because of the OLS implementation.

I brought up all these details to explain why I need the same output as `res.params[1]/res.bse[1]`.

So, there is no hope to get the same result necessarily due to different implementations. Thank you.

There is no hope of getting meaningful results with a singular system such as in your example. It seems your real system is not degenerate, and then you can of course get the same results, up to the hardware precision. You can use the `fit(LinearModel, ...)` as I suggested, or `glm(...)` as you tried, but with the intercept added, i.e. `glm(Xc, y, Normal(), IdentityLink())`.

I thought your main issue was that you wanted results down to the same bit when the design matrix is singular, as in your example. Thatâs a tall order.

Now, anyway, I see I made an error in my julia program. Python does not seem to do a small sample correction of the standard errors, so that one can be removed. By a coincidence in the example that factor happened to be equal to the one I forgot, to multiply by the residual variance, so the standard errors can be computed with something like this:

``````resid = norm(Xc * coefs .- y)
serr = sqrt.(diag(cvmat)) .* resid
``````

And, actually, if you only use a single element, you can compute it as `serr2 = sqrt(cvmat[2,2]) * resid`.

1 Like

Youâre right. I should have brought up the details in the first place. I am sorry for any confusion or inconvenience that may caused.

Considering `reg = glm(Xc, Y, Normal(), IdentityLink())`, I donât know what should be the next line to compute the equivalent `res.params[1]/res.bse[1]` in this case.

This is great. Although there are some negligible differences as shown in the picture below, the outputâs trend is extremely similar to Pythonâs output. The only thing is the range of output. It seems like in Python, the values are multiplied by a constant value > 1 (it seems that the final `BSADF` values in Julia should be multiplied by 10):

The following Julia code is used for the plot above
``````using LinearAlgebra
using Plots

Îlnprices = diff(lnprices)
n         = length(Îlnprices)
r0        = round(Int, length(lnprices)*0.1)
for (idxr2, r2) â enumerate(r0:n-1)
for (idxr1, r1) â enumerate(0:r2-r0)
Xâ      = lnprices[r1+1:r2]
X[:, 1] = Xâ
X[:, j+1] = vcat(zeros(j), Îlnprices[r1+1:r2-j])
end
Y = Îlnprices[r1+1:r2]
Xc = hcat(ones(size(X, 1)), X)   # add a constant
cvmat = pinv(Xc' * Xc)  # compute covariance matrix as pseudoinverse
coefs = cvmat * Xc' * Y   # perform the OLS
resid = norm(Xc * coefs .- Y)
serr = sqrt(cvmat[2,2]) * resid
end
end
end

plot!(legend=:outerright)
end

BSADF, idxs = bubbledetector(log.(prices), 3, 1.49)
``````

In summary, I used the following lines as the equivalent Julia code for `sm.OLS`:

``````Xc = hcat(ones(size(X, 1)), X)   # add a constant
cvmat = pinv(Xc' * Xc)  # compute covariance matrix as pseudoinverse
coefs = cvmat * Xc' * Y   # perform the OLS
resid = norm(Xc * coefs .- Y)
serr = sqrt(cvmat[2,2]) * resid
coefs[2]/serr
``````

Itâs here: Home Âˇ GLM

``````coef(reg)[2] / stderror(reg)[2]
``````

The only thing is the range of output. It seems like in Python, the values are multiplied by a constant value > 1 (it seems that the final `BSADF` values in Julia should be multiplied by 10):

This I do not understand. However, I took the OLS equations from the top of my head, and they work fine in a toy example:

``````using LinearAlgebra, GLM
X = [
1 3 4
2 8 1
4 5 2
6 7 2
9 4 3
]

y = [5,4,6,2,3]
Xc = hcat(ones(size(X,1)), X)

# GLM:

regglm = glm(Xc, y, Normal(), IdentityLink())

# easy fit

reglm = fit(LinearModel, Xc, y)

#manual
cvmat = inv(Xc' * Xc)
cf = cvmat * Xc' * y
resid = norm(Xc * cf .- y)
se = sqrt.(diag(cvmat)) * resid

println("GLM:\n", regglm)
println("\n\nLM:\n", reglm)
println("\n\nMAN:")
[cf se]
``````

and the same in python:

``````import numpy as np
import statsmodels.api as sm

X = np.array([[1, 3, 4],[2,8,1], [4,5,2], [6,7,2], [9,4,3]])
y = np.array([5,4,6,2,3])
Thank you so much. This is exactly what I needed and the output exactly mimics the `sm.OLS`âs output! Thank you so much for your help.