How can I tell Julia what are the parameters in a linear model?

bertulli · September 23, 2022, 11:27am

Hi all!

So, I’m experimenting with the linear regression ecosystem (GLM, Plots, DataFrames, StatsPlots and CSV). I understood how to produce a simple linear regression model and a multivariate one.
I know that OLS works even when your function is not linear in the variables, provided it is in the parameters. For instance, I tried fitting this paraboloid:

# excerpt to generate it
f((x,y)) = (x, y, x^2 + 2 * y^2 + 10*randn())

# excerpt to fit it
fm = @formula(z ~ x^2 + y^2)

and it works! Yay!

However, now I’m trying shifting the vertex:

f((x,y)) = (x, y, (x - x_0)^2 + 2 * (y - y_0)^2 + 10*randn())

Of course, now the estimate is off. How can I tell Julia that there are two more parameters to fit?

I’ve tried this

fm = @formula(z ~ (x - x_0)^2 +  (y - y_0)^2)

but it panics when I try to fit it:

julia> fm = @formula(z ~ (x - x_0)^2 +  (y - y_0)^2)
FormulaTerm
Response:
  z(unknown)
Predictors:
  (x,x_0)->(x - x_0) ^ 2
  (y,y_0)->(y - y_0) ^ 2

julia> @time(model = lm(fm, df))
ERROR: ArgumentError: There isn't a variable called 'x_0' in your data; the nearest names appear to be: x
Stacktrace:
 [1] ModelFrame(f::FormulaTerm{Term, Tuple{FunctionTerm{typeof(^), var"#87#91", (:x, :x_0)}, FunctionTerm{typeof(^), var"#89#93", (:y, :y_0)}}}, data::NamedTuple{(:x, :y, :z), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float64}}}; model::Type{LinearModel}, contrasts::Dict{Symbol, Any})
   @ StatsModels ~/.julia/packages/StatsModels/6uCgU/src/modelframe.jl:76
 [2] fit(::Type{LinearModel}, f::FormulaTerm{Term, Tuple{FunctionTerm{typeof(^), var"#87#91", (:x, :x_0)}, FunctionTerm{typeof(^), var"#89#93", (:y, :y_0)}}}, data::DataFrame, args::Nothing; contrasts::Dict{Symbol, Any}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ StatsModels ~/.julia/packages/StatsModels/6uCgU/src/statsmodel.jl:85
 [3] fit
   @ ~/.julia/packages/StatsModels/6uCgU/src/statsmodel.jl:78 [inlined]
 [4] #lm#5
   @ ~/.julia/packages/GLM/P0Ris/src/lm.jl:184 [inlined]
 [5] lm (repeats 2 times)
   @ ~/.julia/packages/GLM/P0Ris/src/lm.jl:184 [inlined]
 [6] top-level scope
   @ ./timing.jl:262 [inlined]
 [7] top-level scope
   @ ./REPL[287]:0

Is there a way to do it?

Thanks!

bertulli · September 23, 2022, 2:31pm

I was reasoning about my problem and it occured to me that expanding (x - x_0)^2 I get:

(x - x_0)^2 = x^2 - 2x_0*x + x_0^2

Is the last term (x_0^2) making the problem non-linear? Is this the meaning of the error produced by Julia? If yes, is there a way to solve it (which I presume is another function, for non linear models)?

pdeffebach · September 23, 2022, 3:02pm

The problem is just about parsing, not the math.

Inside @formula, Julia treats all symbols as columns in a the input DataFrame. x_0 is not a variable in your data frame, so its throwing an error.

Just make the x - x_0 variables outside the @formula. Maybe call it recentered_x. Then run lm with a @formula acting on the recentered_x.

bertschi · September 23, 2022, 4:59pm

Expanding is the right idea, yet, from the perspective of the model x_0 is a constant parameter, i.e.,

(x - x_0)^2 = x^2 - 2 * x_0 * x + x_0^2 = x^2 + b * x + c

Thus, you could fit the more general model with the formula z ~ x^2 + x + 1. I’m not quite sure, how you could impose the constraint that b = 2*x_0 and c = x_0^2, i.e., c = (b/2)^2, in a generalized linear model.

bertulli · September 23, 2022, 5:31pm

Thanks to both!
Actually, I wanted to estimate both the coefficients, AND the vertex coordinates (x_0 and y_0). But if I did my calcs right, this makes the problem non linear, so using GLM.lm is not the way to go. Is this correct? And if so, does Julia have a quadratic equivalent?

P.S. Interestingly enough, MATLAB can evaluate it (so I suppose it recognizes it and switches to a non-linear model), but OTOH it cannot deal with more than 2 predictors (for what I understood).

pdeffebach · September 23, 2022, 5:34pm

So you want to solve the problem below?

\min_{x_0, \vec{\beta}} \ \sum_{i = 1}^{N} \left(y_i - \beta_1 (x_{1i} - x_0) - \beta_2(x_{2i} - x_0)\right)^2

bertulli · September 23, 2022, 5:38pm

Yes, exactly

BTW sorry for the noob question: how do you typeset LaTeX in posts?

pdeffebach · September 23, 2022, 5:39pm

GLM.jl can’t do that for you. Hopefully someone can chime in with an appropriate tutorial for an optimization package.

You do math in Discourse

$$
Like this
$$

bertulli · September 23, 2022, 5:45pm

Thanks (for both things)! Then it’s what I suspected. I don’t know enough about neither Julia nor regression analysis to create it myself, so I’ll check for other packages or revert to MATLAB/Octave if needed. Kudos!

ayushpatnaikgit · September 24, 2022, 5:14am

Perhaps something like this:

using Optim
using DataFrames
df = DataFrame(y = rand(10), x₁ = rand(10), x₂ = rand(10), x₀ = repeat([rand(1:10)], 10))

f(β) = sum((df.y .- β[1]*(df.x₁ .- df.x₀) - β[2]*(df.x₂ .- df.x₀)).^2) # equation written by pdeffebach

result = optimize(f, [0.0, 0.0]) # Ideally, put suitable initial values instead of [0.0,0.0]
Optim.minimizer(result)

If x₀ is not a constant, it can be β[3].

bertulli · September 26, 2022, 10:36am

Unfortunately, this doesn’t work. Even using the correct initial values, it overestimates a parameter. Let me explain:
I have this code to generate a DataFrame (generate_quadratic.jl):

using DataFrames
using CSV

x = range(-50, 50, length=101);
y = range(-70, 50, length=51);
domain = Iterators.product(x,y) |> collect
x_0 = 10;
y_0 = 20;
f((x,y)) = (x, y, (x - x_0)^2 + 2 * (y - y_0)^2 + 10*randn())
points = vec(map(f, domain))

@time(df = DataFrame(NamedTuple{(:x, :y, :z)}.(points)))
# print(df)

CSV.write("raw_quadratic_data.csv", df);

Then I tweaked your suggestion like this to estimate it (quadratic.jl):

using Optim
using DataFrames
using StatsPlots

df = DataFrame(CSV.File("raw_quadratic_data.csv"));

f(β) = sum((df.z .- β[1]*(df.x .- β[3]).^2 - β[2]*(df.y .- β[4])).^2) # equation written by pdeffebach

@df df scatter(:x, :y, :z)
result = optimize(f, [1.0, 2.0, 10.0, 20.0]) # Ideally, put suitable initial values instead of [0.0,0.0]
coefficients = Optim.minimizer(result)

best_fit_surface(x, y) = coefficients[1] * (x - coefficients[3])^2 + coefficients[2] * (y - coefficients[4])^2;

x = range(-50, 50, length=101);
y = range(-70, 50, length=51);
graph = surface!(x, y, best_fit_surface)

savefig(graph, "tentative_quadratic_plot.png");

I got in the REPL

julia> coefficients = Optim.minimizer(result)
4-element Vector{Float64}:
    1.0003341599739208
 -120.00187363980439
    9.997530795236802
   25.795981453902133

and the resulting plot is:
tentative_quadratic_plot

Am I doing something wrong?

ayushpatnaikgit · September 26, 2022, 1:53pm

I have made minor changes, please have a look:

using DataFrames
using Optim

x = range(-50, 50, length=101);
y = range(-70, 50, length=51);
domain = Iterators.product(x,y) |> collect
β = [1.0, 2.0, 10.0, 20.0]
f((x,y)) = (x, y, (β[1]*(x - β[3])^2 + β[2] * (y - β[4])^2 +  10*randn()))
points = vec(map(f, domain))

df = DataFrame(NamedTuple{(:x, :y, :z)}.(points))

function g(β) # since f is already used. 
    z = df.z; x = df.x; y = df.y
    x₀ = β[3]; y₀ = β[4]
    ẑ =  β[1]*(x .- x₀).^2 + β[2]*(y .- y₀).^2 
    return sum((z - ẑ).^2)
end

result = optimize(g, [1.0, 2.0, 10.0, 20.0]) # Actual initial conditions
coefficients = Optim.minimizer(result) # looks like it's the right answer. 

result = optimize(g, [1.6, 20.0, 0.0, 0.0]) # Some random initial conditions. 
coefficients = Optim.minimizer(result) # Still the correct answer.

bertulli · September 26, 2022, 4:43pm

Indeed it works! Thank you!

I’m not sure if I can flag two answers as solutions. In case I can’t, I’m going to credit you in a separate thread, to keep the question-solution pair more coherent.

ayushpatnaikgit · September 26, 2022, 4:51pm

Let me know if you need help with making it faster.

There are several algorithms for optimisation, LBFGS if often the fastest. You can also use automatic differentiation. I think adding constraints on beta will also make it faster.

pdeffebach · September 26, 2022, 5:24pm

There are many performance improvements to be had before changing the algorithm. In particular, you want to “capture” the variables used in the optimization inside a let block.

Also, never do df.z in performant code, since DataFrames are type-unstable. Always work on the vector directly.

julia> z = df.z; x = df.x; y = df.y;

julia> g = let x = x, y = y, z = z 
           β -> begin 
                   x₀ = β[3]; y₀ = β[4]
                   ẑ =  β[1]*(x .- x₀).^2 + β[2]*(y .- y₀).^2 
                   return sum((z - ẑ).^2)
           end
       end;

Topic		Replies	Views
The simplest linear fit with GLM Tooling glm	13	5287	November 11, 2021
Quadratic (paraboloid) model fitting Statistics regression , fit , curve-fitting	0	497	September 29, 2022
Using all independent variables with @formula in a multiple linear model New to Julia glm	18	4423	January 29, 2023
How to pass a function in a variable inside a linear regression? New to Julia regression , glm	2	628	July 28, 2020
Very basic GLM help New to Julia regression , fit , glm	6	3577	November 9, 2020

How can I tell Julia what are the parameters in a linear model?

Related topics