How can I tell Julia what are the parameters in a linear model?

Hi all!

So, I’m experimenting with the linear regression ecosystem (GLM, Plots, DataFrames, StatsPlots and CSV). I understood how to produce a simple linear regression model and a multivariate one.
I know that OLS works even when your function is not linear in the variables, provided it is in the parameters. For instance, I tried fitting this paraboloid:

# excerpt to generate it
f((x,y)) = (x, y, x^2 + 2 * y^2 + 10*randn())

# excerpt to fit it
fm = @formula(z ~ x^2 + y^2)

and it works! Yay!

However, now I’m trying shifting the vertex:

f((x,y)) = (x, y, (x - x_0)^2 + 2 * (y - y_0)^2 + 10*randn())

Of course, now the estimate is off. How can I tell Julia that there are two more parameters to fit?

I’ve tried this

fm = @formula(z ~ (x - x_0)^2 +  (y - y_0)^2)

but it panics when I try to fit it:

julia> fm = @formula(z ~ (x - x_0)^2 +  (y - y_0)^2)
  (x,x_0)->(x - x_0) ^ 2
  (y,y_0)->(y - y_0) ^ 2

julia> @time(model = lm(fm, df))
ERROR: ArgumentError: There isn't a variable called 'x_0' in your data; the nearest names appear to be: x
 [1] ModelFrame(f::FormulaTerm{Term, Tuple{FunctionTerm{typeof(^), var"#87#91", (:x, :x_0)}, FunctionTerm{typeof(^), var"#89#93", (:y, :y_0)}}}, data::NamedTuple{(:x, :y, :z), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float64}}}; model::Type{LinearModel}, contrasts::Dict{Symbol, Any})
   @ StatsModels ~/.julia/packages/StatsModels/6uCgU/src/modelframe.jl:76
 [2] fit(::Type{LinearModel}, f::FormulaTerm{Term, Tuple{FunctionTerm{typeof(^), var"#87#91", (:x, :x_0)}, FunctionTerm{typeof(^), var"#89#93", (:y, :y_0)}}}, data::DataFrame, args::Nothing; contrasts::Dict{Symbol, Any}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ StatsModels ~/.julia/packages/StatsModels/6uCgU/src/statsmodel.jl:85
 [3] fit
   @ ~/.julia/packages/StatsModels/6uCgU/src/statsmodel.jl:78 [inlined]
 [4] #lm#5
   @ ~/.julia/packages/GLM/P0Ris/src/lm.jl:184 [inlined]
 [5] lm (repeats 2 times)
   @ ~/.julia/packages/GLM/P0Ris/src/lm.jl:184 [inlined]
 [6] top-level scope
   @ ./timing.jl:262 [inlined]
 [7] top-level scope
   @ ./REPL[287]:0

Is there a way to do it?


I was reasoning about my problem and it occured to me that expanding (x - x_0)^2 I get:

(x - x_0)^2 = x^2 - 2x_0*x + x_0^2

Is the last term (x_0^2) making the problem non-linear? Is this the meaning of the error produced by Julia? If yes, is there a way to solve it (which I presume is another function, for non linear models)?

The problem is just about parsing, not the math.

Inside @formula, Julia treats all symbols as columns in a the input DataFrame. x_0 is not a variable in your data frame, so its throwing an error.

Just make the x - x_0 variables outside the @formula. Maybe call it recentered_x. Then run lm with a @formula acting on the recentered_x.


Expanding is the right idea, yet, from the perspective of the model x_0 is a constant parameter, i.e.,

(x - x_0)^2 = x^2 - 2 * x_0 * x + x_0^2 = x^2 + b * x + c

Thus, you could fit the more general model with the formula z ~ x^2 + x + 1. I’m not quite sure, how you could impose the constraint that b = 2*x_0 and c = x_0^2, i.e., c = (b/2)^2, in a generalized linear model.

1 Like

Thanks to both!
Actually, I wanted to estimate both the coefficients, AND the vertex coordinates (x_0 and y_0). But if I did my calcs right, this makes the problem non linear, so using GLM.lm is not the way to go. Is this correct? And if so, does Julia have a quadratic equivalent?

P.S. Interestingly enough, MATLAB can evaluate it (so I suppose it recognizes it and switches to a non-linear model), but OTOH it cannot deal with more than 2 predictors (for what I understood).

So you want to solve the problem below?

\min_{x_0, \vec{\beta}} \ \sum_{i = 1}^{N} \left(y_i - \beta_1 (x_{1i} - x_0) - \beta_2(x_{2i} - x_0)\right)^2

Yes, exactly

BTW sorry for the noob question: how do you typeset LaTeX in posts?

GLM.jl can’t do that for you. Hopefully someone can chime in with an appropriate tutorial for an optimization package.

You do math in Discourse

Like this

Thanks (for both things)! Then it’s what I suspected. I don’t know enough about neither Julia nor regression analysis to create it myself, so I’ll check for other packages or revert to MATLAB/Octave if needed. Kudos!

Perhaps something like this:

using Optim
using DataFrames
df = DataFrame(y = rand(10), x₁ = rand(10), x₂ = rand(10), x₀ = repeat([rand(1:10)], 10))

f(β) = sum((df.y .- β[1]*(df.x₁ .- df.x₀) - β[2]*(df.x₂ .- df.x₀)).^2) # equation written by pdeffebach

result = optimize(f, [0.0, 0.0]) # Ideally, put suitable initial values instead of [0.0,0.0]

If x₀ is not a constant, it can be β[3].

Unfortunately, this doesn’t work. Even using the correct initial values, it overestimates a parameter. Let me explain:
I have this code to generate a DataFrame (generate_quadratic.jl):

using DataFrames
using CSV

x = range(-50, 50, length=101);
y = range(-70, 50, length=51);
domain = Iterators.product(x,y) |> collect
x_0 = 10;
y_0 = 20;
f((x,y)) = (x, y, (x - x_0)^2 + 2 * (y - y_0)^2 + 10*randn())
points = vec(map(f, domain))

@time(df = DataFrame(NamedTuple{(:x, :y, :z)}.(points)))
# print(df)

CSV.write("raw_quadratic_data.csv", df);

Then I tweaked your suggestion like this to estimate it (quadratic.jl):

using Optim
using DataFrames
using StatsPlots

df = DataFrame(CSV.File("raw_quadratic_data.csv"));

f(β) = sum((df.z .- β[1]*(df.x .- β[3]).^2 - β[2]*(df.y .- β[4])).^2) # equation written by pdeffebach

@df df scatter(:x, :y, :z)
result = optimize(f, [1.0, 2.0, 10.0, 20.0]) # Ideally, put suitable initial values instead of [0.0,0.0]
coefficients = Optim.minimizer(result)

best_fit_surface(x, y) = coefficients[1] * (x - coefficients[3])^2 + coefficients[2] * (y - coefficients[4])^2;

x = range(-50, 50, length=101);
y = range(-70, 50, length=51);
graph = surface!(x, y, best_fit_surface)

savefig(graph, "tentative_quadratic_plot.png");

I got in the REPL

julia> coefficients = Optim.minimizer(result)
4-element Vector{Float64}:

and the resulting plot is:

Am I doing something wrong?

I have made minor changes, please have a look:

using DataFrames
using Optim

x = range(-50, 50, length=101);
y = range(-70, 50, length=51);
domain = Iterators.product(x,y) |> collect
β = [1.0, 2.0, 10.0, 20.0]
f((x,y)) = (x, y, (β[1]*(x - β[3])^2 + β[2] * (y - β[4])^2 +  10*randn()))
points = vec(map(f, domain))

df = DataFrame(NamedTuple{(:x, :y, :z)}.(points))

function g(β) # since f is already used. 
    z = df.z; x = df.x; y = df.y
    x₀ = β[3]; y₀ = β[4]
    ẑ =  β[1]*(x .- x₀).^2 + β[2]*(y .- y₀).^2 
    return sum((z - ẑ).^2)

result = optimize(g, [1.0, 2.0, 10.0, 20.0]) # Actual initial conditions
coefficients = Optim.minimizer(result) # looks like it's the right answer. 

result = optimize(g, [1.6, 20.0, 0.0, 0.0]) # Some random initial conditions. 
coefficients = Optim.minimizer(result) # Still the correct answer. 

Indeed it works! :tada::confetti_ball: Thank you!

I’m not sure if I can flag two answers as solutions. In case I can’t, I’m going to credit you in a separate thread, to keep the question-solution pair more coherent.

1 Like

Let me know if you need help with making it faster.

There are several algorithms for optimisation, LBFGS if often the fastest. You can also use automatic differentiation. I think adding constraints on beta will also make it faster.

There are many performance improvements to be had before changing the algorithm. In particular, you want to “capture” the variables used in the optimization inside a let block.

Also, never do df.z in performant code, since DataFrames are type-unstable. Always work on the vector directly.

julia> z = df.z; x = df.x; y = df.y;

julia> g = let x = x, y = y, z = z 
           β -> begin 
                   x₀ = β[3]; y₀ = β[4]
                   ẑ =  β[1]*(x .- x₀).^2 + β[2]*(y .- y₀).^2 
                   return sum((z - ẑ).^2)