Oops, I see. I’m sorry for the confusion. Thanks for the effort anyway! Do you think a modification of your method could still work, or does it crucially exploit all of the Kronecker-product structure?
To be honest, I still can’t see why
\min_b \| \text{vec}(\text{diag}(Y)) - (X \otimes Z) \text{vec}(b) \|_2
(as suggested in the second post) isn’t the problem you’re trying to solve. So I don’t understand the problem and definitely can’t recommend how to solve it. Sorry.
In fact, this was not correct. Minimizing \Vert \operatorname{diagm}(Y) - Z B X^T \Vert (or the equivalent expression with Kronecker products) penalizes all of the elements of Z B X^T, when in fact you only want to minimize the differences in the diagonal elements.
So, @mikmoore’s solution is actually not applicable here because it solves the wrong problem (my fault), in addition to being too slow (because \operatorname{diagm}(Y) is too big here).
The iterative algorithm and the direct algorithm above both solve the correct problem, though.
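To spell out what I mean by the correct problem (as I understand the setup): it is the diagonal-only fit
\min_B \sum_{i=1}^N \left| Y_i - z_i^T B\, x_i \right|^2,
where x_i and z_i are the i-th rows of X and Z. Equivalently, it is a least-squares problem whose design matrix has x_i \otimes z_i as its i-th row; the i-th column of the Wᵀ matrix used in the code is the complex conjugate of this row.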
Unfortunately, in my use case the method does not converge in a reasonable time, so I’ll need to use an approximate method that I developed earlier. I’ll try to think about whether it is possible to develop a more direct method at a later stage.
That may mean that your problem is badly conditioned — have you considered using a Tikhonov (ridge regression) regularization? That corresponds to setting a damping parameter \lambda > 0 in IterativeSolvers.lsmr. This should improve the conditioning and convergence rate, and make the solutions more robust to noise, at the expense of a larger residual — you’ll have to experiment to find an appropriate regularization strength for your application.
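For example, a minimal sketch with small, made-up dimensions and an arbitrary λ (it rebuilds the Wᵀ matrix used in the code in this thread; note that lsmr’s damping enters the objective squared, so I pass \sqrt{\lambda} to match the normal-equations form below):
using IterativeSolvers, LinearAlgebra

m, n, N = 10, 20, 50                      # small sizes, just for illustration
X, Z, Y = rand(N, n), rand(N, m), rand(N)
Wᵀ = reshape([conj(X[i,k] * Z[i,ℓ]) for ℓ in 1:m, k in 1:n, i in 1:N], m*n, N)

λ = 1e-2                                  # regularization strength; tune for your data
b = lsmr(Wᵀ', Y; λ = sqrt(λ))             # minimizes ‖Wᵀ'*b − Y‖² + λ‖b‖²
B = reshape(b, m, n)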
The corresponding “direct”/non-iterative algorithm (à la the code above) is a variant of the normal equations:
B = reshape(cholesky(Hermitian(Wᵀ * Wᵀ' + λ * I)) \ (Wᵀ * Y), m, n)
(which is dominated by the O(N(mn)^2) cost of computing W^T W).
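Continuing the sketch above, the iterative and direct ridge solutions should agree as a sanity check:
B_direct = reshape(cholesky(Hermitian(Wᵀ * Wᵀ' + λ * I)) \ (Wᵀ * Y), m, n)
@show norm(B - B_direct) / norm(B_direct)   # should be small, limited by lsmr's tolerances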
I originally thought my problem would not be badly conditioned, but that may be a very good direction to look into. Thanks!
I thought I’d try another method. Inverse regression methods are useful if one is interested in a low-rank predictor subspace:
using DataFrames, Gadfly, LinearAlgebra, Statistics  # Gadfly for the plot at the end
function indmat(Y::AbstractVector; nq=4)
# Slice Y using quantiles
q = quantile(Y, range(0, 1.0, length=nq+1))
@show q
z = [zeros(nq-1); 0.1] # widen the last bin slightly so the maximum of Y is included
ind = (Y .>= q[1:end-1]') .& (Y .< (q[2:end] .+ z)')
return ind*1.0 # N × nq slice-membership matrix, as Float64
end
function lda(X, Y)
# Inverse regression
# similar to linear discriminant analysis
μX = mean(X, dims=1)
Xw = svd(X.-μX).U # whiten X
Xw ./= std(Xw, dims=1) # make unit variance
Ya = indmat(Y) # slice Y
Xhat = Ya*(cov(Ya)\cov(Ya,Xw))
# Try a truncated or approx SVD here instead?
F = svd(Xhat)
DC = Xw*F.V[:, axes(Ya, 2)] # Features or Direction Components
df = DataFrame(DC.-mean(DC, dims=1), [:DC1, :DC2, :DC3, :DC4]) # column names assume the default nq=4
df.Y = Y
return df
end
m, n, N = 100, 500, 100
X, Z, Y = rand(N, n), rand(N, m), rand(N)
Wᵀ = reshape([conj(X[i,k] * Z[i,ℓ]) for ℓ in 1:m, k in 1:n, i in 1:N], m*n, N)
@show size(Wᵀ);
df = lda(Wᵀ', Y)
dfs = stack(df, Not(:Y))
plot(dfs, x=:value, y=:Y, xgroup=:variable, Geom.subplot_grid(
Geom.point, free_x_axis=true), Guide.xlabel(nothing))
These random numbers have interesting features! If you want a predictive model, then simply build it from the DCs.
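For example (continuing from the df returned by lda above, purely as a hypothetical illustration), an ordinary least-squares fit of Y on the direction components:
A = [ones(nrow(df)) Matrix(df[:, Not(:Y)])]   # intercept plus the DC columns
coef = A \ df.Y                               # OLS coefficients
Ŷ = A * coef                                  # in-sample predictions
@show cor(Ŷ, df.Y)                            # rough measure of in-sample fit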
Thanks! Unfortunately, in my problem I can’t do data reduction (I know that all basis functions may matter, and the problem is in fact one of deterministic least squares).
It’s simple to transfer from the subspace back to the original space (and it would be interesting to see how close the prediction is). It should also be unbiased, whereas the ridge method is biased, though that bias can be iteratively corrected (https://doi.org/10.1080/13504850500425840).