Best way to solve a linear regression problem on product features

Hi, I need to solve the following least-squares problem efficiently, where X and Z denote two matrices of features. I could transform it into an ordinary least-squares problem manually by first creating a feature matrix W that contains all products of columns of X with columns of Z, but that seems rather burdensome with many features.

I wish to minimize the least-squares error in the following over b_{l,k}:

Y_i = \sum_k\sum_l b_{lk} X_{i,k}Z_{i,l}

If necessary, I can provide X, Y and Z in batches.

Say X has around 500 features and Z has around 20-100 features. What is the best package/function to use here?

In a sense the product here implies some kind of row-wise Kronecker product, which hints that there may be a smart QR-decomposition-based approach that makes this substantially easier to solve.

If it helps, we can assume here that the X_{i,k} are independent of the Z_{i,l} .
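
For concreteness, this is roughly what I mean by building W manually from all column products (just a sketch; the helper name and the column ordering are arbitrary):

function build_W(X::AbstractMatrix, Z::AbstractMatrix)
    N = size(X, 1)
    nX, nZ = size(X, 2), size(Z, 2)
    W = Matrix{promote_type(eltype(X), eltype(Z))}(undef, N, nX*nZ)
    for k in 1:nX, l in 1:nZ
        # column (k-1)*nZ + l is the elementwise product of column k of X and column l of Z
        W[:, (k-1)*nZ + l] .= view(X, :, k) .* view(Z, :, l)
    end
    return W
end
# after which the coefficients are just W \ Y (reshaped if desired)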

Right, your equation is related to, but not equivalent to (see below), the following problem:

\min_B \Vert Y - ZBX^T \Vert_{\text{F}} \, ,

where Y is a diagonal matrix of the Y_i values, B is the matrix of b_{\ell,k} values, and \Vert \cdot \Vert_{\text{F}} is the Frobenius norm. An efficient method for this problem is described in Van Loan and Pitsianis (1993).

If we let y = \operatorname{vec}(Y) and b = \operatorname{vec}(B), then this becomes the ordinary least-squares problem:

\min_b \Vert y - (X \otimes Z) b \Vert \, ,

You could, of course, solve this directly in Julia with b = kron(X, Z) \ y, where y = vec(diagm(Y)) is built from the vector of Y_i values, and then obtain your matrix B with reshape(b, m, n). This method has the advantage of being easy and built-in, but X \otimes Z gets big quickly. A more efficient method was described by Fausett and Fulton (1994).
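
(To see how quickly: y = \operatorname{vec}(Y) has length N^2, so X \otimes Z has N^2 rows and mn columns, i.e. it grows quadratically with the number of samples.)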

Off the top of my head, I don’t know of a Julia package that implements one of these matrix/Kronecker least-squares algorithms, but they don’t seem terrible to implement yourself. Another possibility is just to do conjugate-gradient (CG) iterations on the normal equations, where you represent the linear operator implicitly (converting it lazily into matrix multiplications when a Kronecker product acts on a vector; Kronecker.jl can perhaps help with this).

The formulations above don’t directly take advantage of the fact that your Y matrix is diagonal, however. The easiest way to exploit that seems to be an iterative algorithm such as CG on the normal equations, for a formulation where you treat \operatorname{diag}(ZBX^T) implicitly as a linear operator on b. (The usual caveat applies that working directly with the normal equations squares the condition number, so this should only be done if your problem is sufficiently well conditioned, or perhaps if you include a Tikhonov regularization.)
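
(Concretely, writing L[b] := \operatorname{diag}(ZBX^T) for that linear operator and Y for the vector of Y_i values, the Tikhonov-regularized problem \min_b \Vert L[b] - Y \Vert^2 + \lambda^2 \Vert b \Vert^2 leads to the normal equations (L^T L + \lambda^2 I)\, b = L^T[Y], which CG can solve using only applications of L and L^T.)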

8 Likes

@stevengj thank you for your elaborate reply. I see the problem with using kron directly indeed (my computer did not enjoy that).

With respect to the last paragraph: I’m not experienced with implementing linear operators implicitly and writing conjugate-gradient iterations; would you happen to have a pointer to a (pseudo-)algorithm that applies in this case?

Edit: I’m in a simulation-based context and I can generate as many samples as I want, so I do not expect conditioning problems.

Let’s define the linear operator L[b] = \left(\sum_{k,\ell} b_{\ell,k} X_{i,k} Z_{i,\ell}\right)_{i=1,\ldots,N} = \operatorname{diag}(ZBX^T). You could implement this in Julia with e.g.

function L(b::AbstractMatrix, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(b, X, Z)
    m,n = size(b)
    N = size(X,1)
    (n == size(X,2) && m == size(Z,2) && N == size(Z,1)) || throw(DimensionMismatch())
    Y = Vector{typeof(zero(eltype(b)) * zero(eltype(X)) * zero(eltype(Z)))}(undef, N)
    @inbounds for i = 1:N
        Yᵢ = zero(eltype(Y))
        for k = 1:n; @simd for ℓ = 1:m
            Yᵢ += b[ℓ,k] * X[i,k] * Z[i,ℓ]
        end; end
        Y[i] = Yᵢ
    end
    return Y
end

(You could also use something like Tullio.jl, but unfortunately that doesn’t work with the latest Julia. There are various other ways to implement this more compactly, but I wanted to avoid any unnecessary allocations since your matrices might be large.)
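
For instance, a compact (but allocating) one-liner for the same operator, as a sketch: L[b] is just the diagonal of Z b X^T, i.e. a row-wise sum:

# compact alternative sketch: diag(Z*b*Xᵀ) via a row-wise sum
# (allocates the N×n intermediate Z*b, unlike the loop above)
L_compact(b, X, Z) = vec(sum((Z * b) .* X, dims=2))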

For least-squares, you will also need the transposed operator:

L^T[Y] = \left(\sum_{i} Y_i X_{i,k} Z_{i,\ell}\right)_{\ell = 1\ldots m, k = 1\ldots n}

which could be implemented e.g. with:

function Lᵀ(Y::AbstractVector, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(Y, X, Z)
    n, m = size(X,2), size(Z,2)
    N = length(Y)
    N == size(X,1) == size(Z,1) || throw(DimensionMismatch())
    b = Matrix{typeof(zero(eltype(Y)) * zero(eltype(X)) * zero(eltype(Z)))}(undef, m,n)
    @inbounds for k = 1:n, ℓ = 1:m
        b_ℓk = zero(eltype(Y))
        @simd for i = 1:N
            b_ℓk += Y[i] * conj(X[i,k] * Z[i,ℓ])
        end
        b[ℓ,k] = b_ℓk 
    end
    return b
end

Solving the least-squares problem is equivalent (modulo roundoff errors) to solving the normal equations L^T [L [b]] = L^T [y]. So, you just need to pass these linear operators to an iterative least-squares algorithm (defined implicitly from the mapping functions with e.g. LinearMaps.jl), with reshape and vec to convert matrices to/from vectors.

For example, using IterativeSolvers.jl and some random data:

using IterativeSolvers, LinearMaps

m, n, N = 4, 3, 100
X, Z, Y = rand(N, n), rand(N, m), rand(N)

op = FunctionMap{Float64,false}(bvec -> L(reshape(bvec,m,n), X, Z), Y -> vec(Lᵀ(Y, X, Z)), N, m*n)
b = reshape(lsqr(op, Y), m, n)

(Caveat: the code above runs, but I haven’t tested it carefully.) Of course, you should put the above code into a function if you want to run it on a large-scale problem, as otherwise the global variables will slow things down.
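
Something along these lines, as a minimal untested sketch (the function name is arbitrary):

function fit_b(X::AbstractMatrix, Z::AbstractMatrix, Y::AbstractVector)
    N, n = size(X)
    m = size(Z, 2)
    op = FunctionMap{Float64,false}(bvec -> L(reshape(bvec, m, n), X, Z),
                                    Yv -> vec(Lᵀ(Yv, X, Z)), N, m*n)
    return reshape(lsqr(op, Y), m, n)   # m×n matrix of coefficients b[ℓ,k]
end

B = fit_b(X, Z, Y)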

6 Likes

Thank you, that is an amazing and very elaborate answer! I will study it in detail and try to implement this correctly and see how it fares in my use-case!

(Of course, maybe someone will chime in and comment that some statistics package like GLM.jl already implements all this in a much better way.)

1 Like

For reference, here is an implementation of this strategy:

Wᵀ = reshape([conj(X[i,k] * Z[i,ℓ]) for ℓ in 1:m, k in 1:n, i in 1:N], m*n, N)
reshape(Wᵀ' \ Y, m, n)

I’ve verified that it gives the same answer (to 12 digits) as the iterative lsqr example above, which gives me some confidence that my code is correct.

The complexity of the direct OLS approach is O(N(mn)^2). In contrast, each iteration of the lsqr approach with implicit (matrix-free) operators is O(Nmn). So, if the iteration converges quickly enough, and mn is large enough, the latter should be faster.
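
(For a rough sense of scale: with, say, N = 10^5, n = 500 and m = 100, so mn = 5 \cdot 10^4, the direct approach costs on the order of N(mn)^2 \approx 2.5 \cdot 10^{14} operations, versus roughly Nmn = 5 \cdot 10^9 per matrix-free iteration.)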

4 Likes

Thanks! For now, I’ve noticed that your implementation is already roughly a 5x speed-up for my use-case. lsmr instead of lsqr shaves off another 30% with only a minor decrease in MSE. I notice that LinearMaps has a facesplitting function, which might be a package implementation of your code. However, it seems that solving with LinearMaps.facesplitting(X,Z) is (roughly 20-40%) slower than your code.
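
For reference, this is roughly how I called it (assuming the (ℓ,k) ordering matches the reshape used for L above):

op_fs = LinearMaps.facesplitting(X, Z)          # lazy row-wise (face-splitting) product
b_fs = reshape(lsmr(op_fs, Y), size(Z, 2), size(X, 2))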

For reference, this version of L takes roughly 3 ms (compared to 8 ms) on my device, by looping over columns on the outside and over rows on the inside. (At the cost of having to initialize Y with zeros.)

function L_alt(b::AbstractMatrix, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(b, X, Z)
    N = size(X,1)
    #Y = Vector{promote_type(eltype(b), eltype(X), eltype(Z))}(undef, N)
    Y = zeros(promote_type(eltype(b), eltype(X), eltype(Z)), N)
    @inbounds for k in axes(X, 2), ℓ in axes(Z, 2); @simd for i in eachindex(Y)
            Y[i] += b[ℓ,k] * X[i,k] * Z[i,ℓ]
        end
    end
    return Y
end

1 Like

If you are doing performance optimization, I would also consider defining the operators to act in-place, i.e. define them as Lᵀ!(b::AbstractMatrix, Y::AbstractVector, X::AbstractMatrix, Z::AbstractMatrix) (writing the output in-place into b) and L!(Y::AbstractVector, b::AbstractMatrix, X::AbstractMatrix, Z::AbstractMatrix) (writing the output in-place into Y), and define your op with the in-place option:

op = FunctionMap{Float64,true}((Y, bvec) -> L!(Y, reshape(bvec,m,n), X, Z), (bvec, Y) -> vec(Lᵀ!(reshape(bvec,m,n), Y, X, Z)), N, m*n)

If N is huge you might also consider parallelizing (e.g. with @threads) over this dimension.
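
For example, a hypothetical, untested sketch of a threaded variant of the original L (each Y[i] is independent, so the row loop parallelizes naturally):

function L_threaded(b::AbstractMatrix, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(b, X, Z)
    m, n = size(b)
    N = size(X, 1)
    Y = Vector{promote_type(eltype(b), eltype(X), eltype(Z))}(undef, N)
    Threads.@threads for i in 1:N   # each Y[i] is computed independently
        Yᵢ = zero(eltype(Y))
        @inbounds for k in 1:n, ℓ in 1:m
            Yᵢ += b[ℓ,k] * X[i,k] * Z[i,ℓ]
        end
        Y[i] = Yᵢ
    end
    return Y
end

Whether this actually pays off depends on the sizes; if the per-row work is small, the threading overhead may dominate.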

1 Like

I’d expect multithreading overhead to dominate any potential gains here (for me, runtimes shot up by an order of magnitude), as the operations themselves are very small; a different way of iterating may lead to improvements. In-place does seem to be a marginal improvement indeed!

Thank you so much for all your help!

For reference, a numerical example (where I forgot the burn-in, so take the first estimate runtime with a grain of salt…)

using Random, Distributions, LinearAlgebra, IterativeSolvers, LinearMaps, BenchmarkTools
Random.seed!(42)

function L(b::AbstractMatrix, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(b, X, Z)
    N = size(X,1)
    Y = zeros(promote_type(eltype(b), eltype(X), eltype(Z)), N)
    @inbounds for k in axes(X, 2), ℓ in axes(Z, 2);
        @simd for i in eachindex(Y)
            Y[i] += b[ℓ,k] * X[i,k] * Z[i,ℓ]
        end
    end
    return Y
end

function L!(Y::AbstractVector, b::AbstractMatrix, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(b, X, Z)
    Y .= 0
    @inbounds for k in axes(X, 2), ℓ in axes(Z, 2);
        @simd for i in eachindex(Y)
            Y[i] += b[ℓ,k] * X[i,k] * Z[i,ℓ]
        end
    end
    return Y
end

function Lᵀ(Y::AbstractVector, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(Y, X, Z)
    n, m = size(X,2), size(Z,2)
    b = Matrix{promote_type(eltype(Y), eltype(X), eltype(Z))}(undef, m,n)
    @inbounds for k in axes(X, 2), ℓ in axes(Z, 2)
        b_ℓk = zero(eltype(Y))
        @simd for i in eachindex(Y)
            b_ℓk += Y[i] * X[i,k] * Z[i,ℓ]
        end
        b[ℓ,k] = b_ℓk 
    end
    return b
end

function Lᵀ!(b::AbstractMatrix, Y::AbstractVector, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(Y, X, Z)
    @inbounds for k in axes(X, 2), ℓ in axes(Z, 2)
        b_ℓk = zero(eltype(b))
        @simd for i = eachindex(Y)
            b_ℓk += Y[i] * X[i,k] * Z[i,ℓ]
        end
        b[ℓ,k] = b_ℓk 
    end
    return b
end


function test(N, d, k)
    X = rand(N, d)
    Z = rand(N, k)

    XZ = Matrix{Float64}(undef, N, d*k)
    @time for di in 1:d
        XZ[:, 1 + (di - 1)*k:di*k] .= view(X, :, di) .* Z
    end

    β = rand(d*k)
    Y = XZ * β + randn(N)

    vec_l = 10
    d_vec = floor.(Int, range(1, d, vec_l))
    d_runtimes = Vector{Float64}(undef, vec_l)
    d_runtimes2 = Vector{Float64}(undef, vec_l)
    mse_est = Vector{Float64}(undef, vec_l)
    mse_est_2 = Vector{Float64}(undef, vec_l)
    d_runtimes3 = Vector{Float64}(undef, vec_l)
    mse_est_3 = Vector{Float64}(undef, vec_l)
    d_runtimes4 = Vector{Float64}(undef, vec_l)
    mse_est_4 = Vector{Float64}(undef, vec_l)
    
    for (i, dsub) in enumerate(d_vec)
        X_sub = X[:, 1:dsub]
        XZ_sub = Matrix{Float64}(undef, N, dsub*k)
        for di in 1:dsub
            XZ_sub[:, 1 + (di - 1)*k:di*k] .= view(X_sub, :, di) .* Z
        end

        println("Estimating $i using inverse")
        β̂ = @timed XZ_sub \ Y
        println("Took $(β̂.time) seconds")
        mse_est[i] = mean((XZ_sub * β̂.value - Y).^2)
        d_runtimes[i] = β̂.time

        println("Estimating $i using lsqr")
        op = FunctionMap{Float64,false}(bvec -> L(reshape(bvec,k,dsub), X_sub, Z), Y -> vec(Lᵀ(Y, X_sub, Z)), N, k*dsub)
        op_inplace = FunctionMap{Float64,true}((Y, bvec) -> L!(Y, reshape(bvec, k, dsub), X_sub, Z), (bvec, Y) -> vec(Lᵀ!(reshape(bvec, k, dsub), Y, X_sub, Z)), N, k*dsub)
        β̂ = @timed reshape(lsqr(op, Y), k, dsub)
        println("Took $(β̂.time) seconds")
        mse_est_2[i] = mean((L(reshape(β̂.value,k,dsub), X_sub, Z) - Y).^2)

        d_runtimes2[i] = β̂.time

        println("Estimating $i using lsmr")
        β̂ = @timed reshape(lsmr(op, Y), k, dsub)
        println("Took $(β̂.time) seconds")
        mse_est_3[i] = mean((L(reshape(β̂.value,k,dsub), X_sub, Z) - Y).^2)
        d_runtimes3[i] = β̂.time

        println("Estimating $i using lsmr inplace")
        β̂ = @timed reshape(lsmr(op_inplace, Y), k, dsub)
        println("Took $(β̂.time) seconds")
        mse_est_4[i] = mean((L(reshape(β̂.value,k,dsub), X_sub, Z) - Y).^2)
        d_runtimes4[i] = β̂.time
    end
    println(mse_est)
    println(mse_est_2)
    println(mse_est_3)
    println(mse_est_4)
    
    println(d_runtimes)
    println(d_runtimes2)
    println(d_runtimes3)
    println(d_runtimes4)
end

test(100_000, 50, 100)

gives

0.379931 seconds
Estimating 1 using inverse
Took 0.228198375 seconds
Estimating 1 using lsqr
Took 0.20376975 seconds
Estimating 1 using lsmr
Took 0.114865875 seconds
Estimating 1 using lsmr inplace
Took 0.12395125 seconds
Estimating 2 using inverse
Took 3.092642 seconds
Estimating 2 using lsqr
Took 0.784247291 seconds
Estimating 2 using lsmr
Took 0.615687167 seconds
Estimating 2 using lsmr inplace
Took 0.615284 seconds
Estimating 3 using inverse
Took 9.574711583 seconds
Estimating 3 using lsqr
Took 2.021692375 seconds
Estimating 3 using lsmr
Took 1.425100083 seconds
Estimating 3 using lsmr inplace
Took 1.395683833 seconds
Estimating 4 using inverse
Took 21.734518958 seconds
Estimating 4 using lsqr
Took 3.4526595 seconds
Estimating 4 using lsmr
Took 2.303357708 seconds
Estimating 4 using lsmr inplace
Took 2.277328375 seconds
Estimating 5 using inverse
Took 37.174790166 seconds
Estimating 5 using lsqr
Took 5.026946708 seconds
Estimating 5 using lsmr
Took 3.495458833 seconds
Estimating 5 using lsmr inplace
Took 3.433379792 seconds
Estimating 6 using inverse
Took 57.444847375 seconds
Estimating 6 using lsqr
Took 6.52967775 seconds
Estimating 6 using lsmr
Took 4.626406417 seconds
Estimating 6 using lsmr inplace
Took 4.491984125 seconds
Estimating 7 using inverse
Took 79.451948959 seconds
Estimating 7 using lsqr
Took 8.227401208 seconds
Estimating 7 using lsmr
Took 5.232909541 seconds
Estimating 7 using lsmr inplace
Took 5.240785166 seconds
Estimating 8 using inverse
Took 110.555767958 seconds
Estimating 8 using lsqr
Took 9.65445625 seconds
Estimating 8 using lsmr
Took 6.07731025 seconds
Estimating 8 using lsmr inplace
Took 6.072688333 seconds
Estimating 9 using inverse
Took 136.001674292 seconds
Estimating 9 using lsqr
Took 11.162575667 seconds
Estimating 9 using lsmr
Took 6.841667083 seconds
Estimating 9 using lsmr inplace
Took 6.843795709 seconds
Estimating 10 using inverse
Took 177.454427917 seconds
Estimating 10 using lsqr
Took 15.239018417 seconds
Estimating 10 using lsmr
Took 10.854387917 seconds
Estimating 10 using lsmr inplace
Took 11.097065416 seconds
[97486.16854910454, 18328.43239810469, 9018.555817126977, 4893.845608371163, 3189.6915435352157, 1989.801489581642, 1294.25192687956, 706.1769890062392, 347.7741537705133, 0.9500232830021116]
[97486.16854910481, 18328.432398146768, 9018.555817133984, 4893.845608435037, 3189.691543539357, 1989.8014896004731, 1294.2519268813387, 706.1769890090135, 347.7741537748462, 0.9500232830059561]
[97486.16854936874, 18328.432408960995, 9018.55590730859, 4893.845862939049, 3189.691580688102, 1989.8015190086182, 1294.2519695989643, 706.1770434840151, 347.77421115991575, 0.9500233151409269]
[97486.16854936874, 18328.432408960995, 9018.55590730859, 4893.845862939049, 3189.691580688102, 1989.8015190086182, 1294.2519695989643, 706.1770434840151, 347.77421115991575, 0.9500233151409269]
[0.228198375, 3.092642, 9.574711583, 21.734518958, 37.174790166, 57.444847375, 79.451948959, 110.555767958, 136.001674292, 177.454427917]
[0.20376975, 0.784247291, 2.021692375, 3.4526595, 5.026946708, 6.52967775, 8.227401208, 9.65445625, 11.162575667, 15.239018417]
[0.114865875, 0.615687167, 1.425100083, 2.303357708, 3.495458833, 4.626406417, 5.232909541, 6.07731025, 6.841667083, 10.854387917]
[0.12395125, 0.615284, 1.395683833, 2.277328375, 3.433379792, 4.491984125, 5.240785166, 6.072688333, 6.843795709, 11.097065416]

EDIT: the earlier post was mistaken, so everything here is not applicable.

To springboard off of stevengj’s earlier post,
\min_b \|y - (X \otimes Z) b \|
This can be solved using the QR decomposition
Q_{XZ} R_{XZ} = X \otimes Z
b = R_{XZ}^{-1} Q_{XZ}^\top y
If we decompose
Q_{X} R_{X} = X \qquad Q_{Z} R_{Z} = Z
then we can use the properties of the Kronecker product to do the following algebra:
Q_{XZ} R_{XZ} = (Q_X \otimes Q_Z) (R_X \otimes R_Z)
b = (R_X^{-1} \otimes R_Z^{-1}) (Q_X^\top \otimes Q_Z^\top) y
b = (R_X^{-1} \otimes R_Z^{-1}) \text{vec}(Q_Z^\top \text{mat}(y) Q_X)
b = \text{vec}\left( R_Z^{-1} Q_Z^\top \text{mat}(y) Q_X R_X^{-\top} \right)
where \text{mat} reshapes a vector into a matrix that matches the adjoining dimensions.

We could compute this as written, but notice these QR terms have come back together into (roughly, or exactly in the square case) Z^{-1} and X^{-\top}. So in code we actually only need \ and / (which apply QR).
b = vec(Z \ reshape(y, size(Z,1), size(X,1)) / X')

Verifying this:

julia> using LinearAlgebra

julia> X = randn(10, 5); Z = randn(8, 6); y = randn(size(X,1) * size(Z,1));

julia> QX,RX = qr(X);

julia> QZ,RZ = qr(Z);

julia> b1 = kron(X, Z) \ y;

julia> b2 = vec(RZ \ (Matrix(QZ)' * reshape(y, size(Z,1), size(X,1)) * Matrix(QX)) / RX'); # Matrix to get the rectangular Q's instead of square

julia> b3 = vec(Z \ reshape(y, size(Z,1), size(X,1)) / X');

julia> norm(b1 - b2) / norm(b1)
1.5903193978874335e-15

julia> norm(b1 - b3) / norm(b1)
1.4172872362868884e-15

Unless these matrices are sparse or have other special structure, this direct solution is probably more effective than an iterative one. Or an iterative one based on this could be effective.

I see, that is interesting! But in my case, size(X) = (N, d), size(Z) = (N, K) and size(y) = N; I guess I’ll need to make y into a diagonal matrix here, right? (Edit: no luck; for my numbers the terminal crashes if I replace reshape(y, size(Z,1), size(X,1)) with Diagonal(y).)

EDIT: this solves the wrong problem.

Adjusting for your exact case and the original y,

using LinearAlgebra
X = randn(10, 5); Z = randn(10, 6); y = randn(10);
QX,RX = qr(X);
QZ,RZ = qr(Z);

b1 = kron(X, Z) \ vec(Diagonal(y));
b2 = vec(RZ \ (Matrix(QZ)' * Diagonal(y) * Matrix(QX)) / RX'); # Matrix to get the rectangular Q's instead of square
b3 = vec(Z \ Diagonal(y) / X');
norm(b1 - b2) / norm(b1)
norm(b1 - b3) / norm(b1)

works for me.

1 Like

Thank you! For me, the following parameters are realistic and make Julia hang / take a very long time with this QR approach:

function test2(N, d, k)
    X = rand(N, d)
    Z = rand(N, k)
    y = rand(N)

    b3 = vec(Z \ Diagonal(y) / X')
end
test2(100_000, 10, 100)

whereas the lsmr method above does not seem to have a problem there. If it is possible to get a direct QR approach to work, that would be very nice. In the end, I’ll need something like test2(100_000, 500, 100) to work, ideally within a few minutes (but the faster the better!). (lsmr works for the test2(100_000, 100, 100) case at least; I haven’t tried the case with d=500 yet.)

The issue is that Z \ Diagonal(y) (or, more specifically, QZ \ Diagonal(y)) apparently does not have a specialized method for Diagonal(y), so it’s basically materializing Matrix(Diagonal(y)), which is waaaaay too big for your computer. But if we materialize Q eagerly with Matrix, it works using the b2 approach:

using LinearAlgebra
function test3(N, d, k)
    X = rand(N, d)
    Z = rand(N, k)
    y = rand(N)
    QX,RX = qr(X)
    QZ,RZ = qr(Z)

    # tagging the R's as UpperTriangular can save a little work, but will fail unless N>=d && N>=k
    # some of these operations can be done in-place to save a little memory
    b2 = vec(RZ \ (Matrix(QZ)' * Diagonal(y) * Matrix(QX)) / RX'); # Matrix to get the rectangular Q's instead of square
end
test3(100_000, 10, 100); # to ensure things are compiled
@time test3(100_000, 10, 100); # 0.25s on my machine
@time test3(100_000, 500, 100); # 2.75s on my machine

1 Like

Those speed-ups are very impressive! Unfortunately, I do not yet get the correct results using this method; perhaps there is a minor error. In an MWE:

using Random, Distributions, LinearAlgebra, IterativeSolvers, LinearMaps, BenchmarkTools, SparseArrays
Random.seed!(42)

function L(b::AbstractMatrix, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(b, X, Z)
    N = size(X,1)
    Y = zeros(promote_type(eltype(b), eltype(X), eltype(Z)), N)
    @inbounds for k in axes(X, 2), ℓ in axes(Z, 2);
        @simd for i in eachindex(Y)
            Y[i] += b[ℓ,k] * X[i,k] * Z[i,ℓ]
        end
    end
    return Y
end

function Lᵀ(Y::AbstractVector, X::AbstractMatrix, Z::AbstractMatrix)
    Base.require_one_based_indexing(Y, X, Z)
    n, m = size(X,2), size(Z,2)
    b = Matrix{promote_type(eltype(Y), eltype(X), eltype(Z))}(undef, m,n)
    @inbounds for k in axes(X, 2), ℓ in axes(Z, 2)
        b_ℓk = zero(eltype(Y))
        @simd for i in eachindex(Y)
            b_ℓk += Y[i] * X[i,k] * Z[i,ℓ]
        end
        b[ℓ,k] = b_ℓk 
    end
    return b
end

function test3(N, d, k)
    X = rand(N, d)
    Z = rand(N, k)

    XZ = Matrix{Float64}(undef, N, d*k)
    for di in 1:d
        XZ[:, 1 + (di - 1)*k:di*k] .= view(X, :, di) .* Z
    end

    y = XZ * rand(d*k) + randn(N)

    QX,RX = qr(X)
    QZ,RZ = qr(Z)

    # tagging the R's as UpperTriangular can save a little work, but will fail unless N>=d && N>=k
    # some of these operations can be done in-place to save a little memory
    b2 = vec(RZ \ (Matrix(QZ)' * Diagonal(y) * Matrix(QX)) / RX'); # Matrix to get the rectangular Q's instead of square
    display(mean((y - XZ * b2).^2))
    b3 = kron(X, Z) \ vec(Diagonal(y))
    display(mean((y - XZ * b3).^2))
    op = FunctionMap{Float64,false}(bvec -> L(reshape(bvec,k,d), X, Z), Y -> vec(Lᵀ(Y, X, Z)), N, k*d)
    b1 = lsqr(op, y)
    display(mean((y - XZ * b1).^2))
end

test3(100, 10, 10)

gives

147.52789836914693
147.5278983691469
0.018254857780580884

I think your approach should work perfectly for the Kronecker product, but I’m afraid my definition of XZ uses a face-splitting (row-wise Kronecker) product instead of the usual Kronecker product, which makes the current version incompatible …

If there is a way to get this to work, I’m very interested in this, as I can see how the potential speed gains are tremendous compared to an iterative procedure!

(I’m not set on the specific way of defining XZ, it should just compute element-wise all pair-wise products of columns in X and Z)

Shame. I was just starting from an earlier post suggesting that b = kron(X, Z) \ vec(Diagonal(Y)) would solve your problem. If that doesn’t work then I solved the wrong problem.

1 Like