In-place matrix operations slower?

I have the following two functions for linear regression.

using LinearAlgebra, SparseArrays

function ordinary_least_squares_regression(x,y;lambda=0,weights=0)
    if weights == 0
        X = x' * x + lambda*I
        return X\(x'y)
    else
        w = spdiagm(weights)
        return (x'*w*x + lambda*I)\(x'*w*y)
    end
end
function ordinary_least_squares_regression!(x,y,transpose_buffer,vector_buffer,matrix_buffer;lambda=0,weights=0)
    if typeof(weights) != Int
        for i in 1:size(x, 2)
            for j in 1:size(x, 1)
                # Builds the x'*w matrix
                transpose_buffer[i,j] = x[j,i] * weights[j]
            end
        end
    else
        # Stores x'
        transpose!(transpose_buffer,x)
    end
    # Calculates x'*y
    mul!(vector_buffer,transpose_buffer,y)
    # Calculates x'*x
    mul!(matrix_buffer,transpose_buffer,x)
    if lambda != 0
        for i in 1:size(matrix_buffer, 1)
            # x'*x + lambda*I
            matrix_buffer[i,i] += lambda
        end
    end
    return matrix_buffer\vector_buffer
end

If I benchmark the two using the following data and buffers

x = randn(500000,8)
y = randn(500000)
mb = zeros(8,8)
vb = zeros(8)
tb = zeros(8,500000)
b = zeros(8)
w = collect(1:500000)
b1 = ordinary_least_squares_regression(x,y,weights=w)
b2 = ordinary_least_squares_regression!(x,y,tb,vb,mb,weights=w)

@btime for i in 1:20
    b1 = ordinary_least_squares_regression($x,$y,weights=$w)
end
@btime for i in 1:20 
    b2 = ordinary_least_squares_regression!($x,$y,$tb,$vb,$mb,weights=$w)
end

i.e., weighted linear regression, I get 887.315 ms (660 allocations: 1.42 GiB) for the allocating version versus 559.459 ms (60 allocations: 16.25 KiB) for the in-place version, so the in-place version is faster. However, if you do the calculations with weights = 0 (i.e. no weights), then the results are 101.898 ms (120 allocations: 41.25 KiB) versus 405.156 ms (60 allocations: 16.25 KiB).

Question: why on earth is my in-place linear regression solver so much slower in the unweighted case? Using a Cholesky decomposition doesn’t reduce allocations or memory usage, and the remaining allocations seem unavoidable, as I’m solving a matrix equation. The two results agree to within 1e-19 with or without weights, so I’m not doing anything wrong algorithmically.

(I have implemented other linear regression algorithms, such as coordinate descent and gradient descent, so for now I am solely interested in the relative slowness of this in-place implementation of the direct matrix solver)


The first thing I would try is to ditch transpose_buffer and just use x', which is a lazy operation. mul!(out, x', x) never instantiates the transpose; it calls a specialized method of mul!.
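For example, a minimal sketch (the sizes here are just illustrative, not from the original post):

using LinearAlgebra

x = randn(1000, 8)
out = zeros(8, 8)
xt = x'           # lazy Adjoint wrapper around x; no copy is made
mul!(out, x', x)  # dispatches to a specialized method; no 8×1000 buffer needed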


Second thing, although not yet an answer to the question: replace spdiagm(weights) with Diagonal(weights). Running the modified code, I can see that it cuts the allocations and time for the non-preallocated version by half. The Diagonal type is even more efficient than a general sparse matrix for diagonal matrices.
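Roughly the difference (a small sketch; the size is illustrative):

using LinearAlgebra, SparseArrays

w = rand(500_000)
D = Diagonal(w)  # stores only the diagonal: O(n) memory, specialized multiply methods
S = spdiagm(w)   # general SparseMatrixCSC: also stores column pointers and row indices

A product like x' * Diagonal(w) * x then hits the specialized Diagonal methods instead of the generic sparse path.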


BTW, there’s no point in wrapping your code in a loop inside @btime. BenchmarkTools does its own looping; just leave that job to it.
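That is, interpolate the arguments and benchmark the single call:

@btime ordinary_least_squares_regression($x, $y; weights=$w)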

Note also that the convention in Julia is that functions with ! at the end of their name modify their inputs: they overwrite some of the input arguments with their outputs, with the motivation of saving one more allocation. That is not quite what is happening in your second function (the one with ! at the end of the name) – the final matrix_buffer \ vector_buffer still allocates a fresh result vector.
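If you want it fully in-place, you could overwrite the buffers with the factorization and the solution, e.g. replacing the final matrix_buffer \ vector_buffer with something like this (an untested sketch; it assumes the normal-equations matrix is positive definite, which holds for full-rank x and lambda >= 0):

F = cholesky!(Symmetric(matrix_buffer))  # LinearAlgebra: factorizes in place, reusing matrix_buffer's storage
ldiv!(F, vector_buffer)                  # overwrites vector_buffer with the solution
return vector_buffer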

I slapped @profview from ProfileView onto the in-place version. Indeed, transpose! is dominating the flame graph.
Replacing

transpose!(transpose_buffer,x)

with

transpose_buffer = x'

gives me the following @btime results (I also removed the artificial for loop as suggested by @DNF)

julia> include("mwe_lzxnl.jl")
[ Info: v1_ordinary_least_squares_regression
  5.628 ms (32 allocations: 2.47 KiB)
[ Info: v2_ordinary_least_squares_regression
  5.627 ms (29 allocations: 1.22 KiB)

So at least on my machine and for this benchmark, the two versions are equivalent.


Here is the full script:
using LinearAlgebra
using BenchmarkTools
using ProfileView
using SparseArrays

function v1_ordinary_least_squares_regression(x,y;lambda=0,weights=0)
    if weights == 0
        X = x' * x + lambda*I
        return X\(x'y)
    else
        w = spdiagm(weights)
        return ((x'*w*x + lambda*I))\(x'*w*y)
    end
end


function v2_ordinary_least_squares_regression!(x,y,transpose_buffer,vector_buffer,matrix_buffer;lambda=0,weights=0)
    if typeof(weights) != Int
        for i in 1:size(x)[2]
            for j in 1:size(x)[1]
                #Creates x' w matrix
                transpose_buffer[i,j] = x[j,i] * weights[j]
            end
        end
    else
        #Stores x'
        # transpose!(transpose_buffer,x)
        transpose_buffer = x'
    end
    #Calculates x'y
    mul!(vector_buffer,transpose_buffer,y)
    #Calculates x'x
    mul!(matrix_buffer,transpose_buffer,x)
    if lambda != 0
        for i in 1:size(matrix_buffer)[1]
            #x'x + lambda I
            matrix_buffer[i,i] += lambda
        end
    end
    return matrix_buffer\vector_buffer
end

let
    x = randn(500000,8)
    y = randn(500000)
    mb = zeros(8,8)
    vb = zeros(8)
    tb = zeros(8,500000)
    b = zeros(8)
    # w = collect(1:500000)
    w = 0
    b1 = v1_ordinary_least_squares_regression(x,y,weights=w)
    b2 = v2_ordinary_least_squares_regression!(x,y,tb,vb,mb,weights=w)

    @info "v1_ordinary_least_squares_regression"
    @btime begin
        v1_ordinary_least_squares_regression($x,$y,weights=$w)
    end
    @info "v2_ordinary_least_squares_regression"
    b2 = @btime begin
        v2_ordinary_least_squares_regression!($x,$y,$tb,$vb,$mb,weights=$w)
    end
    # @profview v2_ordinary_least_squares_regression!(x,y,tb,vb,mb,weights=w)
    # @code_warntype v2_ordinary_least_squares_regression!(x,y,tb,vb,mb,weights=w)
end

Thanks all for the replies! I switched to the Diagonal approach, and the non-allocating function is indeed faster now. Good to know. It was also good to know that it’s not necessary to store x'.

However, why is the in-place method not faster than the allocating method? Is the time taken by allocations swamped by the matrix inversion? That surprises me a little, to be honest.

Probably not too surprising, as the allocations are quite small with your example numbers, i.e., X being 8×8 and x'y being just 8 elements. Thus – without benchmarking – I would guess the runtime is dominated by the multiplication x'y itself.
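As a rough back-of-the-envelope estimate (just flop and byte counting, not a benchmark):

n, p = 500_000, 8
flops_xtx   = 2n * p^2       # forming x'x: ≈ 64 Mflop
flops_xty   = 2n * p         # forming x'y: ≈ 8 Mflop
flops_solve = 2 * p^3 ÷ 3    # 8×8 Cholesky solve: ≈ 341 flop, negligible
bytes_out   = (p^2 + p) * 8  # allocating the 8×8 result and the 8-vector: 576 bytes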

Can you post your updated code? If the only difference is mul! instead of *, then the in-place version should indeed be faster, even if only marginally.

All good now! In-place is now faster, though not by much, and I think that’s because everything else just takes so much time.