In-place matrix operations slower?

lzxnl · July 11, 2023, 4:18am

I have the following two functions for linear regression.

function ordinary_least_squares_regression(x,y;lambda=0,weights=0)
    if weights == 0
        X = x' * x + lambda*I
        return X\(x'y)
    else
        w = spdiagm(weights)
        return ((x'*w*x + lambda*I))\(x'*w*y)
    end
end

function ordinary_least_squares_regression!(x,y,transpose_buffer,vector_buffer,matrix_buffer;lambda=0,weights=0)
    if typeof(weights) != Int
        for i in 1:size(x)[2]
            for j in 1:size(x)[1]
#Creates x' w matrix
                transpose_buffer[i,j] = x[j,i] * weights[j]
            end
        end
    else
#Stores x'
        transpose!(transpose_buffer,x)
    end
#Calculates x'y
    mul!(vector_buffer,transpose_buffer,y)
#Calculates x'x
    mul!(matrix_buffer,transpose_buffer,x)
    if lambda != 0
        for i in 1:size(matrix_buffer)[1]
#x'x + lambda I
            matrix_buffer[i,i] += lambda
        end
    end
    return matrix_buffer\vector_buffer
end

If I benchmark the two using the following buffers

x = randn(500000,8)
y = randn(500000)
mb = zeros(8,8)
vb = zeros(8)
tb = zeros(8,500000)
b = zeros(8)
w = collect(1:500000)
b1 = ordinary_least_squares_regression(x,y,weights=w)
b2 = ordinary_least_squares_regression!(x,y,tb,vb,mb,weights=w)

@btime for i in 1:20
    b1 = ordinary_least_squares_regression($x,$y,weights=$w)
end
@btime for i in 1:20 
    b2 = ordinary_least_squares_regression!($x,$y,$tb,$vb,$mb,weights=$w)
end

i.e., weighted linear regression, I get 887.315 ms (660 allocations: 1.42 GiB) 559.459 ms (60 allocations: 16.25 KiB); in-place linear regression is slightly faster. However, if you to the calculations with weights = 0 (i.e. no weights), then the result is 101.898 ms (120 allocations: 41.25 KiB) 405.156 ms (60 allocations: 16.25 KiB).

Question: why on earth is my in-place linear regression solver so much slower? Using a Cholesky decomposition doesn’t reduce allocations or memory usage, and the memory allocations are unavoidable as I’m solving a matrix equation. The two results agree to 1e-19 with or without weights, so I’m not doing anything wrong algorithmically.

(I have implemented other linear regression algorithms, such as coordinate descent and gradient descent, so for now I am solely interested in the relative slowness of this in-place implementation of the direct matrix solver)

DNF · July 11, 2023, 7:12am

First thing I would try is to ditch the transpose_buffer, and just use x' which is a lazy operation. mul!(out, x', x) never instantiates the transpose, but calls a specialized method of mul!

zdenek_hurak · July 11, 2023, 7:52am

Second thing, although not yet an answer to the question: replace spdiagm(weights) with Diagonal(weights). Running the modified code, I can see that it cuts the allocations and time for the nonallocated case by half. Diagonal type is even more efficient than general sparse matrix for diagonal matrices.

DNF · July 11, 2023, 8:08am

lzxnl:

@btime for i in 1:20
    b1 = ordinary_least_squares_regression($x,$y,weights=$w)
end
@btime for i in 1:20 
    b2 = ordinary_least_squares_regression!($x,$y,$tb,$vb,$mb,weights=$w)
end

BTW, there’s no point in wrapping your code in loops inside @btime. BenchmarkTools does its own looping, just leave the job to them.

zdenek_hurak · July 11, 2023, 9:23am

Note also that the convention in Julia is that functions with ! at the end of their name modify their inputs. They essentially rewrite some of the input arguments with their outputs. This is done with the motivation to save one more allocation. And this is not what is happening in your second function (the one with ! at the end of the name) – you are assigning the output to some temporary variable that must be allocated:

fatteneder · July 11, 2023, 11:41am

I slapped @profview from ProfileView onto the inplace version and found this:

Indeed, transpose! is dominating the flame graph.
Replacing

transpose!(transpose_buffer,x)

with

transpose_buffer = x'

gives me the following @btime results (I also removed the artificial for loop as suggested by @DNF)

julia> include("mwe_lzxnl.jl")
[ Info: v1_ordinary_least_squares_regression
  5.628 ms (32 allocations: 2.47 KiB)
[ Info: v2_ordinary_least_squares_regression
  5.627 ms (29 allocations: 1.22 KiB)

So at least on my machine and for this benchmark, the two versions are equivalent.

Here is the full script

using LinearAlgebra
using BenchmarkTools
using ProfileView
using SparseArrays

function v1_ordinary_least_squares_regression(x,y;lambda=0,weights=0)
    if weights == 0
        X = x' * x + lambda*I
        return X\(x'y)
    else
        w = spdiagm(weights)
        return ((x'*w*x + lambda*I))\(x'*w*y)
    end
end


function v2_ordinary_least_squares_regression!(x,y,transpose_buffer,vector_buffer,matrix_buffer;lambda=0,weights=0)
    if typeof(weights) != Int
        for i in 1:size(x)[2]
            for j in 1:size(x)[1]
                #Creates x' w matrix
                transpose_buffer[i,j] = x[j,i] * weights[j]
            end
        end
    else
        #Stores x'
        # transpose!(transpose_buffer,x)
        transpose_buffer = x'
    end
    #Calculates x'y
    mul!(vector_buffer,transpose_buffer,y)
    #Calculates x'x
    mul!(matrix_buffer,transpose_buffer,x)
    if lambda != 0
        for i in 1:size(matrix_buffer)[1]
            #x'x + lambda I
            matrix_buffer[i,i] += lambda
        end
    end
    return matrix_buffer\vector_buffer
end

let
    x = randn(500000,8)
    y = randn(500000)
    mb = zeros(8,8)
    vb = zeros(8)
    tb = zeros(8,500000)
    b = zeros(8)
    # w = collect(1:500000)
    w = 0
    b1 = v1_ordinary_least_squares_regression(x,y,weights=w)
    b2 = v2_ordinary_least_squares_regression!(x,y,tb,vb,mb,weights=w)

    @info "v1_ordinary_least_squares_regression"
    @btime begin
        v1_ordinary_least_squares_regression($x,$y,weights=$w)
    end
    @info "v2_ordinary_least_squares_regression"
    b2 = @btime begin
        v2_ordinary_least_squares_regression!($x,$y,$tb,$vb,$mb,weights=$w)
    end
    # @profview v2_ordinary_least_squares_regression!(x,y,tb,vb,mb,weights=w)
    # @code_warntype v2_ordinary_least_squares_regression!(x,y,tb,vb,mb,weights=w)
end

lzxnl · July 12, 2023, 1:07am

Thanks all for the replies! I implemented a diagonal method instead, and the non-allocating function is indeed faster. Good to know. Also it was good to know that it’s not necessary to store x'.

However, why is the in-place method not faster than the allocating method? Is the time taken by allocations swamped by the matrix inversion? That surprises me a little, to be honest.

bertschi · July 12, 2023, 6:56am

Probably not too surprising as the allocations are quite small with your example numbers, i.e., X being 8 \times 8 and x'y being just 8 elements. Thus – without benchmarking – I would guess the runtime is probably dominated by the multiplication x'y itself.

DNF · July 12, 2023, 7:54am

Can you post your updated code? If the only difference is mul! instead of *, then the in-place version should indeed be faster, even if only marginally.

lzxnl · July 14, 2023, 3:23am

All good now! In-place is now faster, though not by much, and I think it’s because everything else just takes so much time.

Topic		Replies	Views
Non-allocating matrix inversion Performance matrices	13	3575	April 23, 2022
Asymmetric speed of in-place `sparse*dense` matrix product General Usage	7	1531	November 8, 2018
Inverse with destination? General Usage	35	4244	May 19, 2021
Inplace multiplication by a square matrix General Usage	11	7512	February 17, 2017
Allocation when applying in place multiplication to matrix of SVector Performance question	2	192	March 15, 2023

In-place matrix operations slower?

Related topics