Efficiently compute X[sub,:]*X[sub,:]'

X is a matrix and sub is a subset of columns. I’m trying to find the most efficient way to compute X’X restricted to those columns; in the code below I work with the transpose Xt, so this amounts to Xt[sub, :] * Xt[sub, :]'. What I find is that first allocating the copy Xt[sub, :] is significantly faster than using view(Xt, sub, :), but the extra memory allocation seems wasteful. Is there a better way to do this? I can preallocate a buffer array, but it doesn’t seem like that should be necessary.

using BenchmarkTools
using LinearAlgebra
using StatsBase: sample

function XXtview(Xt::Matrix{Float64}, sub::Vector{Int})
    Xt_sub = view(Xt, sub, :)
    XXt = Xt_sub * Xt_sub'
    return XXt
end

function XXtcopy(Xt::Matrix{Float64}, sub::Vector{Int})
    Xt_sub = Xt[sub, :]
    XXt = Xt_sub * Xt_sub'
    return XXt
end

function XXtbuffer(Xt::Matrix{Float64}, sub::Vector{Int}, Xt_sub::Matrix{Float64})
    copyto!(Xt_sub, @view(Xt[sub, :]))
    XXt = Xt_sub * Xt_sub'
    return XXt
end

n, p = 1000, 500
nsub = 10
Xt = randn(p, n)
sub = sample(1:p, nsub)
Xt_sub = randn(nsub, n)
@assert all(XXtcopy(Xt, sub) .≈ XXtview(Xt, sub))
@assert all(XXtcopy(Xt, sub) .≈ XXtbuffer(Xt, sub, Xt_sub))

@benchmark XXtview($Xt, $sub)
BenchmarkTools.Trial: 
  memory estimate:  1.31 KiB
  allocs estimate:  10
  --------------
  minimum time:     99.061 μs (0.00% GC)
  median time:      99.882 μs (0.00% GC)
  mean time:        101.643 μs (0.00% GC)
  maximum time:     210.991 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
@benchmark XXtcopy($Xt, $sub)
BenchmarkTools.Trial: 
  memory estimate:  79.14 KiB
  allocs estimate:  5
  --------------
  minimum time:     40.986 μs (0.00% GC)
  median time:      47.674 μs (0.00% GC)
  mean time:        61.212 μs (22.19% GC)
  maximum time:     54.852 ms (99.72% GC)
  --------------
  samples:          10000
  evals/sample:     1
@benchmark XXtbuffer($Xt, $sub, $Xt_sub)
BenchmarkTools.Trial: 
  memory estimate:  1008 bytes
  allocs estimate:  4
  --------------
  minimum time:     44.912 μs (0.00% GC)
  median time:      45.849 μs (0.00% GC)
  mean time:        49.861 μs (0.00% GC)
  maximum time:     284.885 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In a view like @view Xt[sample(1:end, n), :], you’re rearranging the rows such that consecutive indices are no longer laid out consecutively (or even consistently) in memory. That alone has a big effect on the speed of iterating over the matrix, and an even bigger effect on multiplication: BLAS’ highly optimized matrix multiplication routines are written for contiguous, strided memory, so a view indexed by an arbitrary integer vector can’t be handed to them and the multiply falls back to a slower generic implementation. So, yeah, making a copy (or copying into a buffer) such that things are in order is definitely going to give you the best performance.
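If I’m reading the dispatch right, you can see this directly: a view indexed by a vector of integers is not a StridedMatrix, while the dense copy is. A quick check, reusing Xt and sub from your setup (just a sketch):

v = view(Xt, sub, :)
v isa StridedMatrix           # false: the integer-vector index breaks the strided layout
Xt[sub, :] isa StridedMatrix  # true: a plain dense Matrix, so * / mul! can call BLAS gemm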

You can also look into mul! to write the result into a preallocated output.
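For example, combining your buffer version with mul! should avoid allocating anything per call. An untested sketch (XXtbuffer! is just an illustrative name; it reuses Xt_sub and a preallocated XXt alongside your setup above):

using LinearAlgebra

function XXtbuffer!(XXt::Matrix{Float64}, Xt_sub::Matrix{Float64},
                    Xt::Matrix{Float64}, sub::Vector{Int})
    copyto!(Xt_sub, @view(Xt[sub, :]))  # gather the selected rows into contiguous memory
    mul!(XXt, Xt_sub, Xt_sub')          # multiply into the preallocated output
    return XXt
end

XXt = Matrix{Float64}(undef, nsub, nsub)
XXtbuffer!(XXt, Xt_sub, Xt, sub)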
