# Efficiently compute X[sub,:]*X[sub,:]'

X is a matrix and sub is a subset of columns. I’m trying to find the most efficient way to compute X’X for the submatrix. What I find is that first allocating X[sub,:] is significantly faster than using `view(X,sub,:)`, but the extra memory allocation is wasteful. Is there a better way to do this? I can preallocate a buffer array but it doesn’t seem like that should be necessary.

``````using BenchmarkTools
using LinearAlgebra
using StatsBase: sample

function XXtview(Xt::Matrix{Float64}, sub::Vector{Int})
Xt_sub = view(Xt, sub, :)
XXt = Xt_sub * Xt_sub'
return XXt
end

function XXtcopy(Xt::Matrix{Float64}, sub::Vector{Int})
Xt_sub = Xt[sub, :]
XXt = Xt_sub * Xt_sub'
return XXt
end

function XXtbuffer(Xt::Matrix{Float64}, sub::Vector{Int}, Xt_sub::Matrix{Float64})
copyto!(Xt_sub, @view(Xt[sub, :]))
XXt = Xt_sub * Xt_sub'
return XXt
end

n, p = 1000, 500
nsub = 10
Xt = randn(p, n)
sub = sample(1:p, nsub)
Xt_sub = randn(nsub, n)
@assert all(XXtcopy(Xt, sub) .≈ XXtview(Xt, sub))
@assert all(XXtcopy(Xt, sub) .≈ XXtbuffer(Xt, sub, Xt_sub))

@benchmark XXtview(\$Xt, \$sub)
``````
``````BenchmarkTools.Trial:
memory estimate:  1.31 KiB
allocs estimate:  10
--------------
minimum time:     99.061 μs (0.00% GC)
median time:      99.882 μs (0.00% GC)
mean time:        101.643 μs (0.00% GC)
maximum time:     210.991 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1
``````
``````@benchmark XXtcopy(\$Xt, \$sub)
``````
``````BenchmarkTools.Trial:
memory estimate:  79.14 KiB
allocs estimate:  5
--------------
minimum time:     40.986 μs (0.00% GC)
median time:      47.674 μs (0.00% GC)
mean time:        61.212 μs (22.19% GC)
maximum time:     54.852 ms (99.72% GC)
--------------
samples:          10000
evals/sample:     1
``````
``````@benchmark XXtbuffer(\$Xt, \$sub, \$Xt_sub)
``````
``````BenchmarkTools.Trial:
memory estimate:  1008 bytes
allocs estimate:  4
--------------
minimum time:     44.912 μs (0.00% GC)
median time:      45.849 μs (0.00% GC)
mean time:        49.861 μs (0.00% GC)
maximum time:     284.885 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1
``````

In a view like `@view Xt[sample(1:end, n), :]`, you’re rearranging the rows such that consecutive indices are no longer laid out consecutively (or even consistently) in memory. This has a huge effect on the speed of iterating over the matrix — and an even bigger effect when you can potentially call BLAS’ super-machine-optimized matrix multiplication routines that are tuned for consecutive memory accesses. So, yeah, making a copy (or copying into a buffer) such that things are in order is definitely going to give you the best performance.

You can also look into `mul!` to preallocate the output, too.

5 Likes