Speeding up matrix multiplication with a subset of matrix's rows

kernelmethod · July 24, 2019, 5:19am

Hi all,

Pretty straightforward – I’d like to write some code that performs a matvec between a matrix A and a vector x, except I know ahead of time that I don’t need all of the rows of A. In other words, instead of performing A*x, I only really need to compute A[rows,:] * x.

In practice, though, the former is much, much faster to compute than the latter:

julia> using BenchmarkTools

julia> W = randn(256, 256);

julia> rows = findall(randn(256) .≥ 0);  # random subset of ≈ 1/2 of the rows of W

julia> @benchmark W*x setup=(x=randn(256))
BenchmarkTools.Trial: 
  memory estimate:  2.13 KiB
  allocs estimate:  1
  --------------
  minimum time:     7.619 μs (0.00% GC)
  median time:      9.685 μs (0.00% GC)
  mean time:        17.447 μs (35.49% GC)
  maximum time:     62.014 ms (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark W[rows,:]*x setup=(x=randn(256))
BenchmarkTools.Trial: 
  memory estimate:  267.28 KiB
  allocs estimate:  5
  --------------
  minimum time:     86.810 μs (0.00% GC)
  median time:      98.906 μs (0.00% GC)
  mean time:        118.478 μs (9.77% GC)
  maximum time:     62.002 ms (99.51% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark view(W,rows,:)*x setup=(x=randn(256))  # Try using view to avoid copying data?
BenchmarkTools.Trial: 
  memory estimate:  1.25 KiB
  allocs estimate:  4
  --------------
  minimum time:     118.659 μs (0.00% GC)
  median time:      124.240 μs (0.00% GC)
  mean time:        127.680 μs (0.72% GC)
  maximum time:     9.343 ms (98.37% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark(  # Try @inbounds?
           begin
               @inbounds W[rows,:]*x
           end,
           setup=(x=randn(256))
       )
BenchmarkTools.Trial: 
  memory estimate:  271.28 KiB
  allocs estimate:  5
  --------------
  minimum time:     94.149 μs (0.00% GC)
  median time:      97.437 μs (0.00% GC)
  mean time:        115.970 μs (9.84% GC)
  maximum time:     61.165 ms (99.49% GC)
  --------------
  samples:          10000
  evals/sample:     1

Unfortunately, for my application I can’t simply store W[rows,:] in a variable since rows changes frequently.

I’ve tried other things to speed this computation up, e.g. storing V = Matrix(W') and performing V[:,rows]' * x instead of W[rows,:] * x so that I’m indexing into columns rather than rows, and this gives a decent speedup:

julia> V = Matrix(W')

julia> @benchmark W[rows,:]*x setup=(x=randn(256))
BenchmarkTools.Trial: 
  memory estimate:  271.28 KiB
  allocs estimate:  5
  --------------
  minimum time:     81.037 μs (0.00% GC)
  median time:      100.251 μs (0.00% GC)
  mean time:        117.961 μs (10.63% GC)
  maximum time:     62.720 ms (99.51% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark V[:,rows]'*x setup=(x=randn(256))
BenchmarkTools.Trial: 
  memory estimate:  271.34 KiB
  allocs estimate:  8
  --------------
  minimum time:     47.782 μs (0.00% GC)
  median time:      64.861 μs (0.00% GC)
  mean time:        82.015 μs (15.47% GC)
  maximum time:     62.494 ms (99.56% GC)
  --------------
  samples:          10000
  evals/sample:     1

Still, it’s way slower than computing W*x.

Does anybody have any suggestions for how to speed up this computation? I would greatly appreciate any input you can provide!

Thanks!

dkarrasch · July 24, 2019, 8:07am

The reason is that you’re including the runtime of the matrix creation W[rows,:] in the measurement. If you do

@benchmark $(W[rows, :]) * x setup=(x=randn(256))

then you get the runtime for pure multiplication, which is smaller than W * x. Anyway, in your code, you do have the matrix generation effort, so it’s probably relevant. In that case, as you have done, you should consider using views. Unfortunately, matrix multiplication falls back to the generic one, which is much slower, because view(W, rows, :) is not a StridedArray anymore, and matrix multiplication is no longer performed by BLAS. It seems like there is a trade-off between saving the allocation for W[rows,:] by views on the one hand, and the speed of matrix multiplication on the other hand.

dkarrasch · July 24, 2019, 8:16am

You were asking for suggestions to speed up the code, and I was talking blabla. Since you seem to have W in memory anyway, you may consider simply computing (W * x)[rows] (or a view on it), since I don’t think there is any way to make W[rows, :] strided and hence multiplication of the restriction be as fast. But when you have W in your memory, why generate a partial copy W[rows, :] to create a strided matrix again? The runtime of that is what you measure by @benchmark W[rows,:]*x setup=(x=randn(256)), so it’s a factor of more than 10. While it seems you’re doing lots of redundant operations with W * x, it’s just a factor of 2 (in your example), but notably performed by a very fast algorithm.

kernelmethod · July 24, 2019, 5:30pm

Thanks for the reply!

As you mentioned, I can’t really get around having to allocate (or get a view of) W[rows,:], since for my project rows is dependent on the vector x and hence is subject to change. So a more realistic benchmark might be

julia> @benchmark W[rows,:]*x setup=(x=randn(256); rows=findall(randn(256) .≥ 0))
BenchmarkTools.Trial: 
  memory estimate:  199.02 KiB
  allocs estimate:  5
  --------------
  minimum time:     75.053 μs (0.00% GC)
  median time:      92.853 μs (0.00% GC)
  mean time:        114.051 μs (13.20% GC)
  maximum time:     64.591 ms (99.54% GC)
  --------------
  samples:          10000
  evals/sample:     1

In any case, you can see why @benchmark $(W[rows,:])*x setup=(x=randn(256)) doesn’t really reflect the actual performance of my code.

If I understand correctly, it sounds like the benefits / drawbacks between my three options are:

Use (W * x)[rows], which gets all the benefits of BLAS but performs more computation than theoretically necessary;
use W[rows,:] * x, which also hits BLAS but requires us to first allocate W[rows,:]; and
use view(W, rows, :) * x, which avoids the memory allocation but no longer hits BLAS.

Although my example only dropped ~1/2 of the outputs of W*x, for my actual application I’m dropping closer to 80% - 90% of the rows. It turns out that dropping even 90% of the rows is still slower than BLAS; I only (barely) get a speedup if I store V = Matrix(W') first.

julia> rows = findall(rand(256) .≥ 0.90);  # Drop ~90% of the rows of W

julia> @benchmark W[rows,:]*x setup=(x=randn(256))
BenchmarkTools.Trial: 
  memory estimate:  70.50 KiB
  allocs estimate:  5
  --------------
  minimum time:     16.198 μs (0.00% GC)
  median time:      16.937 μs (0.00% GC)
  mean time:        23.670 μs (25.71% GC)
  maximum time:     38.846 ms (99.78% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark view(W, rows, :)*x setup=(x=randn(256))
BenchmarkTools.Trial: 
  memory estimate:  480 bytes
  allocs estimate:  4
  --------------
  minimum time:     33.946 μs (0.00% GC)
  median time:      34.193 μs (0.00% GC)
  mean time:        35.798 μs (2.73% GC)
  maximum time:     9.864 ms (99.23% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> V = Matrix(W');

julia> @benchmark V[:,rows]'*x setup=(x=randn(256))
BenchmarkTools.Trial: 
  memory estimate:  70.56 KiB
  allocs estimate:  8
  --------------
  minimum time:     8.941 μs (0.00% GC)
  median time:      9.559 μs (0.00% GC)
  mean time:        16.920 μs (36.62% GC)
  maximum time:     39.514 ms (99.84% GC)
  --------------
  samples:          10000
  evals/sample:     1

At this point, I might just try (ugh) writing my own function in C that avoids the allocation of W[rows,:], but is hopefully still faster than view(W,rows,:) * x to perform this matvec. Even though it won’t be as performant as BLAS, the number of rows that I’m dropping combined with the reduced memory allocation might be sufficient to beat the examples above.

stevengj · July 24, 2019, 6:02pm

C has no advantage over pure Julia code for this, so you can write your optimized submatrix multiply in Julia.

For example, here are some benchmarks and discussions of optimizing matrix-matrix multiplications from my course. Among other things, it compares a naive 3-loop implementation in Julia and C and shows that the speed is almost identical:

https://nbviewer.jupyter.org/github/mitmath/18335/blob/master/notes/Memory-and-Matrices.ipynb

dkarrasch · July 24, 2019, 6:35pm

I guess you should include this into your runtime measurement, if that’s going to be part of your code. Then I’m no longer sure if this option is advantageous anymore.

kernelmethod · July 24, 2019, 7:49pm

Heh, good point – for those of us recovering from Python, it’s easy to forget that iteration in Julia is roughly as fast as iteration in C.

Considering that, this version seems to work quite a bit better:

using BenchmarkTools

# Compute W*x on a subset of the rows of W
function partial_matvec(W::AbstractMatrix{T}, x, rows) where {T}
	output = zeros(T, length(x))
	for jj = 1:length(x)
		for ii in rows
			@inbounds output[ii] += W[ii,jj] * x[jj]
		end
	end
	return output
end

W = randn(256, 256);

# Check that partial_matvec works correctly
x = randn(256)
rows = findall(randn(256) .≥ 0)
@assert all((W*x)[rows] .≈ partial_matvec(W, x, rows)[rows])

setup=:(x=randn(256); rows=findall(randn(256) .≥ 0))
display(@benchmark W*x setup=$(setup))
println()
display(@benchmark partial_matvec(W,x,rows) setup=$(setup))
println()

Output:

BenchmarkTools.Trial: 
  memory estimate:  3.02 KiB
  allocs estimate:  3
  --------------
  minimum time:     8.164 μs (0.00% GC)
  median time:      10.406 μs (0.00% GC)
  mean time:        13.987 μs (7.65% GC)
  maximum time:     10.760 ms (99.48% GC)
  --------------
  samples:          10000
  evals/sample:     1

BenchmarkTools.Trial: 
  memory estimate:  2.13 KiB
  allocs estimate:  1
  --------------
  minimum time:     27.727 μs (0.00% GC)
  median time:      35.975 μs (0.00% GC)
  mean time:        43.615 μs (0.00% GC)
  maximum time:     153.929 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Still slower than W*x, but it’s an improvement – I’ll have to do some research into optimizing these handwritten matrix multiplication routines. The notebook you linked looks like a good place to start, so thank you for providing that.

Well, my thinking was that if I need to compute

W[rows1,:]*x1
W[rows2,:]*x2
...
W[rowsn,:]*xn

then before I do that, I can run V = Matrix(W')', which ends up being a one-time cost and hence not something I really need to include in the benchmark. But in the handwritten multiply I provided in response to @stevengj it’s not relevant since I’m now accessing W in column order.

Sean_McBane · July 24, 2019, 7:56pm

I suspect you could speed this up further by doing what you suggested with a one-time V = Matrix(W')' and going in row major order, since then each entry is just an inner product of the row with x and all of the entries in the row are adjacent. Skipping over lots of entries in each column is probably the biggest slowdown in the version you’ve shown but I can’t think of a better way to do it in column-major off the top of my head.

StefanKarpinski · July 24, 2019, 8:13pm

Is there any pattern within or across the sets of rows that you need to select?

under-Peter · July 24, 2019, 9:09pm

Here are my attempts:

using LinearAlgebra

function foo(w,v,rows)
    out = zeros(eltype(v), length(rows))
    tmp = similar(w, length(rows))
    for j in axes(w,2)
        for (i,r) in enumerate(rows)
            @inbounds tmp[i] = w[r,j]
        end
        @inbounds axpy!(v[j], tmp, out)
    end
    return out
end

function bar(wt,v,rows)
    out = similar(v, length(rows))
    tmp = similar(v)
    @inbounds for (i,r) in enumerate(rows)
        copyto!(tmp,1,wt,(r-1)*size(wt,1)+1,length(tmp))
        out[i] = dot(tmp, v)
    end
    return out
end

They might be useful. Bar is faster provided that the matrix transpose is given (upfront cost) while foo has a smaller memory footprint since it acts on vectors of length of rows.

Hope that helps, in my timings foo is faster than the regular w*v if rows is short and bar uses way less memory. Ymmv

kernelmethod · July 24, 2019, 9:24pm

I’ve tried this in a couple of ways, so far it doesn’t help much (if at all). E.g. here’s one example (I don’t try to vectorize the inner loop using something like dot because I haven’t found an efficient way to use it that doesn’t require allocating memory):

julia> function partial_matvec(W::AbstractMatrix{T}, x, rows) where {T}
           output = zeros(T, size(W,1))
           for ii in rows
               for jj = 1:length(x)
                   @inbounds output[ii] += W[ii,jj] * x[jj]
               end
           end
           return output
       end
partial_matvec (generic function with 1 method)

julia> V = Matrix(W')';

julia> @benchmark partial_matvec(V, x, rows) setup=(x=randn(256); rows=findall(randn(256) .≥ 0))
BenchmarkTools.Trial: 
  memory estimate:  2.13 KiB
  allocs estimate:  1
  --------------
  minimum time:     30.741 μs (0.00% GC)
  median time:      41.568 μs (0.00% GC)
  mean time:        41.624 μs (0.00% GC)
  maximum time:     130.366 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Unfortunately not, the rows are more or less sampled randomly.

kernelmethod · July 24, 2019, 9:45pm

Nice, thank you! bar is definitely the fastest one compared to W*x that I’ve tried so far. I’ll need to experiment with this a little more to see how much time I can shave off.

kernelmethod · July 24, 2019, 11:23pm

Okay, I think I’ve got it. I took @under-Peter’s idea and substituted the memory copy + vectorization with dot with a hand-written dot product implementation:

using LinearAlgebra

# Disable BLAS multithreading for a fairer comparison
BLAS.set_num_threads(1)

"""
Compute a matrix-vector product `W[rows,:] * x`. Assume that `Wt == W'`.
"""
function partial_matvec(Wt::AbstractMatrix{T}, x, rows) where {T}
	output = zeros(T, length(rows))

	@inbounds for ii = 1:length(rows)
		row = rows[ii]

		# Compute dot product of the current row of W with x
		result = T(0)
		@simd for col = 1:length(x)
			result += Wt[col,row] * x[col]
		end

		output[ii] = result
	end

	return output
end

# Check that partial_matvec and bar work correctly
W = randn(256, 256);
Wt = Matrix(W');
x = randn(256);
rows = findall(rand(256) .≥ 0.5)
@assert (W*x)[rows] ≈ partial_matvec(Wt,x,rows)

Here are results for a 256\times 256 matrix, dropping \approx 1/2 of the rows of W:

julia> W = randn(256, 256); Wt = Matrix(W');

julia> @benchmark W*x setup=(x=randn(256))
BenchmarkTools.Trial: 
  memory estimate:  2.13 KiB
  allocs estimate:  1
  --------------
  minimum time:     10.726 μs (0.00% GC)
  median time:      11.018 μs (0.00% GC)
  mean time:        14.682 μs (24.39% GC)
  maximum time:     35.868 ms (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark partial_matvec(Wt, x, rows) setup=(x=randn(256); rows=findall(rand(256) .≥ 0.5))
BenchmarkTools.Trial: 
  memory estimate:  896 bytes
  allocs estimate:  1
  --------------
  minimum time:     4.155 μs (0.00% GC)
  median time:      5.827 μs (0.00% GC)
  mean time:        5.858 μs (0.00% GC)
  maximum time:     11.197 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

And here’s output from a run with a 1024\times 1024 matrix:

julia> W = randn(1024, 1024); Wt = Matrix(W');

julia> @benchmark W*x setup=(x=randn(1024))
BenchmarkTools.Trial: 
  memory estimate:  8.13 KiB
  allocs estimate:  1
  --------------
  minimum time:     303.350 μs (0.00% GC)
  median time:      314.284 μs (0.00% GC)
  mean time:        318.422 μs (0.02% GC)
  maximum time:     894.475 μs (60.18% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark partial_matvec(Wt,x,rows) setup=(x=randn(1024);rows=findall(rand(1024) .≥ 0.5))
BenchmarkTools.Trial: 
  memory estimate:  3.69 KiB
  allocs estimate:  1
  --------------
  minimum time:     172.821 μs (0.00% GC)
  median time:      194.660 μs (0.00% GC)
  mean time:        195.771 μs (0.03% GC)
  maximum time:     974.548 μs (63.24% GC)
  --------------
  samples:          10000
  evals/sample:     1

I ran into all kinds of interesting problems with this one, e.g. if I set result = 0 instead of result = T(0), performance became roughly 4x to 8x worse. Ditto if I tried to replace updates to result in the inner loop by updates to output[ii], e.g. output[ii] += Wt[col,row] * x[col].

Regardless, it seems like it’s working now. Thank you to everyone who contributed, I appreciate the help!

Topic		Replies	Views
Simple Mat-Vec multiply (understanding performance, without the bugs) Performance tullio	16	3269	August 12, 2020
Submatrix multiply Performance	7	973	February 29, 2020
Matrix-vector multiplication slower than a 'naive' for loop? Performance vector	7	1628	July 30, 2020
Multiplication of vector of matrices and vector of vectors Performance linearalgebra , tullio	3	413	November 7, 2022
Performance of sparse matrix multiplied with a reshaped view New to Julia	2	383	August 13, 2020

Speeding up matrix multiplication with a subset of matrix's rows

Related topics