Submatrix multiply

hytonwons · February 28, 2020, 11:55pm

Hi, my algorithm involves a lot of submatrix multiplications. I was inspired to try views, but the speedup from doing so is not that satisfying. Below is some sample code:

If we just try slicing the matrix, the speedup is pretty significant:

julia> A = randn(10000,10000);
julia> @time @view A[:1:5000];
  0.000004 seconds (8 allocations: 320 bytes)

julia> @time A[:1:5000];
  0.013570 seconds (22.51 k allocations: 1.085 MiB)

However, with the same dimensions, if we do a submatrix multiply, I get this:

julia> B = randn(5000,5000);

julia> @time A[:,1:5000] =  A[:, 1:5000]*B;
  1.834928 seconds (14 allocations: 762.940 MiB)

julia> @time @views A[:,1:5000] =  A[:, 1:5000]*B;
  1.667780 seconds (16 allocations: 381.470 MiB, 1.64% gc time)

which is not so different. Further, if we increase the size of the matrix,

julia> A = randn(50000,50000);

julia> B = randn(10000,10000);

julia> @time A[:,1:10000] = A[:,1:10000]*B;
 82.754603 seconds (14 allocations: 7.451 GiB, 1.46% gc time)

julia> @time @views A[:,1:10000] = A[:,1:10000]*B;
 56.483068 seconds (16 allocations: 3.725 GiB, 1.44% gc time)

the speedup is a bit better.

I’m wondering if this speedup looks normal? In general, what level of speedup should I expect?
Any help is appreciated!

Oscar_Smith · February 29, 2020, 12:12am

You will almost never get good results by making a view and then using it to do unaligned array accesses. This is because in these cases, all of your accesses are cache misses, so it ends up being as slow as just copying the data.

baggepinnen · February 29, 2020, 12:45am

There are problems with the way you benchmark. Have a look at BenchmarkTools.jl to make sure you draw the correct inferences about timings.

hytonwons · February 29, 2020, 2:16am

Why is it unaligned? I thought Julia matrices are stored in column-major order, so aren't the sub-columns stored consecutively in memory or cache?
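For reference, the layout being asked about can be checked directly on a small matrix. This is only a sketch: the 4×5 size and the names `A` and `V` are made up for illustration, and it says nothing about timings.

```julia
# Small 4×5 matrix filled column by column, so the stored order is 1.0, 2.0, ...
A = reshape(collect(1.0:20.0), 4, 5)

# View of the first three full columns (no data is copied)
V = view(A, :, 1:3)

strides(V)               # (1, 4): unit stride down a column, 4 between columns
vec(A)[1:12] == vec(V)   # true: the view covers the first 12 stored elements
```

A view of a range of full columns therefore maps onto one consecutive block of the parent's storage.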

hytonwons · February 29, 2020, 2:20am

But @time is natively supported by Julia, I didn’t use BenchmarkTools at all.

baggepinnen · February 29, 2020, 2:32am

Indeed it is, but you will draw false conclusions if you use it the way you did. For instance, the variables you use are global, which is a huge detriment to performance when benchmarked with @time.
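A minimal sketch of one way around both issues without any extra package, assuming a hypothetical helper `submul` and small made-up sizes: put the work in a function so the arrays are local arguments, call it once so compilation is not timed, then time a second call. With BenchmarkTools the rough equivalent is to interpolate the globals with `$`.

```julia
using LinearAlgebra

# Hypothetical helper: the arrays arrive as function arguments, so the compiler
# knows their types, unlike untyped globals used at the REPL prompt.
function submul(A, B, n)
    return @views A[:, 1:n] * B
end

A = randn(200, 200)
B = randn(100, 100)

submul(A, B, 100)                # first call includes compilation time
t = @elapsed submul(A, B, 100)   # second call measures run time only
```

A single `@elapsed` sample is still noisy, which is the part BenchmarkTools handles by taking many samples.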

Elrod · February 29, 2020, 5:17am

I don’t think it’ll make much difference at the scale of 5000 x 5000 matrices.

That said, I see a pretty notable difference between @benchmark mul!($C, $A, $B) and A * B, and a very large difference with and without views. The views are almost as fast as not slicing.
What BenchmarkTools helps with is noise, and not timing compilation.
My results:

julia> using BenchmarkTools, LinearAlgebra

julia> A = rand(5000, 5000); B = rand(5000, 5000); C = similar(A);

julia> @time A * B;
  0.152603 seconds (2 allocations: 190.735 MiB, 6.12% gc time)

julia> @time A * B;
  0.144549 seconds (2 allocations: 190.735 MiB)

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     123.864 ms (0.00% GC)
  median time:      125.252 ms (0.00% GC)
  mean time:        125.388 ms (0.00% GC)
  maximum time:     127.998 ms (0.00% GC)
  --------------
  samples:          40
  evals/sample:     1

julia> @time A * B[:,1:5000];
  0.233827 seconds (162.54 k allocations: 389.686 MiB)

julia> @time A * B[:,1:5000];
  0.211714 seconds (6 allocations: 381.470 MiB, 7.11% gc time)

julia> @time @views A * B[:,1:5000];
  0.589579 seconds (2.65 M allocations: 308.709 MiB, 2.86% gc time)

julia> @time @views A * B[:,1:5000];
  0.145052 seconds (8 allocations: 190.735 MiB)

julia> @time A * B[:,1:5000];
  0.203089 seconds (6 allocations: 381.470 MiB)

julia> @time @views A * B[:,1:5000];
  0.151660 seconds (8 allocations: 190.735 MiB, 7.51% gc time)

FWIW, matrix multiplication is O(N^3), copying memory from the slices is O(N^2), and D dynamic dispatches is O(D) (but with a much heftier coefficient).
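Applied to the update in the original post, the same idea might look like the sketch below. The helper name `submatmul!`, the buffer `C`, and the sizes are assumptions for illustration. The separate buffer is there because `mul!` should not write into an array that aliases one of its inputs.

```julia
using LinearAlgebra

# Hypothetical helper: overwrite the first n columns of A with A[:, 1:n] * B,
# using a caller-supplied buffer C of size (size(A, 1), n).
function submatmul!(A, B, C, n)
    Asub = view(A, :, 1:n)   # view of the slice, no copy
    mul!(C, Asub, B)         # C = Asub * B, written into the preallocated buffer
    Asub .= C                # copy the product back into A's first n columns
    return A
end

A = randn(400, 400)
B = randn(200, 200)
C = Matrix{Float64}(undef, 400, 200)   # reusable across repeated calls

expected = A[:, 1:200] * B   # reference result, computed before A is modified
submatmul!(A, B, C, 200)
```

If the update runs many times, reusing `C` avoids allocating a new product matrix on every call. The copy back into `A` is still O(N^2), so the multiplication itself remains the dominant cost.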

baggepinnen · February 29, 2020, 7:14am

No, you are of course right. I based my comment on the first timing in the OP which maybe was not super relevant for the rest of the post.
