Submatrix multiply

Hi, my algorithm involves a lot of submatrix multiplications. I was inspired by the use of views, but the speedup from doing so is not that satisfying. Below is some sample code.

If we just try slicing the matrix, the speedup is pretty significant:

julia> A = randn(10000,10000);
julia> @time @view A[1:5000];
  0.000004 seconds (8 allocations: 320 bytes)

julia> @time A[1:5000];
  0.013570 seconds (22.51 k allocations: 1.085 MiB)

However, at the same dimensions, if we do a submatrix multiply, I get this:

julia> B = randn(5000,5000);

julia> @time A[:,1:5000] =  A[:, 1:5000]*B;
  1.834928 seconds (14 allocations: 762.940 MiB)

julia> @time @views A[:,1:5000] =  A[:, 1:5000]*B;
  1.667780 seconds (16 allocations: 381.470 MiB, 1.64% gc time)

which is not so different. Further, if we increase the size of the matrix,

julia> A = randn(50000,50000);

julia> B = randn(10000,10000);

julia> @time A[:,1:10000] = A[:,1:10000]*B;
 82.754603 seconds (14 allocations: 7.451 GiB, 1.46% gc time)

julia> @time @views A[:,1:10000] = A[:,1:10000]*B;
 56.483068 seconds (16 allocations: 3.725 GiB, 1.44% gc time)

the speedup is a bit better.

I’m wondering if this speedup looks normal? In general, what level of speedup should I expect?
Any help is appreciated!

You will almost never get good results by making a view and then using it for unaligned array accesses. This is because in these cases all of your accesses are cache misses, so it ends up being about as slow as just copying the data.

There are problems with the way you benchmark; have a look at BenchmarkTools.jl to make sure you draw the correct inferences from your timings.
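For example, a minimal sketch of what that looks like (the sizes here are my own illustrative choice, not the ones from the thread; `$` interpolation is the BenchmarkTools way to avoid benchmarking access to untyped globals):

```julia
using BenchmarkTools, LinearAlgebra

A = randn(1000, 1000)
B = randn(500, 500)

# `$` interpolates the globals so @btime measures the operation itself,
# not dynamic dispatch on untyped global variables. @btime also runs the
# expression many times, so compilation does not pollute the result.
@btime $A[:, 1:500] * $B        # slice: allocates a copy of the submatrix first
@btime view($A, :, 1:500) * $B  # view: no input copy, only the result allocates
```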

Why is it unaligned? I thought Julia matrices are stored in column-major order, so the sub-columns are stored consecutively in memory (and in cache)?

But @time is natively supported by Julia; I didn’t use BenchmarkTools at all.

Indeed it is, but you will draw false conclusions if you use it the way you did. For instance, the variables you use are global, which is a huge detriment to performance when benchmarking with @time.

I don’t think it’ll make much difference at the scale of 5000 x 5000 matrices.

That said, I see a pretty notable difference between @benchmark mul!($C, $A, $B) and A * B, and a very large difference with and without views. The views are almost as fast as not slicing at all.
What BenchmarkTools helps with is noise, not timing compilation.
My results:

julia> using BenchmarkTools, LinearAlgebra

julia> A = rand(5000, 5000); B = rand(5000, 5000); C = similar(A);

julia> @time A * B;
  0.152603 seconds (2 allocations: 190.735 MiB, 6.12% gc time)

julia> @time A * B;
  0.144549 seconds (2 allocations: 190.735 MiB)

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     123.864 ms (0.00% GC)
  median time:      125.252 ms (0.00% GC)
  mean time:        125.388 ms (0.00% GC)
  maximum time:     127.998 ms (0.00% GC)
  --------------
  samples:          40
  evals/sample:     1

julia> @time A * B[:,1:5000];
  0.233827 seconds (162.54 k allocations: 389.686 MiB)

julia> @time A * B[:,1:5000];
  0.211714 seconds (6 allocations: 381.470 MiB, 7.11% gc time)

julia> @time @views A * B[:,1:5000];
  0.589579 seconds (2.65 M allocations: 308.709 MiB, 2.86% gc time)

julia> @time @views A * B[:,1:5000];
  0.145052 seconds (8 allocations: 190.735 MiB)

julia> @time A * B[:,1:5000];
  0.203089 seconds (6 allocations: 381.470 MiB)

julia> @time @views A * B[:,1:5000];
  0.151660 seconds (8 allocations: 190.735 MiB, 7.51% gc time)

FWIW, matrix multiplication is O(N^3), copying memory for the slices is O(N^2), and D dynamic dispatches are O(D) (but with a much heftier constant).
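If the goal is to cut the O(N^2) copies entirely, one option is to `mul!` into a preallocated buffer through a view. A sketch (sizes and the buffer name `C` are my own; note that `mul!` must not write into memory that aliases its inputs, which is why a separate buffer is needed for the in-place-style update from the OP):

```julia
using LinearAlgebra

A = randn(1000, 1000)
B = randn(500, 500)
C = Matrix{Float64}(undef, 1000, 500)  # preallocated buffer, reusable across calls

# mul! writes the product into C with zero allocations, and the view avoids
# copying A[:, 1:500] on the input side. Then copy the result back in place.
mul!(C, view(A, :, 1:500), B)
A[:, 1:500] .= C
```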


No, you are of course right. I based my comment on the first timing in the OP, which maybe was not super relevant for the rest of the post.
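A quick way to see the contiguity point (a sketch with a small matrix of my own choosing): a view of whole leading columns of a column-major matrix shares the parent's memory layout exactly.

```julia
A = randn(6, 6)
v = @view A[:, 1:3]  # whole leading columns of a column-major matrix

# The view shares A's layout: stride 1 down each column, stride 6 across
# columns, and it starts at the very first element of A.
strides(A)                # (1, 6)
strides(v)                # (1, 6)
pointer(v) == pointer(A)  # true
```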