For a rank-one update of a dense matrix in Matlab, I would write A=A+u*v'. If I just use the same Matlab statement in Julia, I'll get poor performance because of needless memory allocation and copying. I know how to write this with loops to get performance, and I also know how to write it without explicit loops using the einsum macro. But is there a performant way to write it using the new generator and dot-operator features?
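For concreteness, the loop version I have in mind is something like this (rank1_loop! is just a name for this post):

function rank1_loop!(A, u, v)
    # column-major traversal: the inner loop runs down a column of A
    for j in eachindex(v), i in eachindex(u)
        A[i,j] += u[i] * v[j]
    end
    return A
end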
A .= A .+ u*v'
should be efficient. Note that I did not use a . on the *. This makes sense because the * in this context is not an element-wise operation but an outer product. I'm actually a bit confused as to what exactly .* does here (it returns the same result but is slow).
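To make the allocation behavior concrete: with the undotted *, the outer product is materialized first, and only the addition and assignment fuse, roughly like this sketch:

tmp = u * v'      # allocates an n×m temporary for the outer product
A .= A .+ tmp     # fused and in place: no further allocation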
The einsum package isn't maintained anymore, but check out TensorOperations.jl. Watch out, though: if I remember correctly, functions in that package liked to wrap their results, and I remember not being totally happy with it for that reason.
Update: Testing out TensorOperations a bit, it did not seem to wrap the results after all; however, it did not seem to like dotted operators very much, so I have no idea how performant it is.
BLAS.ger! performs a rank-1 update in place:
help?> BLAS.ger!
ger!(alpha, x, y, A)
Rank-1 update of the matrix A with vectors x and y as alpha*x*y' + A.
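A minimal usage sketch (on current Julia, BLAS lives in the LinearAlgebra standard library; on v0.6 it is available from Base):

using LinearAlgebra          # not needed on v0.6

A = rand(3, 4); u = rand(3); v = rand(4)
B = A + u * v'               # reference result
BLAS.ger!(1.0, u, v, A)      # in place: A += 1.0 * u * v'
A ≈ B                        # true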
Matrix multiplication allocates a temporary. I don't believe this will automatically lower to BLAS.ger!, and so @dmbates's suggestion should be used.
u.*v' should do exactly the right thing here, because broadcasting over the shapes (n,1) and (1,m) results in an (n,m) shape. Also, on v0.6, I don't find that it is slow. Comparing the suggestions so far:
using TensorOperations, BenchmarkTools
ru1!(A,u,v) = A .+= u*v'                   # broadcast with outer-product temporary
ru2!(A,u,v) = A .+= u.*v'                  # fully fused broadcast
ru3!(A,u,v) = @tensor A[i,j] += u[i]*v[j]  # TensorOperations
ru4!(A,u,v) = BLAS.ger!(1.0, u, v, A);     # BLAS rank-1 update
A=rand(1000,800); u=rand(1000); v=rand(800);
@btime ru1!($A,$u,$v);
@btime ru2!($A,$u,$v);
@btime ru3!($A,$u,$v);
@btime ru4!($A,$u,$v);
Result:
2.281 ms (2 allocations: 6.10 MiB)
646.828 μs (1 allocation: 16 bytes)
1.385 ms (31 allocations: 1.41 KiB)
667.131 μs (0 allocations: 0 bytes)
So the fused broadcast version (ru2!) is as fast as the BLAS version, and only causes one tiny allocation (for the transpose view of v, I guess). @tensor has too much overhead for this simple operation, but I found it rather good if you have multidimensional transposes/permutations where it's not at all easy to get a cache-friendly operation schedule.
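For example, the kind of permutation I mean (a sketch, not part of the benchmark above):

using TensorOperations
A = rand(100, 200, 300)
# 3-d "transpose" B[i,j,k] = A[k,i,j]: a naive triple loop is hard to
# make cache-friendly, while @tensor chooses a blocked schedule.
@tensor B[i,j,k] := A[k,i,j]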
I presume @traktofon ran with BLAS.set_num_threads(1); otherwise BLAS.ger! is the only multithreaded version as of v0.6, and should be faster.
More precisely, I ran it with the default; unfortunately there is no BLAS.get_num_threads, so I'm not sure how many threads that is, though from observing the CPU meter on my machine I guess it's four, not one. Whether multithreading is faster or not depends on the machine; this operation is strongly memory-bound, because each location of A gets read and updated just once. Explicitly setting the number of threads, I get
nthreads=1: 639.515 μs (0 allocations: 0 bytes)
nthreads=2: 645.072 μs (0 allocations: 0 bytes)
nthreads=4: 660.982 μs (0 allocations: 0 bytes)
but this is running on my laptop, which has only two real cores (plus hyperthreading) and only a single DIMM, so memory access is shared between all threads. On a machine with more memory channels (two channels should be standard on desktop PCs nowadays) the threading probably helps. Modern compute servers with multiple CPU sockets and up to four memory channels will likely see vast improvements.
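As a rough sanity check of the memory-bound claim, from the 1000×800 Float64 timings above (ignoring u, v, and cache effects):

bytes = 2 * 1000 * 800 * 8   # A is read once and written once
bytes / 646.828e-6 / 1e9     # ≈ 19.8 GB/s, plausibly near single-DIMM bandwidth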
Edit: now had a chance to test on a desktop with two memory channels and four cores. Results as expected:
ru2!: 245.488 μs (1 allocation: 16 bytes)
ru4! nthreads=1: 220.873 μs (0 allocations: 0 bytes)
ru4! nthreads=2: 111.577 μs (0 allocations: 0 bytes)
ru4! nthreads=4: 110.171 μs (0 allocations: 0 bytes)
You probably have 4 threads on your PC. On Ubuntu, the lscpu command shows:
Core(s) per socket: 2
Thread(s) per core: 2
=> CPU(s): 4
To get the number of threads, use
ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
The default value is the number of CPUs.
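(On later Julia versions, 1.6 and up if I recall correctly, the ccall is no longer needed:)

using LinearAlgebra
BLAS.get_num_threads()   # wrapper around the same query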
Julia has the built-in function Sys.cpu_info().
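For example:

length(Sys.cpu_info())   # number of logical CPUs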
From the discussion above it was difficult to see the conclusion, and Julia has improved quite a bit since this was posted. Also, a column-by-column approach using five-argument mul! can be added to the list of competitive methods.
using LinearAlgebra, BenchmarkTools, TensorOperations
ru1!(A,u,v) = A .+= u*v'                   # broadcast with outer-product temporary
ru2!(A,u,v) = A .+= u.*v'                  # fully fused broadcast
ru3!(A,u,v) = @tensor A[i,j] += u[i]*v[j]  # TensorOperations
ru4!(A,u,v) = BLAS.ger!(1.0, u, v, A);     # BLAS rank-1 update
function ru5!(A,u,v)
    (n,m) = size(A)
    for j = 1:m
        # column by column, in place: A[:,j] .= u .* v[j] .+ A[:,j]
        mul!(view(A,1:n,j), u, v[j], true, true)
    end
end
n=1024*8; m=512*4;
A=rand(n,m); u=rand(n); v=rand(m);
@btime ru1!($A,$u,$v);
@btime ru2!($A,$u,$v);
@btime ru3!($A,$u,$v);
@btime ru4!($A,$u,$v);
@btime ru5!($A,$u,$v);
versioninfo()
Result:
65.027 ms (2 allocations: 128.00 MiB)
15.223 ms (0 allocations: 0 bytes)
12.924 ms (0 allocations: 0 bytes)
11.325 ms (0 allocations: 0 bytes)
11.481 ms (0 allocations: 0 bytes)
Julia Version 1.8.5
Five-argument mul! seems to be close to BLAS performance. Five-argument mul! is written in Julia and has the usual single-language advantage, e.g. it might work for types not implemented in BLAS.
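For instance, this sketch (reusing ru5! from above) works for BigFloat, which has no BLAS kernels, so BLAS.ger! would throw a MethodError:

A = rand(BigFloat, 8, 4); u = rand(BigFloat, 8); v = rand(BigFloat, 4)
B = A + u * v'     # reference result
ru5!(A, u, v)      # pure-Julia path, works for generic number types
A ≈ B              # true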
FWIW, I get
julia> using LoopVectorization
julia> ru6!(A,u,v) = @turbo A .+= u.*v'
ru6! (generic function with 1 method)
julia> ru7!(A,u,v) = @tturbo A .+= u.*v'
ru7! (generic function with 1 method)
julia> @btime ru1!($A,$u,$v);
37.960 ms (2 allocations: 128.00 MiB)
julia> @btime ru2!($A,$u,$v);
16.157 ms (0 allocations: 0 bytes)
julia> @btime ru3!($A,$u,$v);
3.279 ms (0 allocations: 0 bytes)
julia> @btime ru4!($A,$u,$v);
3.418 ms (0 allocations: 0 bytes)
julia> @btime ru5!($A,$u,$v);
14.503 ms (0 allocations: 0 bytes)
julia> @btime ru6!($A,$u,$v);
12.884 ms (0 allocations: 0 bytes)
julia> @btime ru7!($A,$u,$v);
4.157 ms (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.10.0-DEV.574
Commit d8c2250074* (2023-02-11 11:07 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
Threads: 36 on 36 virtual cores
So a clear win for BLAS, thanks to leveraging multithreading.
Note that this benchmark is totally constrained by memory bandwidth, so the actual performance gain we can get is fairly limited. For reference:
julia> @btime sum($A);
10.120 ms (0 allocations: 0 bytes)
The single-threaded @turbo rank-1 update (ru6!) was less than 30% slower, despite also writing to A instead of merely reading it.
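The numbers behind that comparison, from the timings above:

128 * 2^20 / 10.120e-3 / 1e9   # sum(A): ≈ 13.3 GB/s of reads
12.884 / 10.120                # ru6! vs sum: ≈ 1.27, i.e. ~27% slower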