Best way to write dense rank-one update?


For a rank-one update of a dense matrix in Matlab, I would write A=A+u*v'. If I just use the same Matlab statement in Julia, I’ll get poor performance because of needless memory allocation and copying. I know how to write this with loops to get performance, and I also know how to write it without explicit loops using the einsum macro. But is there a performant way to write it using the new generator and dot-operator features?

A .= A .+ u*v'

should be efficient. Note that I did not use a . on the *. This makes sense because the * in this context is not an element-wise operation but an outer product. I’m actually a bit confused as to what exactly .* does here (it returns the same result but is slow).

The einsum package isn’t maintained anymore, but check out TensorOperations.jl. Watch out though, if I remember correctly functions in that package liked to wrap the results, I remember not being totally happy with it for that reason.

Update: Testing out TensorOperations a bit it did not seem to be wrapping the results, however it did not seem to like dotted operators very much, so I have no idea how performant it is.


BLAS.ger! performs a rank-1 update in place

help?> BLAS.ger!
  ger!(alpha, x, y, A)

  Rank-1 update of the matrix A with vectors x and y as alpha*x*y' + A.


Matrix multiplication has a temporary. I don’t believe this will automatically parse to BLAS.ger!, and so @dmbates’s suggestion should be used.


u.*v' should do exactly the right thing here, because broadcasting over the shapes (n,1) and (1,m) results in a (n,m) shape. Also, on v0.6, I don’t find that it is slow. Comparing the suggestions so far:

using TensorOperations, BenchmarkTools

ru1!(A,u,v) = A .+= u*v'
ru2!(A,u,v) = A .+= u.*v'
ru3!(A,u,v) = @tensor A[i,j] += u[i]*v[j]
ru4!(A,u,v) = BLAS.ger!(1.0, u, v, A);

A=rand(1000,800); u=rand(1000); v=rand(800);
@btime ru1!($A,$u,$v);
@btime ru2!($A,$u,$v);
@btime ru3!($A,$u,$v);
@btime ru4!($A,$u,$v);


2.281 ms (2 allocations: 6.10 MiB)
646.828 μs (1 allocation: 16 bytes)
1.385 ms (31 allocations: 1.41 KiB)
667.131 μs (0 allocations: 0 bytes)

So the fused broadcast version (ru2!) is as fast as the BLAS version, and only causes one tiny allocation (for the transpose view of v, I guess). @tensor has too much overhead for this simple operation, but I found it rather good if you have multidimensional transposes/permutations where it’s not at all easy to get a cache-friendly operation schedule.


I presume @traktofon ran with BLAS.set_num_threads(1) — otherwise BLAS.ger! is the only multithreaded version as of v0.6, and should be faster.


More precisely, I ran it with the default; unfortunately there is no BLAS.get_num_threads so I’m not sure how many threads that is, though from observing the CPU meter on my machine I guess it’s four, not one. Whether mulithreading is faster or not depends on the machine; this operation is strongly memory-bound, because each location of A gets read and updated just once. Explicitly setting the number of threads, I get

nthreads=1: 639.515 μs (0 allocations: 0 bytes)
nthreads=2: 645.072 μs (0 allocations: 0 bytes)
nthreads=4: 660.982 μs (0 allocations: 0 bytes)

but this is running on my laptop which has only two real cores (plus hyperthreading) and only employs a single DIMM, so the memory access is shared between all threads. On a machine with more memory channels (two channels should be standard on desktop PCs nowadays) the threading probably helps. Modern compute servers with multiple CPU sockets and up to four memory channels will likely see vast improvements.

Edit: now had a chance to test on a desktop with two memory channels and four cores. Results as expected:

ru2!: 245.488 μs (1 allocation: 16 bytes)
ru4! nthreads=1: 220.873 μs (0 allocations: 0 bytes)
ru4! nthreads=2: 111.577 μs (0 allocations: 0 bytes)
ru4! nthreads=4: 110.171 μs (0 allocations: 0 bytes)


You have 4 threads on your PC, probably.

Core(s) per socket:    2
Thread(s) per core:    2
=> CPU(s):             4

On ubuntu command lscpu

To get the number of threads, use
ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
The default value is the number of CPUs.
Julia has built-in function Sys.cpu_info().