# Avoiding intermediate vector allocations in matrix to scalar computations

I recently noticed that computing a scalar as `vector' * matrix * vector` allocates memory. This isn't hugely surprising: the output of the first multiplication is a vector. However, the allocation can easily be avoided; see the `bAb_noalloc` function below. My question is whether there is a way to tell the compiler to look for this optimization, or whether I must write a function by hand, as I have done.

There are two slightly different ways in which I need to do these products, and currently I've had to write a special function for each one, which also makes the code harder to understand. So I'd like to avoid the special functions if possible.

```julia
using BenchmarkTools

A = randn(10, 10)
b = randn(10)

bAb(A, b) = b' * A * b

function bAb_noalloc(A, b)
    total = zero(eltype(b))
    for ind in eachindex(b)
        total += @inbounds (b' * view(A, :, ind)) * b[ind]
    end
    return total
end

@btime bAb($A, $b)
@btime bAb_noalloc($A, $b)

99.090 ns (1 allocation: 144 bytes)
80.492 ns (0 allocations: 0 bytes)
```

In this specific case you could just use `dot(b, A, b)` from the `LinearAlgebra` standard library, which uses BLAS and will generally be faster than your explicit loop.

```julia
using BenchmarkTools
using LinearAlgebra

A = randn(10, 10)
b = randn(10)

bAb(A, b) = b' * A * b

function bAb_noalloc(A, b)
    total = zero(eltype(b))
    for ind in eachindex(b)
        total += @inbounds (b' * view(A, :, ind)) * b[ind]
    end
    return total
end

@btime bAb($A, $b)
@btime bAb_noalloc($A, $b)
@btime dot($b, $A, $b)
@show isapprox(bAb(A, b), dot(b, A, b))

114.067 ns (1 allocation: 144 bytes)
125.188 ns (0 allocations: 0 bytes)
72.140 ns (0 allocations: 0 bytes)
isapprox(bAb(A, b), dot(b, A, b)) = true
```

Terrific. I wasn’t aware of this three-argument form. Thank you!
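For anyone else discovering it here: the three-argument form also handles the general bilinear form `x' * A * y`, not just the quadratic case, without materializing the intermediate `A * y`. A small check with made-up data:

```julia
using LinearAlgebra

x = randn(4)
A = randn(4, 6)
y = randn(6)

# Three-argument dot computes x' * A * y without allocating A * y:
@assert dot(x, A, y) ≈ x' * A * y
```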

I think it's also just loops; see e.g. `@less dot(b, A, b)`. For a large enough matrix, the version which does use BLAS will be faster, despite the allocation:

```julia
julia> A = randn(1000, 1000); b = randn(1000);

julia> begin
           @btime bAb($A, $b)
           @btime bAb_noalloc($A, $b)
           @btime dot($b, $A, $b)
           @show isapprox(bAb(A, b), dot(b, A, b))
       end;
min 84.000 μs, mean 118.693 μs (1 allocation, 7.94 KiB)
min 651.833 μs, mean 697.378 μs (0 allocations)
min 409.375 μs, mean 445.740 μs (0 allocations)
isapprox(bAb(A, b), dot(b, A, b)) = true
```

Oh. Well spotted. What a shame there's a crossover point where one version overtakes the other.

It appears there’s no BLAS function which does `b' * A * b` in one go. That’s a pity.

You can use `LoopVectorization.jl` to get somewhat close in performance (at least on my laptop):

```julia
using LoopVectorization

function bAb_loopvectorization(A, b)
    total = zero(eltype(b))
    @tturbo for i in eachindex(b), j in eachindex(b)
        total += b[i] * A[i, j] * b[j]
    end
    return total
end
```
Full code snippet
```julia
using Pkg
Pkg.activate(; temp=true)
Pkg.add(["BenchmarkTools", "LoopVectorization"])

using BenchmarkTools
using LinearAlgebra
using LoopVectorization

A = randn(10, 10)
b = randn(10)

bAb(A, b) = b' * A * b

function bAb_noalloc(A, b)
    total = zero(eltype(b))
    for ind in eachindex(b)
        total += @inbounds (b' * view(A, :, ind)) * b[ind]
    end
    return total
end

function bAb_loopvectorization(A, b)
    total = zero(eltype(b))
    @tturbo for i in eachindex(b), j in eachindex(b)
        total += b[i] * A[i, j] * b[j]
    end
    return total
end

bAb!(w, A, b) = b' * mul!(w, A', b)

function bench(N)
    A = randn(N, N)
    b = randn(N)
    w = similar(b)
    @assert isapprox(bAb(A, b), dot(b, A, b))
    @assert isapprox(bAb!(w, A, b), dot(b, A, b))
    @assert isapprox(bAb_loopvectorization(A, b), dot(b, A, b))
    println("Benchmark: $N")
    println("naive")
    @btime bAb($A, $b)
    println("explicit loop")
    @btime bAb_noalloc($A, $b)
    println("loopvectorization")
    @btime bAb_loopvectorization($A, $b)
    println("dot")
    @btime dot($b, $A, $b)
    println("bAb!")
    @btime bAb!($w, $A, $b)
    println()
end

bench(10)
bench(100)
bench(1000)
bench(10000)
```

This gives me the following results:

| matrix size | bAb | bAb_noalloc | dot | bAb_loopvectorization | bAb! |
|---|---|---|---|---|---|
| 1e1 | 108.411 ns | 125.656 ns | 54.433 ns | 19.335 ns | 68.340 ns |
| 1e2 | 1.320 μs | 1.928 μs | 832.520 ns | 745.377 ns | 1.187 μs |
| 1e3 | 17.042 μs | 362.067 μs | 342.302 μs | 69.913 μs | 16.273 μs |
| 1e4 | 20.249 ms | 35.923 ms | 36.475 ms | 19.696 ms | 20.214 ms |

Run with 8 Julia threads and 8 BLAS threads (that's why I used `@tturbo`, which generates threaded code, as opposed to `@turbo`).
LV wins for small matrices by quite a bit, but loses out in an intermediate regime and then wins again.

Edit: I also added a version where the temporary is preallocated, but this is of course more annoying to call and only offers gains for small(ish) matrices.

```julia
bAb!(w, A, b) = b' * mul!(w, A', b)  # note: A' is faster than A here
```
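As a sanity check (with made-up data): the transpose does not change the value, since `b' * A' * b` is the transpose of the scalar `b' * A * b`, and a real scalar equals its own transpose. So both orderings give the same result:

```julia
using LinearAlgebra

A = randn(5, 5)
b = randn(5)
w = similar(b)

# b' * (A' * b) equals b' * (A * b) for the quadratic form;
# only the memory access pattern of the matrix-vector product differs.
@assert b' * mul!(w, A', b) ≈ b' * (A * b)
```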

Very helpful. Thank you! 