You can use LoopVectorization.jl
to get somewhat close in performance (at least on my laptop):
using LoopVectorization
# Compute the quadratic form b' * A * b via an explicit double loop.
# `@tturbo` (LoopVectorization.jl) vectorizes AND threads the loop nest,
# so the accumulation order differs from `b' * A * b`; results agree only
# up to floating-point rounding (hence `isapprox` in the checks below).
# NOTE(review): assumes size(A) == (length(b), length(b)) — no size check.
function bAb_loopvectorization(A,b)
# zero(eltype(b)) keeps the accumulator type-stable for any element type.
total = zero(eltype(b))
@tturbo for i in eachindex(b), j in eachindex(b)
total += b[i]*A[i,j]*b[j]
end
return total
end
Full code snippet:
# Set up a throwaway environment so the benchmark is reproducible.
using Pkg  # Pkg must be loaded before Pkg.activate / Pkg.add can be called
Pkg.activate(; temp=true)
Pkg.add(["BenchmarkTools", "LinearAlgebra", "LoopVectorization"])
using BenchmarkTools
using LinearAlgebra
using LoopVectorization
# Small example inputs for interactive use; bench(N) below
# allocates its own A and b and does not use these.
A = randn(10, 10)
b = randn(10)
"""
    bAb(A, b)

Naive quadratic form `b' * A * b`; allocates the intermediate
matrix-vector product.
"""
function bAb(A, b)
    return b' * A * b
end
"""
    bAb_noalloc(A, b)

Compute `b' * A * b` column-by-column without materializing `A * b`:
accumulates `(b' * A[:, j]) * b[j]` over the columns of `A`.
"""
function bAb_noalloc(A, b)
    acc = zero(eltype(b))
    @inbounds for j in eachindex(b)
        # `b' * view(A, :, j)` dispatches to `dot`, so this is the same computation.
        acc += dot(b, view(A, :, j)) * b[j]
    end
    return acc
end
# Compute the quadratic form b' * A * b via an explicit double loop.
# `@tturbo` (LoopVectorization.jl) vectorizes AND threads the loop nest,
# so the accumulation order differs from `b' * A * b`; results agree only
# up to floating-point rounding (hence `isapprox` in bench's checks).
# NOTE(review): assumes size(A) == (length(b), length(b)) — no size check.
function bAb_loopvectorization(A,b)
# zero(eltype(b)) keeps the accumulator type-stable for any element type.
total = zero(eltype(b))
@tturbo for i in eachindex(b), j in eachindex(b)
total += b[i]*A[i,j]*b[j]
end
return total
end
"""
    bAb!(w, A, b)

Compute `b' * A * b` using the preallocated buffer `w` for the
matrix-vector product: first `w .= A' * b` in place, then `b' * w`.
"""
function bAb!(w, A, b)
    mul!(w, A', b)  # mul! returns w, filled with A' * b
    return b' * w
end
"""
    bench(N)

Benchmark every `b' * A * b` implementation on random `N×N` input and
print `@btime` timings. Each variant is first sanity-checked against
`dot(b, A, b)` (with `isapprox`, since the summation orders differ).
"""
function bench(N)
    A = randn(N, N)
    b = randn(N)
    w = similar(b)
    @assert isapprox(bAb(A, b), dot(b, A, b))
    # The original script benchmarked bAb_noalloc without checking it; added here.
    @assert isapprox(bAb_noalloc(A, b), dot(b, A, b))
    @assert isapprox(bAb!(w, A, b), dot(b, A, b))
    @assert isapprox(bAb_loopvectorization(A, b), dot(b, A, b))
    println("Benchmark: $N")
    println("naive")
    @btime bAb($A, $b)
    println("explicit loop")  # fixed typo: was "explicite loop"
    @btime bAb_noalloc($A, $b)
    println("loopvectorization")
    @btime bAb_loopvectorization($A, $b)
    println("dot")
    @btime dot($b, $A, $b)
    println("bAb!")
    @btime bAb!($w, $A, $b)
    println()
end
# Run the benchmark across four orders of magnitude in matrix size.
foreach(bench, (10, 100, 1_000, 10_000))
This gives me the following results:
| matrix size | bAb | bAb_noalloc | dot | bAb_loopvectorization | bAb! |
|---|---|---|---|---|---|
| 1e1 | 108.411 ns | 125.656 ns | 54.433 ns | 19.335 ns | 68.340 ns |
| 1e2 | 1.320 μs | 1.928 μs | 832.520 ns | 745.377 ns | 1.187 μs |
| 1e3 | 17.042 μs | 362.067 μs | 342.302 μs | 69.913 μs | 16.273 μs |
| 1e4 | 20.249 ms | 35.923 ms | 36.475 ms | 19.696 ms | 20.214 ms |
Run with 8 Julia threads and 8 BLAS threads (that is why I used @tturbo, which generates threaded code, as opposed to @turbo).
LV wins for small matrices by quite a bit, but loses out in an intermediate regime and then wins again.
Edit: Also added a version where the temporary is preallocated, but this is of course more annoying to call and only offers gains for small(ish) matrices.
bAb!(w,A,b) = b'*mul!(w,A',b) # note: A' benchmarked faster than A here; for the real inputs used, the scalar b'A'b equals b'Ab