Just for fun, I implemented my idea using SIMD.jl. It only works for real-valued matrices, complex values are a different kettle of fish.
For 1000x1000 matrices, it is approximately 2x as fast as a regular elementwise multiplication over the (dense) parent array, pretty close to what I predicted. (I also tried LoopVectorization; see the edit below.) It is not fully optimal, nor is it really ‘safe’ with respect to all sizes and indexing schemes (it will probably fail for very small matrices, for example):
"""
    simdmul!(C::LowerTriangular, A::LowerTriangular, B::LowerTriangular)

Compute the elementwise (Hadamard) product of the lower-triangular matrices
`A` and `B` into `C`, vectorizing over the dense parent arrays with SIMD.jl.
Real element types only. Returns `C`.
"""
function simdmul!(C::LowerTriangular{T}, A::LowerTriangular{T}, B::LowerTriangular{T}) where {T<:Real}
    size(A) == size(B) == size(C) || error("Incompatible sizes.")
    # Lane width: how many elements of T fit in a 512-bit SIMD register.
    # NOTE(review): this hard-codes 512-bit vectors (AVX-512); on narrower
    # hardware SIMD.jl will split the lane, which still works but may not be
    # the optimal width — confirm against the target CPU.
    N = div(512, 8 * sizeof(T))
    lowermul!(parent(C), parent(A), parent(B), Val(N))
    # Fix: return the wrapped matrix C. The original fell through and
    # returned lowermul!'s result — the dense parent array — which is
    # inconsistent with automul! and with Base's mul! convention.
    return C
end
# Kernel: elementwise product over the column-major storage of C, A, B using
# SIMD.jl lanes of width N, touching (roughly) only the lower triangle.
# NOTE(review): requires SIMD.jl's `VecRange` to be in scope. The author
# warns this is not safe for all sizes/indexing schemes (e.g. very small
# matrices) — a lane starting near the end of a column appears to overrun
# into the next column's above-diagonal entries, which a LowerTriangular
# wrapper would ignore anyway; verify before reusing on raw dense data.
function lowermul!(C, A, B, ::Val{N}) where {N}
    # A SIMD index covering N consecutive linear indices, anchored at the
    # first linear index of A.
    lane = VecRange{N}(firstindex(A, 1))
    (J, K) = size(C)
    # J_ is the linear-index upper bound for the current column's chunk loop;
    # skip is the linear offset just before the current column's diagonal.
    J_ = J
    skip = 0
    # All columns except the last; the final (bottom-right) element is done
    # as a scalar below so the vector loop never reads past the array end.
    for _ in 1:K-1
        # Multiply N-wide chunks starting at this column's diagonal element
        # (linear index skip + 1) and running down the column.
        for j in skip:N:J_
            lanej = lane + j
            C[lanej] = A[lanej] * B[lanej]
        end
        # Jump straight to the next column's diagonal: J entries per column
        # plus one more to step one row down — this skips the above-diagonal
        # entries without iterating over them.
        skip += J + 1 # <- this is the trick that saves operations
        J_ += J
    end
    # Handle the last diagonal element as a scalar.
    C[end] = A[end] * B[end]
    return C
end
Edit: Actually, you don’t need manual SIMD — you can just ask LoopVectorization to vectorize the inner loop for you and do the completely simple, naive thing, leading to cleaner code:
"""
    automul!(C::LowerTriangular, A::LowerTriangular, B::LowerTriangular)

Elementwise product of the lower-triangular matrices `A` and `B`, written
into `C`. Delegates the actual loop to `lowermul!` over the dense parent
arrays. Real element types only. Returns `C`.
"""
function automul!(C::LowerTriangular{T}, A::LowerTriangular{T}, B::LowerTriangular{T}) where {T<:Real}
    # Guard clause: all three operands must have matching dimensions.
    if !(size(A) == size(B) == size(C))
        error("Incompatible sizes.")
    end
    lowermul!(parent(C), parent(A), parent(B))
    return C
end
# Kernel: in-place elementwise product restricted to the lower triangle.
# For each column, only rows from the diagonal down are written; entries
# above the diagonal are left untouched. Returns C.
# NOTE(review): the surrounding text mentions LoopVectorization, but this
# loop is plain Julia — the compiler's own vectorizer handles it.
function lowermul!(C, A, B)
    @views for col in axes(C, 2)
        rows = col:lastindex(C, 1)
        C[rows, col] .= A[rows, col] .* B[rows, col]
    end
    return C
end