Element-wise multiplication of lower triangular matrices

I’m confused about why element-wise multiplication of lower triangular matrices is not faster than element-wise multiplication of full matrices. This behavior is shown in the following MWE:

using LinearAlgebra, BenchmarkTools

A = rand(ComplexF64, 1000, 1000); 
B = rand(ComplexF64, 1000, 1000);

function f(A, B)
	UnitLowerTriangular(A) .* UnitLowerTriangular(B)
end

function f2(A, B)
	return A .* B
end 

Benchmarking the two functions gives:

@btime f(A, B)
  4.017 ms (2 allocations: 15.26 MiB)
@btime f2(A, B)
  3.719 ms (2 allocations: 15.26 MiB)

The number of operations for the lower triangular matrices should be close to half of that for the full matrices, so I’d expect it to take about half the time?


The problem is that for most operations involving symmetric or triangular matrices, you could indeed get a 2x speedup by only doing half the operations, but with a packed storage layout it would come at the cost of losing the typically ~8x speedup you can get from SIMD.

For this reason, triangular and symmetric matrices are typically stored just like regular matrices, but the unused (lower or upper) triangle is treated as ‘junk’ data that is simply ignored.

So if we used denser, packed storage we could indeed get 2x faster element-wise multiplication, but things like matrix multiplication would suffer immensely.
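
For example, a quick check shows there is no packed storage to exploit here; the triangular wrappers in LinearAlgebra simply hold a reference to the full dense matrix:

using LinearAlgebra

A = rand(ComplexF64, 4, 4)
L = UnitLowerTriangular(A)

# The wrapper keeps the full dense matrix as its backing store; the strict
# upper triangle is just ignored rather than physically removed.
parent(L) === A   # true: same underlying array, no memory savings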


Thanks @Mason for your insight. The reason I’m asking is that I’m doing an element-wise multiplication of two Hermitian matrices, and since the result is itself Hermitian, I was hoping I could save almost half the time this way.

I’m still not really clear on why this speed-up can’t be achieved.

Maybe one could build a data structure holding both representations: one used for matrix operations, the other for element-wise operations.
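
For the Hermitian case, here is a rough sketch of the kind of saving I’m hoping for (hermprod is just a placeholder name, and this is completely unoptimized): fill only the lower triangle and wrap the result, which is valid because the element-wise product of two Hermitian matrices is again Hermitian.

using LinearAlgebra

# Hypothetical helper: element-wise product of two Hermitian matrices,
# computing only the lower triangle and declaring the result Hermitian.
function hermprod(A::Hermitian, B::Hermitian)
	C = similar(parent(A))
	for j in axes(C, 2), i in j:lastindex(C, 1)  # lower triangle only
		C[i, j] = A[i, j] * B[i, j]
	end
	return Hermitian(C, :L)  # the untouched upper triangle is never read
end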

There’s no reason you can’t have it both ways, at least if you help the SIMD along manually. I presume there’s a challenge getting it to happen automatically, though.


I think @Mason meant that the data is not loaded from contiguous memory.
Hence the performance benefit is much lower, as you need to do multiple non-SIMD loads into a vector and only then do the SIMD processing (the multiplication).


I think the point is that matrix multiplication is one of the most highly optimized operations in any BLAS suite. By default Julia uses OpenBLAS, but there are other options such as MKL for Intel processors and AppleAccelerate for Apple M-series processors. A lot of effort went into those implementations by people at the respective companies working closely with the hardware folks. Reproducing such intensely optimized code for triangular or Hermitian matrices would be a huge undertaking.

There is also the RectangularFullPacked.jl package. It uses a reduced-size representation of triangular or Hermitian matrices that doesn’t lose too much performance relative to having a full-size square array as a backing store.


You can do all SIMD loads and process everything with SIMD. You just have to accept that not every load will be contiguous with the previous one, and that there will be a slight redundancy in how much data you have to process.

This particular case was element-wise multiplication, though, not a matrix product.

I never meant that it was impossible to get this 2x speedup for triangular structures (in fact, I did say that it was possible for element-wise stuff). My point was just that the algorithms and data structures are all designed in a format that makes getting this last 2x kinda inconvenient and awkward.

Someone could definitely do it, it’s just that element-wise operations aren’t really something people typically care enough about to optimize that hard.


You’re right. This is what happens when I reply to messages before coffee.


It is not only non-contiguous, it will also not be aligned.
Unless you do index tricks in the loop, which then means you have to do everything by hand, since the compiler will see it as a data dependency.
So you would probably have to create a different data structure optimized for this.
Which would cause issues for element-wise multiplication with other structures, and for other operations.


Not sure what you mean. If the array is aligned and you have regular numbers (e.g. 32- or 64-bit) you can start your load at any arbitrary position in the array, except that you must be careful near the end (I’ve been spending the past few weeks jumping around arrays with AVX loads exactly like that).

You need to calculate when and how far to skip for each column. Depending on how hard this is, it may end up costing more than you save, but I think it should work well for large matrices.

As @Mason said, the benefits may not be great enough to justify the extra work and complexity, in the general case.

Just for fun, I implemented my idea using SIMD.jl. It only works for real-valued matrices; complex values are a different kettle of fish.

For 1000x1000 matrices, it is approximately 2x as fast as a regular element-wise multiplication over the (dense) parent array, pretty close to what I predicted. (I also tried LoopVectorization.jl, which offered no speedup.) It is not fully optimal, nor is it really ‘safe’ with respect to all sizes and indexing schemes (it will probably fail for very small matrices, for example):

using LinearAlgebra, SIMD

function simdmul!(C::LowerTriangular{T}, A::LowerTriangular{T}, B::LowerTriangular{T}) where {T<:Real}
	size(A) == size(B) == size(C) || error("Incompatible sizes.")
	N = div(512, 8 * sizeof(T))  # elements per 512-bit vector
	lowermul!(parent(C), parent(A), parent(B), Val(N))
end

function lowermul!(C, A, B, ::Val{N}) where {N}
	lane = VecRange{N}(firstindex(A, 1))
	(J, K) = size(C)
	J_ = J
	skip = 0
	for _ in 1:K-1
		for j in skip:N:J_
			lanej = lane + j
			C[lanej] = A[lanej] * B[lanej]
		end
		skip += J + 1  # <- this is the trick that saves operations
		J_ += J
	end
	C[end] = A[end] * B[end]  # handle the bottom-right corner element separately
	return C
end
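
A usage sketch (assuming Float64 matrices, since the method is restricted to real element types); the sanity check compares only what the wrapper exposes, i.e. the lower triangle:

using LinearAlgebra, SIMD

A = LowerTriangular(rand(1000, 1000))
B = LowerTriangular(rand(1000, 1000))
C = LowerTriangular(zeros(1000, 1000))

simdmul!(C, A, B)
C ≈ LowerTriangular(parent(A) .* parent(B))  # should hold at this size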

Edit: Actually, you don’t need manual SIMD; you can just write the completely simple, naive loop (or ask LoopVectorization to vectorize the inner loop for you), leading to cleaner code:

using LinearAlgebra

function automul!(C::LowerTriangular{T}, A::LowerTriangular{T}, B::LowerTriangular{T}) where {T<:Real}
	size(A) == size(B) == size(C) || error("Incompatible sizes.")
	lowermul!(parent(C), parent(A), parent(B))
	return C
end

function lowermul!(C, A, B)
	for k in axes(C, 2)
		for j in k:lastindex(C, 1)  # only rows on or below the diagonal
			C[j, k] = A[j, k] * B[j, k]
		end
	end
	return C
end
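
A benchmark sketch for comparing it against the dense broadcast (setup assumed; timings will of course depend on the machine):

using LinearAlgebra, BenchmarkTools

A = LowerTriangular(rand(1000, 1000));
B = LowerTriangular(rand(1000, 1000));
C = LowerTriangular(similar(parent(A)));

@btime automul!($C, $A, $B);          # naive triangular loop
@btime $(parent(A)) .* $(parent(B));  # dense element-wise baseline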

I was under the impression that Julia won’t generate SIMD code for a triangular loop.
Can you verify? It seems strange.

Here’s the weird thing. If I do

using LoopVectorization  # provides @turbo

function tmul!(C, A, B)
	@turbo for i in eachindex(A, B, C)
		C[i] = A[i] * B[i]
	end
	return C
end

and call it with regular 1000x1000 arrays, @code_llvm shows plenty of load <4 x double>. The vanilla triangular loop shows no vectorization in @code_llvm. And still, it is twice as fast.

What’s going on?

Is this a symptom of memory-bound operations? SIMD doesn’t make memory any faster while the triangular loop only pulls ~half the memory.
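
A back-of-envelope check of that explanation (assuming three Float64 arrays streamed through memory):

n = 1000
dense_bytes = 3 * n^2 * sizeof(Float64)                # read A, read B, write C
tri_bytes   = 3 * (n * (n + 1) ÷ 2) * sizeof(Float64)  # only the lower triangle
dense_bytes / tri_bytes                                # ≈ 2, about the observed speedup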


Yeah, that’s it. Reducing the size to 20x20 changes the picture.