# SymTrididiagonal matrices performance and views allocations

Hi,

I work on a toy 2D CFD solver and I wonder about the idiomatic Julian way to compute efficiently the product for all j \in [1:ny]

Pxy[:,j]=Lx*Sxy[:j]


where Lx is a a SymTridiagonal matrix of rank nx and Pxy and Sxy are two 2D arrays of size (nx,ny).

I wrote a loop based function myprod_basic which seems to perform significantly faster than myprod. In addition myprod allocates memory:


3.858 ms (0 allocations: 0 bytes) #myprod_basic
8.958 ms (200000 allocations: 9.16 MiB) #myprod


Note that this snippet may eventually be part of some material for a Julia hands-on session.

Here is the MWE:

using BenchmarkTools
using LinearAlgebra
function myprod_basic(Pxy,Sxy,Lx)
nx,ny=size(Pxy)

d=Lx.dv
e=Lx.ev

for it=1:1000
for j=1:ny
@inbounds Pxy[1,j]=d*Sxy[1,j]+e*Sxy[2,j]
for i=2:nx-1
@inbounds Pxy[i,j]=d[i]*Sxy[i,j]+e[i-1]*Sxy[i-1,j]+e[i]*Sxy[i+1,j]
end
@inbounds Pxy[nx,j]=d[nx]*Sxy[nx,j]+e[nx-1]*Sxy[nx-1,j]
end
end
end

function myprod(Pxy,Sxy,Lx)
nx,ny=size(Pxy)
for it=1:1000
@inbounds for j=1:ny
@views mul!(Pxy[:,j],Lx,Sxy[:,j])
end
end
end

function tprod(nx,ny)
dv=rand(nx)
ev=rand(nx)
Lx=SymTridiagonal(dv,ev)

Pxy=zeros(nx,ny)
Pxyref=zeros(nx,ny)
Sxy=rand(nx,ny)

pj=Pxy[:,1]
sj=Sxy[:,1]

@btime mul!($pj,$Lx,$sj) @btime myprod_basic($Pxyref,$Sxy,$Lx)
@btime myprod($Pxy,$Sxy,$Lx) @assert Pxyref≈Pxy end nx=100 ny=100 tprod(nx,ny)  3 Likes Multiplying a matrix by all of the columns is equivalent to the matrix–matrix multiplication Lx*Sxy. So, just do mul!(Pxy, Lx, Sxy)  to use the pre-allocated output matrix Pxy. No views or loops required. The memory allocation here seems excessive, and it looks like may arise because the mul! method for SymTridiagonal in the standard library was unnecessarily restricted to act on only a few matrix types. Should be fixed here: 8 Likes Thank you very much ! I did not realized that it can be seen as a Matrix-Matrix product The timings improve and the allocations disappear but it is still slower than the loop based version 3.879 ms (0 allocations: 0 bytes) #myprod_basic 8.899 ms (200000 allocations: 9.16 MiB) #myprod 7.663 ms (0 allocations: 0 bytes) #myprod SJ with mul!(Pxy, Lx, Sxy)  Looking at the native code, it looks that the latter fails to vectorize. Thank you for the mul! fix ! Follows the updated MWE. Summary using BenchmarkTools using LinearAlgebra function myprod_basic(Pxy,Sxy,Lx) nx,ny=size(Pxy) d=Lx.dv e=Lx.ev for it=1:1000 # Threads.@threads for j=1:ny for j=1:ny @inbounds Pxy[1,j]=d*Sxy[1,j]+e*Sxy[2,j] for i=2:nx-1 @inbounds Pxy[i,j]=d[i]*Sxy[i,j]+e[i-1]*Sxy[i-1,j]+e[i]*Sxy[i+1,j] end @inbounds Pxy[nx,j]=d[nx]*Sxy[nx,j]+e[nx-1]*Sxy[nx-1,j] end end end function myprod(Pxy,Sxy,Lx) nx,ny=size(Pxy) for it=1:1000 @inbounds for j=1:ny @views mul!(Pxy[:,j],Lx,Sxy[:,j]) end end end function myprod2(Pxy,Sxy,Lx) for it=1:1000 mul!(Pxy,Lx,Sxy) end end function tprod(nx,ny) dv=rand(nx) ev=rand(nx) Lx=SymTridiagonal(dv,ev) Pxy=zeros(nx,ny) Pxyref=zeros(nx,ny) Sxy=rand(nx,ny) pj=Pxy[:,1] sj=Sxy[:,1] @btime mul!($pj,$Lx,$sj)
@btime myprod_basic($Pxyref,$Sxy,$Lx) @btime myprod($Pxy,$Sxy,$Lx)

@assert Pxyref≈Pxy

@btime myprod2($Pxy,$Sxy,\$Lx)
@assert Pxyref≈Pxy

end

nx=100
ny=100
tprod(nx,ny)


That’s interesting. It seems the julia version (in tridiag.jl) tries to minimize the memory accesses, which interferes with the vectorization. The code is pretty old (@andreasnoack in 2014), so things might have been different then. Looks like you can make a PR with your version and improve things for everybody! Possibly the same applies to the multiplications in bidiag.jl also.

2 Likes

Well my biggest concern was about the allocations.
Before a SymTridiagonal mul! PR I should think more about it…

• First: my loop based version lacks some corner case tests like :
 d=Lx.dv
e=Lx.ev
nx>0 && @inbounds Pxy[1,j]=d*Sxy[1,j]+e*Sxy[2,j]
for i=2:nx-1
@inbounds Pxy[i,j]=d[i]*Sxy[i,j]+e[i-1]*Sxy[i-1,j]+e[i]*Sxy[i+1,j]
end
nx>1 && @inbounds Pxy[nx,j]=d[nx]*Sxy[nx,j]+e[nx-1]*Sxy[nx-1,j]


The 2 extra tests do not seem to cause a noticeable overhead for the 100x100 case.

• Second: I have discussed with Andreas about the opportunity to add a more general matrix type like
SymNBanded{T,N} where N would stand for the matrix halfbandwidth (N=1 for Tridiagonal).
In this case, the product and the solvers (LDLT factorization) can be written once for tri/penta/hepta…diagonal matrices (for higher order discretization schemes).

A few months ago, @ffevotte and I have written such a (proto-)type (François took good care of the unrolling macros) but we still postponed a PR proposal because we (especially I) are still climbing the Julia learning curve Julia is very easy to enter and makes you want to write scientific software like no other. Nevertheless, it takes time to know enough about Julia to propose library with a quality that can match with existing ones…

2 Likes

I’m not sure the StridedVecOrMat restriction applies here. When I try

v = @views Pxy[:,1]
v isa LinearAlgebra.StridedVecOrMat # true
@which mul!(v, Lx, v) # returns the method in tridiag.jl, line 166


So there is no fallback acting. Why would a relaxation to AbstractVecOrMat avoid the allocations?

Right, sorry: that was a red herring, since StridedVecOrMat already includes subarrays.

You might find BandedMatrices.jl, which supports Symmetric{T,<:BandedMatrix} .

I think the allocations are due to generating views. If you divide the allocated memory by the number of allocations, then it’s just 48 bytes per allocation.

I think that the fixed bandwidth N defined in the type is important for performance.

Perhaps, especially with OpenBLAS whose Banded routines are dreadfully slow. But a new Banded type would make a good PR, and you get all the default banded implementations for free.

1 Like

Yes, extend BandedMatrices.jl seems logical.

I don’t know what has changed recently, but I’m not seeing any allocations in any of the three methods in Julia v1.3, in contrast to v1.2 and below. Anyway, the number of allocations seems to indicate that this is indeed due to the generation of views exclusively. There are 200,000 allocations, which equals 2 (for the 2 views in each mul! call) times 100 (for going through the 100 columns) times 1000 (for the number of iterations in myprod).

1 Like