SymTridiagonal matrices: performance and view allocations

Hi,

I work on a toy 2D CFD solver and I wonder about the idiomatic Julian way to efficiently compute, for all j in 1:ny, the product

Pxy[:,j]=Lx*Sxy[:,j]

where Lx is a SymTridiagonal matrix of size nx, and Pxy and Sxy are two 2D arrays of size (nx,ny).

I wrote a loop-based function myprod_basic which seems to perform significantly faster than myprod. In addition, myprod allocates memory:

 3.858 ms (0 allocations: 0 bytes) #myprod_basic
 8.958 ms (200000 allocations: 9.16 MiB) #myprod

Note that this snippet may eventually be part of some material for a Julia hands-on session.
Thank you for your help.

Here is the MWE:

using BenchmarkTools
using LinearAlgebra
function myprod_basic(Pxy,Sxy,Lx)
    nx,ny=size(Pxy)

    d=Lx.dv
    e=Lx.ev

    for it=1:1000
        # Threads.@threads  for j=1:ny
        for j=1:ny
            @inbounds Pxy[1,j]=d[1]*Sxy[1,j]+e[1]*Sxy[2,j]
            for i=2:nx-1
                @inbounds Pxy[i,j]=d[i]*Sxy[i,j]+e[i-1]*Sxy[i-1,j]+e[i]*Sxy[i+1,j]
            end
            @inbounds Pxy[nx,j]=d[nx]*Sxy[nx,j]+e[nx-1]*Sxy[nx-1,j]
        end
    end
end


function myprod(Pxy,Sxy,Lx)
    nx,ny=size(Pxy)
    for it=1:1000
        @inbounds for j=1:ny
            @views mul!(Pxy[:,j],Lx,Sxy[:,j])
        end
    end
end

function tprod(nx,ny)
    dv=rand(nx)
    ev=rand(nx)
    Lx=SymTridiagonal(dv,ev)

    Pxy=zeros(nx,ny)
    Pxyref=zeros(nx,ny)
    Sxy=rand(nx,ny)

    pj=Pxy[:,1]
    sj=Sxy[:,1]

    @btime mul!($pj,$Lx,$sj)
    @btime myprod_basic($Pxyref,$Sxy,$Lx)
    @btime myprod($Pxy,$Sxy,$Lx)

    @assert Pxyref≈Pxy

end


nx=100
ny=100
tprod(nx,ny)

Multiplying a matrix by all of the columns is equivalent to the matrix–matrix multiplication Lx*Sxy. So, just do

mul!(Pxy, Lx, Sxy)

to use the pre-allocated output matrix Pxy. No views or loops required.
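A quick way to convince yourself of the equivalence is to compare the column-by-column products with a single matrix–matrix product (a minimal sketch, with small illustrative sizes):

```julia
using LinearAlgebra

nx, ny = 5, 4
Lx  = SymTridiagonal(rand(nx), rand(nx - 1))
Sxy = rand(nx, ny)

# Column-by-column products...
Pcols = similar(Sxy)
for j in 1:ny
    Pcols[:, j] = Lx * Sxy[:, j]
end

# ...give the same result as one matrix-matrix product
# into a pre-allocated output.
Pmat = similar(Sxy)
mul!(Pmat, Lx, Sxy)

@assert Pcols ≈ Pmat
```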

The memory allocation here seems excessive, and it looks like it may arise because the mul! method for SymTridiagonal in the standard library was unnecessarily restricted to act on only a few matrix types. Should be fixed here:


Thank you very much !
I did not realize that it could be seen as a matrix–matrix product :blush:

The timings improve and the allocations disappear, but it is still slower than the loop-based version:

3.879 ms (0 allocations: 0 bytes)   #myprod_basic
8.899 ms (200000 allocations: 9.16 MiB)  #myprod
7.663 ms (0 allocations: 0 bytes)  #myprod SJ with mul!(Pxy, Lx, Sxy)

Looking at the native code, it seems that the latter fails to vectorize.

Thank you for the mul! fix !

Here is the updated MWE.

using BenchmarkTools
using LinearAlgebra
function myprod_basic(Pxy,Sxy,Lx)
    nx,ny=size(Pxy)

    d=Lx.dv
    e=Lx.ev

    for it=1:1000
        # Threads.@threads  for j=1:ny
        for j=1:ny
            @inbounds Pxy[1,j]=d[1]*Sxy[1,j]+e[1]*Sxy[2,j]
            for i=2:nx-1
                @inbounds Pxy[i,j]=d[i]*Sxy[i,j]+e[i-1]*Sxy[i-1,j]+e[i]*Sxy[i+1,j]
            end
            @inbounds Pxy[nx,j]=d[nx]*Sxy[nx,j]+e[nx-1]*Sxy[nx-1,j]
        end
    end
end


function myprod(Pxy,Sxy,Lx)
    nx,ny=size(Pxy)
    for it=1:1000
        @inbounds for j=1:ny
            @views mul!(Pxy[:,j],Lx,Sxy[:,j])
        end
    end
end

function myprod2(Pxy,Sxy,Lx)
    for it=1:1000
        mul!(Pxy,Lx,Sxy)
    end
end

function tprod(nx,ny)
    dv=rand(nx)
    ev=rand(nx)
    Lx=SymTridiagonal(dv,ev)

    Pxy=zeros(nx,ny)
    Pxyref=zeros(nx,ny)
    Sxy=rand(nx,ny)

    pj=Pxy[:,1]
    sj=Sxy[:,1]

    @btime mul!($pj,$Lx,$sj)
    @btime myprod_basic($Pxyref,$Sxy,$Lx)
    @btime myprod($Pxy,$Sxy,$Lx)

    @assert Pxyref≈Pxy

    @btime myprod2($Pxy,$Sxy,$Lx)
    @assert Pxyref≈Pxy

end


nx=100
ny=100
tprod(nx,ny)

That’s interesting. It seems the Julia version (in tridiag.jl) tries to minimize memory accesses, which interferes with vectorization. The code is pretty old (@andreasnoack in 2014), so things might have been different then. It looks like you could make a PR with your version and improve things for everybody! Possibly the same applies to the multiplications in bidiag.jl as well.
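Such a kernel might look roughly like the loop-based version, with the row loop innermost (contiguous in memory for column-major arrays) and marked for SIMD. A sketch, not the actual stdlib method: symtri_mul! is a hypothetical name, and it assumes nx ≥ 2.

```julia
using LinearAlgebra

# Hypothetical column-major kernel: same recurrence as myprod_basic,
# with the contiguous row loop innermost so it can vectorize.
function symtri_mul!(P::AbstractMatrix, L::SymTridiagonal, S::AbstractMatrix)
    nx, ny = size(P)
    d, e = L.dv, L.ev
    @inbounds for j in 1:ny
        P[1, j] = d[1]*S[1, j] + e[1]*S[2, j]
        @simd for i in 2:nx-1
            P[i, j] = d[i]*S[i, j] + e[i-1]*S[i-1, j] + e[i]*S[i+1, j]
        end
        P[nx, j] = d[nx]*S[nx, j] + e[nx-1]*S[nx-1, j]
    end
    return P
end

Lx = SymTridiagonal(rand(100), rand(99))
S  = rand(100, 100)
P  = similar(S)
symtri_mul!(P, Lx, S)
@assert P ≈ Lx * S
```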


Well my biggest concern was about the allocations.
Before a SymTridiagonal mul! PR I should think more about it…

  • First: my loop-based version lacks some corner-case tests, for example:
 d=Lx.dv
 e=Lx.ev
 nx>1 && @inbounds Pxy[1,j]=d[1]*Sxy[1,j]+e[1]*Sxy[2,j]
 for i=2:nx-1
      @inbounds Pxy[i,j]=d[i]*Sxy[i,j]+e[i-1]*Sxy[i-1,j]+e[i]*Sxy[i+1,j]
 end
 nx>1 && @inbounds Pxy[nx,j]=d[nx]*Sxy[nx,j]+e[nx-1]*Sxy[nx-1,j]
 nx==1 && @inbounds Pxy[1,j]=d[1]*Sxy[1,j]

The extra tests do not seem to cause a noticeable overhead for the 100x100 case.
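One way to gain confidence in such corner cases is to compare the guarded kernel against the generic product for all the small sizes where the guards matter. A sketch, where guarded_col! is a hypothetical helper wrapping the per-column recurrence:

```julia
using LinearAlgebra

# Hypothetical per-column kernel with corner-case guards for nx == 1 and nx == 2.
function guarded_col!(P, S, d, e, nx, j)
    nx > 1 && @inbounds(P[1, j] = d[1]*S[1, j] + e[1]*S[2, j])
    for i in 2:nx-1
        @inbounds P[i, j] = d[i]*S[i, j] + e[i-1]*S[i-1, j] + e[i]*S[i+1, j]
    end
    nx > 1 && @inbounds(P[nx, j] = d[nx]*S[nx, j] + e[nx-1]*S[nx-1, j])
    nx == 1 && @inbounds(P[1, j] = d[1]*S[1, j])
    return P
end

# Check against the generic product for the sizes where the guards matter.
for nx in 1:4
    Lx = SymTridiagonal(rand(nx), rand(nx - 1))
    S  = rand(nx, 3)
    P  = similar(S)
    for j in 1:3
        guarded_col!(P, S, Lx.dv, Lx.ev, nx, j)
    end
    @assert P ≈ Lx * S
end
```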

  • Second: I have discussed with Andreas the possibility of adding a more general matrix type like
    SymNBanded{T,N}, where N would stand for the matrix half-bandwidth (N=1 for tridiagonal).
    In this case, the product and the solvers (LDLT factorization) could be written once for tri/penta/hepta…diagonal matrices (useful for higher-order discretization schemes).

A few months ago, @ffevotte and I wrote such a (proto)type (François took good care of the unrolling macros), but we have postponed a PR proposal because we (especially I) are still climbing the Julia learning curve :wink:
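For concreteness, a minimal sketch of what such a type could look like. All names here are illustrative (not the actual prototype), storage and API choices are assumptions, and no optimized kernels are shown:

```julia
using LinearAlgebra

# Hypothetical symmetric banded type: N is the half-bandwidth
# (N = 1 recovers the symmetric tridiagonal case).
struct SymNBanded{T,N} <: AbstractMatrix{T}
    n::Int
    bands::NTuple{N,Vector{T}}  # bands[k] holds the k-th superdiagonal
    diag::Vector{T}
end

Base.size(A::SymNBanded) = (A.n, A.n)
function Base.getindex(A::SymNBanded{T,N}, i::Int, j::Int) where {T,N}
    i == j && return A.diag[i]
    k = abs(i - j)
    return k <= N ? A.bands[k][min(i, j)] : zero(T)
end

# With N = 1 this reproduces a SymTridiagonal.
n = 6
A = SymNBanded{Float64,1}(n, (rand(n - 1),), rand(n))
@assert Matrix(A) ≈ SymTridiagonal(A.diag, A.bands[1])
```

With the fixed N in the type, kernels like the product or an LDLT solve can unroll the band loop at compile time, which is presumably where the performance benefit would come from.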

Julia is very easy to get into and makes you want to write scientific software like no other language. Nevertheless, it takes time to know enough about Julia to propose a library whose quality can match that of existing ones…


I’m not sure the StridedVecOrMat restriction applies here. When I try

v = @views Pxy[:,1]
v isa LinearAlgebra.StridedVecOrMat # true
@which mul!(v, Lx, v) # returns the method in tridiag.jl, line 166

So there is no fallback acting. Why would a relaxation to AbstractVecOrMat avoid the allocations?

Right, sorry: that was a red herring, since StridedVecOrMat already includes subarrays.

You might find BandedMatrices.jl useful; it supports Symmetric{T,<:BandedMatrix}.

I think the allocations are due to the creation of the views: dividing the allocated memory by the number of allocations gives just 48 bytes per allocation.

I think that the fixed bandwidth N defined in the type is important for performance.

Perhaps, especially with OpenBLAS, whose banded routines are dreadfully slow. But a new banded type would make a good PR there, and you would get all the default banded implementations for free.


Yes, extending BandedMatrices.jl seems logical.