SymTridiagonal matrices: performance and view allocations

Hi,

I work on a toy 2D CFD solver and I wonder about the idiomatic Julian way to efficiently compute, for all j in 1:ny, the product

Pxy[:,j]=Lx*Sxy[:,j]

where Lx is a SymTridiagonal matrix of size nx, and Pxy and Sxy are two 2D arrays of size (nx,ny).

I wrote a loop-based function myprod_basic which seems to perform significantly faster than myprod. In addition, myprod allocates memory:

 3.858 ms (0 allocations: 0 bytes) #myprod_basic
 8.958 ms (200000 allocations: 9.16 MiB) #myprod

Note that this snippet may eventually be part of some material for a Julia hands-on session.
Thank you for your help.

Here is the MWE:

using BenchmarkTools
using LinearAlgebra
function myprod_basic(Pxy,Sxy,Lx)
    nx,ny=size(Pxy)

    d=Lx.dv
    e=Lx.ev

    for it=1:1000
        # Threads.@threads  for j=1:ny
        for j=1:ny
            @inbounds Pxy[1,j]=d[1]*Sxy[1,j]+e[1]*Sxy[2,j]
            for i=2:nx-1
                @inbounds Pxy[i,j]=d[i]*Sxy[i,j]+e[i-1]*Sxy[i-1,j]+e[i]*Sxy[i+1,j]
            end
            @inbounds Pxy[nx,j]=d[nx]*Sxy[nx,j]+e[nx-1]*Sxy[nx-1,j]
        end
    end
end


function myprod(Pxy,Sxy,Lx)
    nx,ny=size(Pxy)
    for it=1:1000
        @inbounds for j=1:ny
            @views mul!(Pxy[:,j],Lx,Sxy[:,j])
        end
    end
end

function tprod(nx,ny)
    dv=rand(nx)
    ev=rand(nx)
    Lx=SymTridiagonal(dv,ev)

    Pxy=zeros(nx,ny)
    Pxyref=zeros(nx,ny)
    Sxy=rand(nx,ny)

    pj=Pxy[:,1]
    sj=Sxy[:,1]

    @btime mul!($pj,$Lx,$sj)
    @btime myprod_basic($Pxyref,$Sxy,$Lx)
    @btime myprod($Pxy,$Sxy,$Lx)

    @assert Pxyref≈Pxy

end


nx=100
ny=100
tprod(nx,ny)

Multiplying a matrix by all of the columns is equivalent to the matrix–matrix multiplication Lx*Sxy. So, just do

mul!(Pxy, Lx, Sxy)

to use the pre-allocated output matrix Pxy. No views or loops required.
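A quick way to convince yourself of the equivalence is to compare the column-by-column products with a single matrix–matrix product (a minimal sketch, with small illustrative sizes):

```julia
using LinearAlgebra

nx, ny = 5, 4
Lx  = SymTridiagonal(rand(nx), rand(nx - 1))
Sxy = rand(nx, ny)

# Column-by-column products...
Pcols = similar(Sxy)
for j in 1:ny
    Pcols[:, j] = Lx * Sxy[:, j]
end

# ...give the same result as one matrix-matrix product
# into a pre-allocated output.
Pmat = similar(Sxy)
mul!(Pmat, Lx, Sxy)

@assert Pcols ≈ Pmat
```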

The memory allocation here seems excessive, and it looks like it may arise because the mul! method for SymTridiagonal in the standard library was unnecessarily restricted to act on only a few matrix types. Should be fixed here:


Thank you very much !
I did not realize that it could be seen as a matrix–matrix product :blush:

The timings improve and the allocations disappear, but it is still slower than the loop-based version:

3.879 ms (0 allocations: 0 bytes)   #myprod_basic
8.899 ms (200000 allocations: 9.16 MiB)  #myprod
7.663 ms (0 allocations: 0 bytes)  #myprod SJ with mul!(Pxy, Lx, Sxy)

Looking at the native code, it seems that the latter fails to vectorize.

Thank you for the mul! fix !

Here is the updated MWE.

using BenchmarkTools
using LinearAlgebra
function myprod_basic(Pxy,Sxy,Lx)
    nx,ny=size(Pxy)

    d=Lx.dv
    e=Lx.ev

    for it=1:1000
        # Threads.@threads  for j=1:ny
        for j=1:ny
            @inbounds Pxy[1,j]=d[1]*Sxy[1,j]+e[1]*Sxy[2,j]
            for i=2:nx-1
                @inbounds Pxy[i,j]=d[i]*Sxy[i,j]+e[i-1]*Sxy[i-1,j]+e[i]*Sxy[i+1,j]
            end
            @inbounds Pxy[nx,j]=d[nx]*Sxy[nx,j]+e[nx-1]*Sxy[nx-1,j]
        end
    end
end


function myprod(Pxy,Sxy,Lx)
    nx,ny=size(Pxy)
    for it=1:1000
        @inbounds for j=1:ny
            @views mul!(Pxy[:,j],Lx,Sxy[:,j])
        end
    end
end

function myprod2(Pxy,Sxy,Lx)
    for it=1:1000
        mul!(Pxy,Lx,Sxy)
    end
end

function tprod(nx,ny)
    dv=rand(nx)
    ev=rand(nx)
    Lx=SymTridiagonal(dv,ev)

    Pxy=zeros(nx,ny)
    Pxyref=zeros(nx,ny)
    Sxy=rand(nx,ny)

    pj=Pxy[:,1]
    sj=Sxy[:,1]

    @btime mul!($pj,$Lx,$sj)
    @btime myprod_basic($Pxyref,$Sxy,$Lx)
    @btime myprod($Pxy,$Sxy,$Lx)

    @assert Pxyref≈Pxy

    @btime myprod2($Pxy,$Sxy,$Lx)
    @assert Pxyref≈Pxy

end


nx=100
ny=100
tprod(nx,ny)

That’s interesting. It seems the Julia version (in tridiag.jl) tries to minimize memory accesses, which interferes with vectorization. The code is pretty old (@andreasnoack in 2014), so things might have been different then. It looks like you could make a PR with your version and improve things for everybody! Possibly the same applies to the multiplications in bidiag.jl as well.
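Such a kernel might look roughly like the loop-based version, with the row loop innermost (contiguous in memory for column-major arrays) and marked for SIMD. A sketch, not the actual stdlib method: symtri_mul! is a hypothetical name, and it assumes nx ≥ 2.

```julia
using LinearAlgebra

# Hypothetical column-major kernel: same recurrence as myprod_basic,
# with the contiguous row loop innermost so it can vectorize.
function symtri_mul!(P::AbstractMatrix, L::SymTridiagonal, S::AbstractMatrix)
    nx, ny = size(P)
    d, e = L.dv, L.ev
    @inbounds for j in 1:ny
        P[1, j] = d[1]*S[1, j] + e[1]*S[2, j]
        @simd for i in 2:nx-1
            P[i, j] = d[i]*S[i, j] + e[i-1]*S[i-1, j] + e[i]*S[i+1, j]
        end
        P[nx, j] = d[nx]*S[nx, j] + e[nx-1]*S[nx-1, j]
    end
    return P
end

Lx = SymTridiagonal(rand(100), rand(99))
S  = rand(100, 100)
P  = similar(S)
symtri_mul!(P, Lx, S)
@assert P ≈ Lx * S
```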


Well my biggest concern was about the allocations.
Before a SymTridiagonal mul! PR I should think more about it…

  • First: my loop-based version lacks some corner-case tests, for example:
 d=Lx.dv
 e=Lx.ev
 nx>1 && @inbounds Pxy[1,j]=d[1]*Sxy[1,j]+e[1]*Sxy[2,j]
 for i=2:nx-1
      @inbounds Pxy[i,j]=d[i]*Sxy[i,j]+e[i-1]*Sxy[i-1,j]+e[i]*Sxy[i+1,j]
 end
 nx>1 && @inbounds Pxy[nx,j]=d[nx]*Sxy[nx,j]+e[nx-1]*Sxy[nx-1,j]
 nx==1 && @inbounds Pxy[1,j]=d[1]*Sxy[1,j]

The extra tests do not seem to cause a noticeable overhead for the 100x100 case.
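One way to gain confidence in such corner cases is to compare the guarded kernel against the generic product for all the small sizes where the guards matter. A sketch, where guarded_col! is a hypothetical helper wrapping the per-column recurrence:

```julia
using LinearAlgebra

# Hypothetical per-column kernel with corner-case guards for nx == 1 and nx == 2.
function guarded_col!(P, S, d, e, nx, j)
    nx > 1 && @inbounds(P[1, j] = d[1]*S[1, j] + e[1]*S[2, j])
    for i in 2:nx-1
        @inbounds P[i, j] = d[i]*S[i, j] + e[i-1]*S[i-1, j] + e[i]*S[i+1, j]
    end
    nx > 1 && @inbounds(P[nx, j] = d[nx]*S[nx, j] + e[nx-1]*S[nx-1, j])
    nx == 1 && @inbounds(P[1, j] = d[1]*S[1, j])
    return P
end

# Check against the generic product for the sizes where the guards matter.
for nx in 1:4
    Lx = SymTridiagonal(rand(nx), rand(nx - 1))
    S  = rand(nx, 3)
    P  = similar(S)
    for j in 1:3
        guarded_col!(P, S, Lx.dv, Lx.ev, nx, j)
    end
    @assert P ≈ Lx * S
end
```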

  • Second: I have discussed with Andreas the possibility of adding a more general matrix type like
    SymNBanded{T,N}, where N would stand for the matrix half-bandwidth (N=1 for tridiagonal).
    In this case, the product and the solvers (LDLT factorization) could be written once for tri/penta/hepta…diagonal matrices (useful for higher-order discretization schemes).

A few months ago, @ffevotte and I wrote such a (proto)type (François took good care of the unrolling macros), but we have postponed a PR proposal because we (especially I) are still climbing the Julia learning curve :wink:
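For concreteness, a minimal sketch of what such a type could look like. All names here are illustrative (not the actual prototype), storage and API choices are assumptions, and no optimized kernels are shown:

```julia
using LinearAlgebra

# Hypothetical symmetric banded type: N is the half-bandwidth
# (N = 1 recovers the symmetric tridiagonal case).
struct SymNBanded{T,N} <: AbstractMatrix{T}
    n::Int
    bands::NTuple{N,Vector{T}}  # bands[k] holds the k-th superdiagonal
    diag::Vector{T}
end

Base.size(A::SymNBanded) = (A.n, A.n)
function Base.getindex(A::SymNBanded{T,N}, i::Int, j::Int) where {T,N}
    i == j && return A.diag[i]
    k = abs(i - j)
    return k <= N ? A.bands[k][min(i, j)] : zero(T)
end

# With N = 1 this reproduces a SymTridiagonal.
n = 6
A = SymNBanded{Float64,1}(n, (rand(n - 1),), rand(n))
@assert Matrix(A) ≈ SymTridiagonal(A.diag, A.bands[1])
```

With the fixed N in the type, kernels like the product or an LDLT solve can unroll the band loop at compile time, which is presumably where the performance benefit would come from.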

Julia is very easy to get into and makes you want to write scientific software like no other language. Nevertheless, it takes time to know enough about Julia to propose a library whose quality can match that of existing ones…


I’m not sure the StridedVecOrMat restriction applies here. When I try

v = @views Pxy[:,1]
v isa LinearAlgebra.StridedVecOrMat # true
@which mul!(v, Lx, v) # returns the method in tridiag.jl, line 166

So there is no fallback acting. Why would a relaxation to AbstractVecOrMat avoid the allocations?

Right, sorry: that was a red herring, since StridedVecOrMat already includes subarrays.

You might find BandedMatrices.jl useful; it supports Symmetric{T,<:BandedMatrix}.

I think the allocations are due to the creation of the views: dividing the allocated memory by the number of allocations gives just 48 bytes per allocation.

I think that the fixed bandwidth N defined in the type is important for performance.

Perhaps, especially with OpenBLAS, whose banded routines are dreadfully slow. But a new banded type would make a good PR there, and you would get all the default banded implementations for free.


Yes, extending BandedMatrices.jl seems logical.