Speed of ldiv! for banded matrices in BandedMatrices.jl

BandedMatrices.jl is innocent, I think. Profiling it, nearly all the time is spent in the LAPACK routines dgbtrs and sgbtrs, so I think this is just the way LAPACK performs on these problems. If you run it with substantially larger bandwidths (e.g. 100 instead of 10), I think you will find that single precision starts to pull ahead. These routines do not use blocking, so your timing is going to be dominated by memory access.
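As a rough check of that prediction, something along these lines could be timed. This is a sketch, not from the thread: brand (BandedMatrices' random banded-matrix constructor), the size n, and the bandwidths are all illustrative, and it assumes lu on a BandedMatrix dispatches to the banded LAPACK factorization.

    using BandedMatrices, LinearAlgebra, BenchmarkTools

    n = 10_000
    for T in (Float64, Float32), bw in (10, 100)
        A = brand(T, n, n, bw, bw)   # random n×n banded matrix, bandwidths (bw, bw)
        F = lu(A)                    # banded LU, LAPACK [ds]gbtrf
        rhs = rand(T, n)
        print(rpad("$T, bandwidth $bw:", 28))
        @btime ldiv!($F, x) setup = (x = copy($rhs))   # LAPACK [ds]gbtrs
    end

If the memory-access explanation is right, the Float32 timings should improve relative to Float64 as the bandwidth grows.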

The code from dgbtrs that does forward substitution using L is

         IF( LNOTI ) THEN
            DO 10 J = 1, N - 1
               LM = MIN( KL, N-J )
               L = IPIV( J )
               IF( L.NE.J )
     $            CALL DSWAP( NRHS, B( L, 1 ), LDB, B( J, 1 ), LDB )
               CALL DGER( LM, NRHS, -ONE, AB( KD+1, J ), 1, B( J, 1 ),
     $                    LDB, B( J+1, 1 ), LDB )
   10       CONTINUE
         END IF

The code for single precision is similar. The routine DSWAP exchanges two rows of B; since B is stored in column-major order, the elements of a row are LDB apart in memory, so the swap traverses both rows with a large stride, which is slow. Note that the right-hand side is B and the LU factorization is in band storage in AB.
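To see how much that access pattern can matter, here is a hypothetical micro-benchmark (mine, not from dgbtrs) comparing a row swap against a column swap on a column-major matrix; the row swap is the strided pattern DSWAP faces here, and the matrix size is arbitrary.

    using BenchmarkTools

    # Element-by-element swap of two equal-length views, like DSWAP.
    function swap!(x, y)
        for k in eachindex(x, y)
            x[k], y[k] = y[k], x[k]
        end
    end

    B = rand(4096, 4096)   # column major, as in Fortran

    # Rows: consecutive elements are 4096 Float64s apart in memory (strided).
    @btime swap!(u, v) setup = (u = view($B, 1, :); v = view($B, 2, :))

    # Columns: consecutive elements are adjacent in memory (unit stride).
    @btime swap!(u, v) setup = (u = view($B, :, 1); v = view($B, :, 2))

With NRHS = 1, though, each DSWAP call only touches a single element per row, so this cost shouldn't dominate in the single right-hand-side case discussed below.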

Edit: I went out on a limb with a guess about what was going on that, on reflection, doesn’t seem very plausible to me with a single right-hand side. So I’ll speculate a bit less and just say that this is a small matrix in the sense that the band storage for the LU factorization has only 31 rows (2*KL + KU + 1 = 31 for bandwidths KL = KU = 10). There may be something interesting going on with memory access, but LAPACK isn’t known for being fast on small matrices, and there could also be some fixed overhead that is independent of the precision. The fact that single precision is slower than you expected relative to double precision at such a small bandwidth is not very surprising. It would be interesting to try to profile the Fortran code.
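Short of profiling the Fortran directly with an external tool, Julia’s built-in sampling profiler can at least confirm where the samples land. A minimal sketch, under the same brand/lu assumptions as above:

    using BandedMatrices, LinearAlgebra, Profile

    A = brand(10_000, 10_000, 10, 10)   # bandwidths (10, 10), as in the question
    F = lu(A)
    x = rand(10_000)
    ldiv!(F, copy(x))                   # warm up / force compilation

    Profile.clear()
    @profile for _ in 1:10_000
        ldiv!(F, copy(x))               # calls LAPACK dgbtrs
    end
    Profile.print(C = true, mincount = 50)   # C = true shows the LAPACK frames

That won’t show hot spots inside dgbtrs itself, but it would rule out overhead on the Julia side.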
