Speed of ldiv! for banded matrices in BandedMatrices.jl

Theoretically, solving a banded linear system stored in Float32 should be about twice as fast as the same system stored in Float64. But I cannot reproduce this with BandedMatrices.jl. (Note that creating a random vector costs far less than solving a banded system.)

using BandedMatrices, LinearAlgebra, BenchmarkTools

n = 1000;
a = brand(Float64, n, n, 10, 10);
b = brand(Float32, n, n, 10, 10);
copyto!(b, a);
lua = lu(a);
lub = lu(b);
@btime ldiv!(lua, rand(n));
@btime ldiv!(lub, rand(Float32, n));

The results are

29.200 μs (3 allocations: 7.88 KiB)
31.800 μs (3 allocations: 4.00 KiB)

which means that solving banded matrices in Float32 is actually slower. I could not figure out why. Meanwhile, if the banded matrices are changed to dense matrices, the results are as expected.

n = 1000;
a = rand(Float64, n, n);
b = rand(Float32, n, n);
copyto!(b, a);
lua = lu(a);
lub = lu(b);
@btime ldiv!(lua, rand(n));
@btime ldiv!(lub, rand(Float32, n));

The outputs are

170.900 μs (3 allocations: 7.88 KiB)
90.900 μs (3 allocations: 4.00 KiB)

It seems that there is something wrong with BandedMatrices.jl. Can anyone help?

BandedMatrices.jl is innocent, I think. Profiling it, nearly all the time is spent in the LAPACK routines dgbtrs and sgbtrs, so I think this is just the way LAPACK performs on these problems. If you run it with substantially larger bandwidths (e.g. 100 instead of 10), I think you will find that single precision starts to pull ahead. These routines do not use blocking, so your timing is going to be dominated by memory access.
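
For reference, here is a minimal sketch of such a bandwidth sweep (assuming BandedMatrices.jl and BenchmarkTools.jl are available; the bandwidths in the tuple are just illustrative):

using BandedMatrices, LinearAlgebra, BenchmarkTools

n = 1000
for bw in (10, 30, 100)                      # half-bandwidths to compare
    F64 = lu(brand(Float64, n, n, bw, bw))   # banded LU, double precision
    F32 = lu(brand(Float32, n, n, bw, bw))   # banded LU, single precision
    t64 = @belapsed ldiv!($F64, x) setup = (x = rand(Float64, $n))
    t32 = @belapsed ldiv!($F32, x) setup = (x = rand(Float32, $n))
    println("bw = $bw:  Float64 $(round(1e6t64, digits=1)) μs,  Float32 $(round(1e6t32, digits=1)) μs")
end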

The code from dgbtrs that does forward substitution using L is

         IF( LNOTI ) THEN
            DO 10 J = 1, N - 1
               LM = MIN( KL, N-J )
               L = IPIV( J )
               IF( L.NE.J )
     $            CALL DSWAP( NRHS, B( L, 1 ), LDB, B( J, 1 ), LDB )
               CALL DGER( LM, NRHS, -ONE, AB( KD+1, J ), 1, B( J, 1 ),
     $                    LDB, B( J+1, 1 ), LDB )
   10       CONTINUE

The code for single precision is similar. The routine DSWAP exchanges two rows, traversing those rows when the matrix is stored in column-major order, which is slow. Note that the right-hand side is B and the LU factorization is held in band storage in AB.
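
To make that band storage concrete, here is a minimal sketch (my own hand-rolled construction, not what BandedMatrices.jl does internally) that fills the LAPACK band format and calls the gbtrf!/gbtrs! wrappers from LinearAlgebra directly; note that AB has 2*kl + ku + 1 = 31 rows for kl = ku = 10, with the top kl rows left as workspace for fill-in. The boosted diagonal is just an assumption to keep the example well conditioned.

using LinearAlgebra: LAPACK

n, kl, ku = 1000, 10, 10
A  = zeros(n, n)                             # dense copy, only used for the check at the end
AB = zeros(2kl + ku + 1, n)                  # LAPACK band storage: 2kl + ku + 1 = 31 rows
for j in 1:n, i in max(1, j - ku):min(n, j + kl)
    v = rand() + (i == j ? kl + ku + 1 : 0)  # boosted diagonal keeps the example well conditioned
    A[i, j] = v
    AB[kl + ku + 1 + i - j, j] = v           # A[i,j] is stored in row kl+ku+1+i-j of column j
end

ABf, ipiv = LAPACK.gbtrf!(kl, ku, n, copy(AB))         # banded LU factorization (dgbtrf/sgbtrf)
b = rand(n)
x = LAPACK.gbtrs!('N', kl, ku, n, ABf, ipiv, copy(b))  # banded solve (dgbtrs/sgbtrs)
@assert A * x ≈ b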

Edit: I went out on a limb with a guess about what was going on that, on reflection, doesn’t seem very plausible with a single right-hand side. So I’ll speculate a bit less and just say that this is a small matrix in the sense that the band storage for the LU factorization has 31 rows. There may be something interesting going on with memory access, but LAPACK isn’t known for being fast on small matrices, and there could also be some fixed overhead that is independent of the precision. The fact that single precision is slower than you expected relative to double precision at such a small bandwidth is not very surprising. It would be interesting to try to profile the Fortran code.
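
Short of profiling the Fortran itself, a Julia-side profile with native frames enabled should at least show where the time lands; a minimal sketch, assuming the standard Profile stdlib (the iteration count is just illustrative):

using BandedMatrices, LinearAlgebra, Profile

n = 1000
F = lu(brand(Float32, n, n, 10, 10))
x = rand(Float32, n)

ldiv!(F, copy(x))                  # warm-up so compilation is not profiled
Profile.clear()
@profile for _ in 1:100_000
    ldiv!(F, copy(x))              # copy because ldiv! overwrites its second argument
end
Profile.print(C = true)            # include native frames (e.g. the LAPACK gbtrs routine)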

Thank you for your detailed explanation. As you suggested, I tried a larger bandwidth of 100 and the results are close to expectation: single precision takes about half the time of double precision.

70.900 μs (1 allocation: 7.94 KiB)
39.600 μs (1 allocation: 4.06 KiB)

BandedMatrices.jl just calls the standard LAPACK routines, and memory access is the bottleneck when the bandwidth is small. Anyway, this makes sense to me.
