Help improving the performance of my implementation of a lower triangular storage

Hi!

I am trying to build a lower triangular storage type to reduce the memory footprint of some applications. In my case, when computing spherical harmonics, the coefficients are usually placed in a lower triangular matrix. This is useful for many applications, like computing the geomagnetic field and the gravity acceleration.
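
To make the expected saving concrete for the 2200 Γ— 2200 case I benchmark below (Float64 entries):

n = 2200
dense_bytes  = n^2 * sizeof(Float64)                # 38_720_000 bytes (β‰ˆ 36.9 MiB)
packed_bytes = (n * (n + 1) Γ· 2) * sizeof(Float64)  # 19_368_800 bytes (β‰ˆ 18.5 MiB)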

My current implementation is:

abstract type AbstractDataAlignment end
struct RowMajor <: AbstractDataAlignment end
struct ColumnMajor <: AbstractDataAlignment end

struct LowerTriangularStorage{Ta, Tt} <: AbstractMatrix{Tt}
    @static if VERSION >= v"1.11-"
        data::Memory{Tt}
    else
        data::Vector{Tt}
    end

    n::Int

    function LowerTriangularStorage{Ta, Tt}(n::Int) where {Ta<:AbstractDataAlignment, Tt}
        n < 1 && throw(ArgumentError("Matrix size must be positive"))
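        # Total number of elements in the lower triangle, including the diagonal.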
        len  = (n * (n + 1)) Γ· 2

        data = @static if VERSION >= v"1.11-"
            Memory{Tt}(undef, len)
        else
            Vector{Tt}(undef, len)
        end

        new{Ta, Tt}(data, n)
    end

    function LowerTriangularStorage{Ta}(n::Int) where Ta <: AbstractDataAlignment
        return LowerTriangularStorage{Ta, Float64}(n)
    end

    function LowerTriangularStorage{T}(n::Int) where T
        return LowerTriangularStorage{ColumnMajor, T}(n)
    end

    function LowerTriangularStorage(n::Int)
        return LowerTriangularStorage{ColumnMajor, Float64}(n)
    end
end

@inline Base.@propagate_inbounds function Base.getindex(
    L::LowerTriangularStorage{Ta, Tt},
    i::Int,
    j::Int
) where {Ta <: AbstractDataAlignment, Tt}
    Base.@boundscheck (i < 1 || j < 1 || i > L.n || j > L.n || j > i) &&
        Base.throw_boundserror(L, (i, j))

    return (@inbounds L.data[_axes_to_index(L, i, j)])
end

@inline Base.@propagate_inbounds function Base.setindex!(
    L::LowerTriangularStorage{Ta, Tt},
    v,
    i::Int,
    j::Int
) where {Ta <: AbstractDataAlignment, Tt}
    Base.@boundscheck (i < 1 || j < 1 || i > L.n || j > L.n || j > i) &&
        Base.throw_boundserror(L, (i, j))

    @inbounds L.data[_axes_to_index(L, i, j)] = v
    return L
end

Base.size(L::LowerTriangularStorage) = (L.n, L.n)

function Base.zeros(::Type{T}, n::Int) where T <: LowerTriangularStorage
    L = T(n)
    L.data .= zero(eltype(L.data))
    return L
end

function Base.replace_in_print_matrix(::LowerTriangularStorage, i::Int, j::Int, s::AbstractString)
    i >= j && return s
    s == "#undef" && return "  Γ—" * repeat(" ", length(s) - 3)
    return repeat(" ", length(s))
end

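# Map (i, j), with j ≀ i, to its linear index in the packed row-major layout:
# row i holds columns 1:i, so it starts after i * (i - 1) Γ· 2 stored elements.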
@inline function _axes_to_index(::LowerTriangularStorage{RowMajor}, i::Integer, j::Integer)
    return (i * (i - 1)) Γ· 2 + j
end

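# Map (i, j), with j ≀ i, to its linear index in the packed column-major layout,
# where column j holds rows j:n.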
@inline function _axes_to_index(L::LowerTriangularStorage{ColumnMajor}, i::Integer, j::Integer)
    return ((j - 1) * (2L.n - j) + 2i) Γ· 2
end
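
A quick sanity check of the index mapping, using only the definitions above (the expected values are worked out by hand):

L = zeros(LowerTriangularStorage{ColumnMajor, Float64}, 4)

L[3, 2] = 1.0
L[3, 2]            # 1.0, stored at L.data[6] since ((2 - 1) * (2 * 4 - 2) + 2 * 3) Γ· 2 == 6
length(L.data)     # 10 == 4 * 5 Γ· 2, instead of 16 entries for a dense 4 Γ— 4 matrix
size(L)            # (4, 4)
# L[2, 3] throws a BoundsError because the upper triangle is not stored.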

For very large matrices, we have a huge gain (roughly a 50% reduction in runtime), as we can see when computing the derivatives of the Legendre coefficients:

julia> using SatelliteToolboxLegendre, BenchmarkTools

julia> P = LowerTriangularStorage{RowMajor}(2200);

julia> dP = LowerTriangularStorage{RowMajor}(2200);

julia> Pm = zeros(2200, 2200);

julia> dPm = zeros(2200, 2200);

julia> legendre!(Val(:full), P, 0.4)

julia> legendre!(Val(:full), Pm, 0.4)

julia> @benchmark dlegendre!(Val(:full), $dPm, 0.4, $Pm)
BenchmarkTools.Trial: 574 samples with 1 evaluation per sample.
 Range (min … max):  8.503 ms …  10.539 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     8.602 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   8.719 ms Β± 290.209 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ƒβ–‡β–ˆβ–†β–ƒβ–„β–„β–‚β–‚β–                                                  
  β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–ˆβ–†β–‡β–‡β–‡β–„β–…β–…β–‡β–ˆβ–ˆβ–†β–„β–β–„β–„β–…β–…β–„β–†β–β–β–…β–…β–…β–‡β–β–β–„β–†β–β–β–†β–…β–„β–…β–†β–„β–„β–β–„β–β–†β–β–„β–„ β–‡
  8.5 ms       Histogram: log(frequency) by time      9.85 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark dlegendre!(Val(:full), $dP, 0.4, $P)
BenchmarkTools.Trial: 1172 samples with 1 evaluation per sample.
 Range (min … max):  4.156 ms …  5.287 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     4.253 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.267 ms Β± 83.894 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

         β–ƒβ–†β–ƒβ–†β–…β–ˆβ–ƒ                                              
  β–ƒβ–„β–ƒβ–ƒβ–„β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–…β–„β–„β–„β–ƒβ–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–‚β–β–‚β–β–‚β–‚β–β–β–β–β–β–‚β–‚β–β–‚β–‚β–‚β–‚β–β–‚β–‚ β–ƒ
  4.16 ms        Histogram: frequency by time        4.65 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, for small matrices, the new implementation leads to a performance degradation:

julia> using SatelliteToolboxLegendre, BenchmarkTools

julia> P = LowerTriangularStorage{RowMajor}(20);

julia> dP = LowerTriangularStorage{RowMajor}(20);

julia> Pm = zeros(20, 20);

julia> dPm = zeros(20, 20);

julia> @benchmark dlegendre!(Val(:full), $dPm, 0.4, $Pm)
BenchmarkTools.Trial: 10000 samples with 285 evaluations per sample.
 Range (min … max):  282.600 ns … 514.326 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     283.042 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   284.933 ns Β±   7.142 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–†β–‚                   ▃▁                                      ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–…β–…β–…β–„β–„β–ƒβ–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–‡β–ˆβ–ˆβ–†β–†β–ˆβ–…β–†β–ˆβ–†β–†β–…β–†β–ƒβ–…β–†β–…β–…β–…β–…β–†β–†β–‡β–ˆβ–†β–…β–†β–‡β–…β–…β–‡β–‡β–…β–…β–…β–…β–„β–…β–…β–† β–ˆ
  283 ns        Histogram: log(frequency) by time        316 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark dlegendre!(Val(:full), $dP, 0.4, $P)
BenchmarkTools.Trial: 10000 samples with 214 evaluations per sample.
 Range (min … max):  349.103 ns … 748.051 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     351.827 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   356.279 ns Β±  13.982 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–…β–…β–ˆβ–…β–           β–ƒβ–‚β–‚β–‚                                          ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–…β–…β–…β–†β–…β–†β–„β–„β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–‡β–†β–†β–‡β–†β–†β–†β–†β–†β–‡β–‡β–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–ˆβ–‡β–„β–…β–†β–†β–…β–†β–„β–†β–…β–…β–ƒβ–…β–„β–„β–„β–„β–†β–… β–ˆ
  349 ns        Histogram: log(frequency) by time        408 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I am wondering if there is anything I can do to avoid this performance degradation on small matrices so that I can use this new storage type by default.

Make the size a type parameter and dispatch to whatever is faster?
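
Something along these lines, perhaps (a rough, untested sketch; the type name, the missing methods, and the cutoff are only illustrative):

# Encode the size N in the type so the small/large decision is resolved at compile time.
struct SizedLowerTriangular{N, Tt} <: AbstractMatrix{Tt}
    data::Vector{Tt}

    SizedLowerTriangular{N, Tt}() where {N, Tt} =
        new{N, Tt}(Vector{Tt}(undef, (N * (N + 1)) Γ· 2))
end

Base.size(::SizedLowerTriangular{N}) where N = (N, N)

# Row-major packed indexing; the upper triangle reads as zero in this sketch.
Base.getindex(L::SizedLowerTriangular{N, Tt}, i::Int, j::Int) where {N, Tt} =
    j <= i ? L.data[(i * (i - 1)) Γ· 2 + j] : zero(Tt)

# Callers can branch on N without any runtime size check:
use_packed_path(::SizedLowerTriangular{N}) where N = N > 64

The downside is that every distinct N compiles its own specialized methods.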

You could check @code_warntype for this line. I guess there are a few things that could go wrong.

It could be a solution, but I want to understand what I am doing wrong. I really expected the performance to increase in every case due to better cache usage.

I checked and it is type stable.

I’d probably go with a runtime branch instead for easier type stability and reduced compilation, but figuring out the cutoff isn’t straightforward. It could easily vary by platform. Considering the slowdown is a modest 0.806x at sub-microsecond scales, I question if it’s even worth the effort, especially since there is doubt about the implementation.
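
A minimal sketch of what I mean, reusing the zeros method from the first post (choose_storage and the cutoff of 64 are placeholders, not a measured value):

# Pick the storage with a plain runtime branch; downstream code then dispatches once per call.
choose_storage(n::Int; cutoff::Int = 64) =
    n <= cutoff ? zeros(n, n) : zeros(LowerTriangularStorage{RowMajor, Float64}, n)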

JET and Cthulhu are recursive: the former automatically and the latter manually.

Usually this code is called inside a simulator many times per integration step. Hence, this minor degradation can lead to a substantial slowdown.

See also the discussion in this thread: Triangular Arrays in julia? - #27 by mkborregaard

My feeling is that the factor of 2 storage benefit is not going to be worth it for the performance degradation of giving up BLAS, if you want to do linear algebra (e.g. triangular solves) with the matrix. YMMV depending on what you want to do with the array.

Actually I will use it only as storage. Thanks for the thread!

Now I saw something very strange and I really want to know why :slight_smile:

The original code for the derivative of the Legendre coefficients was:

            if m == 0
                ...
            elseif m == 1
                ...
            elseif n != m
                a_nm = +√(T(n + m) * T(n - m + 1)) / 2
                b_nm = -√(T(n + m + 1) * T(n - m)) / 2

                dP_nm = a_nm * P[iβ‚€ + n, jβ‚€ + m - 1] + b_nm * P[iβ‚€ + n, jβ‚€ + m + 1]

            else
                a_nm = +√(T(n + m) * T(n - m + 1)) / 2

                dP_nm = a_nm * P[iβ‚€ + n, jβ‚€ + m - 1]
            end

Now, if I change it to:

            if m == 0
                ...
            elseif m == 1
                ...
            else
                a_nm  = +√(T(n + m) * T(n - m + 1)) / 2
                dP_nm = a_nm * P[iβ‚€ + n, jβ‚€ + m - 1]

                if n != m
                    b_nm   = -√(T(n + m + 1) * T(n - m)) / 2
                    dP_nm += b_nm * P[iβ‚€ + n, jβ‚€ + m + 1]
                end
            end

The performance of both implementations is equal for small matrices. Is this supposed to happen?

After a lot of tests, this simple change in the if allowed the new triangular storage to be faster in all scenarios. Its performance when computing very high degrees is even better. I just hope that a compiler expert can tell me why this happened :slight_smile: It would be a very useful lesson for other scenarios.

I don’t understand the question. You probably want to provide a reproducer. Also, maybe open a new topic.

Can’t say much without something to tinker with, but it’s plausible at first glance that the change affected some code motion optimization or compacted the routine so it fits in the instruction cache a bit better.

The problem is that the only way to actually reproduce it is to copy that definition of the LowerTriangularStorage and execute it together with SatelliteToolboxLegendre.jl, as I showed in the first post.

I’m skeptical that this is really the only way to see what is going on. But it can take a fair amount of effort to reduce a problem to a minimal working example that illustrates the issue.

I completely agree! I meant that I was unable to reproduce the problem in any other way :smiley: I will keep trying.