Help improving the performance of my implementation of a lower triangular storage

Hi!

I am trying to build a lower triangular storage type to reduce the memory footprint of some applications. In my case, when computing spherical harmonics, the coefficients are usually placed in a lower triangular matrix. This is useful for many applications, like computing the geomagnetic field and the gravity acceleration.
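
To make the expected saving concrete for the 2200 Γ— 2200 case I benchmark below (Float64 entries):

n = 2200
dense_bytes  = n^2 * sizeof(Float64)                # 38_720_000 bytes (β‰ˆ 36.9 MiB)
packed_bytes = (n * (n + 1) Γ· 2) * sizeof(Float64)  # 19_368_800 bytes (β‰ˆ 18.5 MiB)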

My current implementation is:

abstract type AbstractDataAlignment end
struct RowMajor <: AbstractDataAlignment end
struct ColumnMajor <: AbstractDataAlignment end

struct LowerTriangularStorage{Ta, Tt} <: AbstractMatrix{Tt}
    @static if VERSION >= v"1.11-"
        data::Memory{Tt}
    else
        data::Vector{Tt}
    end

    n::Int

    function LowerTriangularStorage{Ta, Tt}(n::Int) where {Ta<:AbstractDataAlignment, Tt}
        n < 1 && throw(ArgumentError("Matrix size must be positive"))
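        # Total number of elements in the lower triangle, including the diagonal.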
        len  = (n * (n + 1)) Γ· 2

        data = @static if VERSION >= v"1.11-"
            Memory{Tt}(undef, len)
        else
            Vector{Tt}(undef, len)
        end

        new{Ta, Tt}(data, n)
    end

    function LowerTriangularStorage{Ta}(n::Int) where Ta <: AbstractDataAlignment
        return LowerTriangularStorage{Ta, Float64}(n)
    end

    function LowerTriangularStorage{T}(n::Int) where T
        return LowerTriangularStorage{ColumnMajor, T}(n)
    end

    function LowerTriangularStorage(n::Int)
        return LowerTriangularStorage{ColumnMajor, Float64}(n)
    end
end

@inline Base.@propagate_inbounds function Base.getindex(
    L::LowerTriangularStorage{Ta, Tt},
    i::Int,
    j::Int
) where {Ta <: AbstractDataAlignment, Tt}
    Base.@boundscheck (i < 1 || j < 1 || i > L.n || j > L.n || j > i) &&
        Base.throw_boundserror(L, (i, j))

    return (@inbounds L.data[_axes_to_index(L, i, j)])
end

@inline Base.@propagate_inbounds function Base.setindex!(
    L::LowerTriangularStorage{Ta, Tt},
    v,
    i::Int,
    j::Int
) where {Ta <: AbstractDataAlignment, Tt}
    Base.@boundscheck (i < 1 || j < 1 || i > L.n || j > L.n || j > i) &&
        Base.throw_boundserror(L, (i, j))

    @inbounds L.data[_axes_to_index(L, i, j)] = v
    return L
end

Base.size(L::LowerTriangularStorage) = (L.n, L.n)

function Base.zeros(::Type{T}, n::Int) where T <: LowerTriangularStorage
    L = T(n)
    L.data .= zero(eltype(L.data))
    return L
end

function Base.replace_in_print_matrix(::LowerTriangularStorage, i::Int, j::Int, s::AbstractString)
    i >= j && return s
    s == "#undef" && return "  Γ—" * repeat(" ", length(s) - 3)
    return repeat(" ", length(s))
end

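# Map (i, j), with j ≀ i, to its linear index in the packed row-major layout:
# row i holds columns 1:i, so it starts after i * (i - 1) Γ· 2 stored elements.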
@inline function _axes_to_index(::LowerTriangularStorage{RowMajor}, i::Integer, j::Integer)
    return (i * (i - 1)) Γ· 2 + j
end

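# Map (i, j), with j ≀ i, to its linear index in the packed column-major layout,
# where column j holds rows j:n.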
@inline function _axes_to_index(L::LowerTriangularStorage{ColumnMajor}, i::Integer, j::Integer)
    return ((j - 1) * (2L.n - j) + 2i) Γ· 2
end
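
A quick sanity check of the index mapping, using only the definitions above (the expected values are worked out by hand):

L = zeros(LowerTriangularStorage{ColumnMajor, Float64}, 4)

L[3, 2] = 1.0
L[3, 2]            # 1.0, stored at L.data[6] since ((2 - 1) * (2 * 4 - 2) + 2 * 3) Γ· 2 == 6
length(L.data)     # 10 == 4 * 5 Γ· 2, instead of 16 entries for a dense 4 Γ— 4 matrix
size(L)            # (4, 4)
# L[2, 3] throws a BoundsError because the upper triangle is not stored.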

For very large matrices, we have a huge gain (roughly a 50% reduction in runtime), as we can see when computing the derivatives of the Legendre coefficients:

julia> using SatelliteToolboxLegendre, BenchmarkTools

julia> P = LowerTriangularStorage{RowMajor}(2200);

julia> dP = LowerTriangularStorage{RowMajor}(2200);

julia> Pm = zeros(2200, 2200);

julia> dPm = zeros(2200, 2200);

julia> legendre!(Val(:full), P, 0.4)

julia> legendre!(Val(:full), Pm, 0.4)

julia> @benchmark dlegendre!(Val(:full), $dPm, 0.4, $Pm)
BenchmarkTools.Trial: 574 samples with 1 evaluation per sample.
 Range (min … max):  8.503 ms …  10.539 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     8.602 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   8.719 ms Β± 290.209 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ƒβ–‡β–ˆβ–†β–ƒβ–„β–„β–‚β–‚β–                                                  
  β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–ˆβ–†β–‡β–‡β–‡β–„β–…β–…β–‡β–ˆβ–ˆβ–†β–„β–β–„β–„β–…β–…β–„β–†β–β–β–…β–…β–…β–‡β–β–β–„β–†β–β–β–†β–…β–„β–…β–†β–„β–„β–β–„β–β–†β–β–„β–„ β–‡
  8.5 ms       Histogram: log(frequency) by time      9.85 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark dlegendre!(Val(:full), $dP, 0.4, $P)
BenchmarkTools.Trial: 1172 samples with 1 evaluation per sample.
 Range (min … max):  4.156 ms …  5.287 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     4.253 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.267 ms Β± 83.894 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

         β–ƒβ–†β–ƒβ–†β–…β–ˆβ–ƒ                                              
  β–ƒβ–„β–ƒβ–ƒβ–„β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–…β–„β–„β–„β–ƒβ–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–‚β–β–‚β–β–‚β–‚β–β–β–β–β–β–‚β–‚β–β–‚β–‚β–‚β–‚β–β–‚β–‚ β–ƒ
  4.16 ms        Histogram: frequency by time        4.65 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, for small matrices, the new implementation leads to a performance degradation:

julia> using SatelliteToolboxLegendre, BenchmarkTools

julia> P = LowerTriangularStorage{RowMajor}(20);

julia> dP = LowerTriangularStorage{RowMajor}(20);

julia> Pm = zeros(20, 20);

julia> dPm = zeros(20, 20);

julia> @benchmark dlegendre!(Val(:full), $dPm, 0.4, $Pm)
BenchmarkTools.Trial: 10000 samples with 285 evaluations per sample.
 Range (min … max):  282.600 ns … 514.326 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     283.042 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   284.933 ns Β±   7.142 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–†β–‚                   ▃▁                                      ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–…β–…β–…β–„β–„β–ƒβ–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–‡β–ˆβ–ˆβ–†β–†β–ˆβ–…β–†β–ˆβ–†β–†β–…β–†β–ƒβ–…β–†β–…β–…β–…β–…β–†β–†β–‡β–ˆβ–†β–…β–†β–‡β–…β–…β–‡β–‡β–…β–…β–…β–…β–„β–…β–…β–† β–ˆ
  283 ns        Histogram: log(frequency) by time        316 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark dlegendre!(Val(:full), $dP, 0.4, $P)
BenchmarkTools.Trial: 10000 samples with 214 evaluations per sample.
 Range (min … max):  349.103 ns … 748.051 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     351.827 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   356.279 ns Β±  13.982 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–…β–…β–ˆβ–…β–           β–ƒβ–‚β–‚β–‚                                          ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–…β–…β–…β–†β–…β–†β–„β–„β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–‡β–†β–†β–‡β–†β–†β–†β–†β–†β–‡β–‡β–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–ˆβ–‡β–„β–…β–†β–†β–…β–†β–„β–†β–…β–…β–ƒβ–…β–„β–„β–„β–„β–†β–… β–ˆ
  349 ns        Histogram: log(frequency) by time        408 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I am wondering if there is anything I can do to avoid this performance degradation on small matrices so that I can use this new storage type by default.

Make the size a type parameter and dispatch to whatever is faster?
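
Something along these lines, perhaps (a rough, untested sketch; the type name, the missing methods, and the cutoff are only illustrative):

# Encode the size N in the type so the small/large decision is resolved at compile time.
struct SizedLowerTriangular{N, Tt} <: AbstractMatrix{Tt}
    data::Vector{Tt}

    SizedLowerTriangular{N, Tt}() where {N, Tt} =
        new{N, Tt}(Vector{Tt}(undef, (N * (N + 1)) Γ· 2))
end

Base.size(::SizedLowerTriangular{N}) where N = (N, N)

# Row-major packed indexing; the upper triangle reads as zero in this sketch.
Base.getindex(L::SizedLowerTriangular{N, Tt}, i::Int, j::Int) where {N, Tt} =
    j <= i ? L.data[(i * (i - 1)) Γ· 2 + j] : zero(Tt)

# Callers can branch on N without any runtime size check:
use_packed_path(::SizedLowerTriangular{N}) where N = N > 64

The downside is that every distinct N compiles its own specialized methods.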

You could check @code_warntype for this line. I guess there are a few things that could go wrong.

It could be a solution, but I want to understand what I am doing wrong. I really expected the performance to increase in every case due to better cache usage.

I checked and it is type stable.

I’d probably go with a runtime branch instead for easier type stability and reduced compilation, but figuring out the cutoff isn’t straightforward. It could easily vary by platform. Considering the slowdown is a modest 0.806x at sub-microsecond scales, I question if it’s even worth the effort, especially since there is doubt about the implementation.
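
A minimal sketch of what I mean, reusing the zeros method from the first post (choose_storage and the cutoff of 64 are placeholders, not a measured value):

# Pick the storage with a plain runtime branch; downstream code then dispatches once per call.
choose_storage(n::Int; cutoff::Int = 64) =
    n <= cutoff ? zeros(n, n) : zeros(LowerTriangularStorage{RowMajor, Float64}, n)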

JET and Cthulhu are recursive: the former automatically and the latter manually.

Usually this code is called inside a simulator many times per integration step. Hence, this minor degradation can lead to a substantial slowdown.

See also the discussion in this thread: Triangular Arrays in julia? - #27 by mkborregaard

My feeling is that the factor of 2 storage benefit is not going to be worth it for the performance degradation of giving up BLAS, if you want to do linear algebra (e.g. triangular solves) with the matrix. YMMV depending on what you want to do with the array.

Actually I will use it only as storage. Thanks for the thread!

Now I saw something very strange and I really want to know why :slight_smile:

The original code for the derivative of the Legendre coefficients was:

            if m == 0
                ...
            elseif m == 1
                ...
            elseif n != m
                a_nm = +√(T(n + m) * T(n - m + 1)) / 2
                b_nm = -√(T(n + m + 1) * T(n - m)) / 2

                dP_nm = a_nm * P[iβ‚€ + n, jβ‚€ + m - 1] + b_nm * P[iβ‚€ + n, jβ‚€ + m + 1]

            else
                a_nm = +√(T(n + m) * T(n - m + 1)) / 2

                dP_nm = a_nm * P[iβ‚€ + n, jβ‚€ + m - 1]
            end

Now, if I change it to:

            if m == 0
                ...
            elseif m == 1
                ...
            else
                a_nm  = +√(T(n + m) * T(n - m + 1)) / 2
                dP_nm = a_nm * P[iβ‚€ + n, jβ‚€ + m - 1]

                if n != m
                    b_nm   = -√(T(n + m + 1) * T(n - m)) / 2
                    dP_nm += b_nm * P[iβ‚€ + n, jβ‚€ + m + 1]
                end
            end

The performance of both implementations is equal for small matrices. Is this supposed to happen?

After a lot of tests, this simple change in the if allowed the new triangular storage to be faster in all scenarios. Its performance when computing very high degrees is even better. I just hope that a compiler expert can tell me why this happened :slight_smile: It would be a very useful lesson for other scenarios.

I don’t understand the question. You probably want to provide a reproducer. Also, maybe open a new topic.

Can’t say much without something to tinker with, but it’s plausible at first glance that the change affected some code motion optimization or compacted the routine so it fits in the instruction cache a bit better.

The problem is that the only way to actually reproduce it is to copy that definition of the LowerTriangularStorage and execute it together with SatelliteToolboxLegendre.jl, as I showed in the first post.

I’m skeptical that this is really the only way to see what is going on. But it can take a fair amount of effort to reduce a problem to a minimal working example that illustrates the issue.

I completely agree! I meant that I was unable to reproduce the problem in any other way :smiley: I will keep trying.