Specialized matrix-matrix multiplication algorithm

Hi! I’m an undergrad and I’m trying to replicate a result from a paper for a fast matrix-matrix multiplication algorithm (specifically the Bernstein lift matrix). Unfortunately, my code seems to be some 2-3x slower than expected when compared against a naive matrix-matrix multiplication.

Here’s a (somewhat lengthy) MWE with precomputed values:

using SparseArrays
using BenchmarkTools
using LinearAlgebra

const L0 = sprand(36, 36, 0.164)
const tri_offset_table = (
    (0, 2, 3, 3, 2, 0, -3, -7, -12, -18, -25, -33, -42, -52, -63, -75, -88, -102, -117, -133, -150), 
    (0, 3, 5, 6, 6, 5, 3, 0, -4, -9, -15, -22, -30, -39, -49, -60, -72, -85, -99, -114, -130), 
    (0, 4, 7, 9, 10, 10, 9, 7, 4, 0, -5, -11, -18, -26, -35, -45, -56, -68, -81, -95, -110), 
    (0, 5, 9, 12, 14, 15, 15, 14, 12, 9, 5, 0, -6, -13, -21, -30, -40, -51, -63, -76, -90), 
    (0, 6, 11, 15, 18, 20, 21, 21, 20, 18, 15, 11, 6, 0, -7, -15, -24, -34, -45, -57, -70), 
    (0, 7, 13, 18, 22, 25, 27, 28, 28, 27, 25, 22, 18, 13, 7, 0, -8, -17, -27, -38, -50), 
    (0, 8, 15, 21, 26, 30, 33, 35, 36, 36, 35, 33, 30, 26, 21, 15, 8, 0, -9, -19, -30), 
    (0, 9, 17, 24, 30, 35, 39, 42, 44, 45, 45, 44, 42, 39, 35, 30, 24, 17, 9, 0, -10), 
    (0, 10, 19, 27, 34, 40, 45, 49, 52, 54, 55, 55, 54, 52, 49, 45, 40, 34, 27, 19, 10), 
    (0, 11, 21, 30, 38, 45, 51, 56, 60, 63, 65, 66, 66, 65, 63, 60, 56, 51, 45, 38, 30)) 
const l_j_table = (-3.5, 7.0, -8.75, 7.0, -3.5, 1.0, -0.125, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)

function reduction_multiply!(out, N, x, offset)
    row = 1
    @inbounds @fastmath for j in 0:N-1
        for i in 0:N-1-j
            k = N-1-i-j

            linear_index1 = i+1 + offset[j+1] + 1
            linear_index2 = i + offset[j+2] + 1
            linear_index3 = i + offset[j+1] + 1

            val = muladd((i+1), x[linear_index1], 
            muladd((j+1), x[linear_index2],
            muladd((k+1), x[linear_index3], 0.0)))
            out[row] = val/N
            row += 1
        end
    end
    return out
end

# E is a preallocated scratch vector used for intermediate results. N is the polynomial order.
function matrix_vector_multiply!(out, N, x, E)
    @inbounds @fastmath begin 
        LinearAlgebra.mul!(E, L0, x)
        index1 = div((N + 1) * (N + 2), 2)
        view(out,1:index1) .= E
        reduction_multiply!(E, N, E, tri_offset_table[N])
        for j in 1:N
            diff = div((N + 1 - j) * (N + 2 - j), 2)

            for i in 1:diff
                out[index1 + i] = l_j_table[j] * E[i]
            end

            if j < N
                index1 += diff
                reduction_multiply!(E, N-j, E, tri_offset_table[N-j])
            end
        end
    end
    return out
end

function matrix_matrix_multiply!(out, N, x, E)
    @simd for n in axes(x,2)
        @inbounds matrix_vector_multiply!(view(out,:,n), N, view(x,:,n), E)
    end
    return out
end

N = 7
Np2 = div((N + 1) * (N + 2), 2)
Np3 = div((N + 1) * (N + 2) * (N + 3), 6)
@benchmark matrix_vector_multiply!($(zeros(Np3)), $(N), $(rand(Float64, Np2)), $(zeros(Np2)))
@benchmark matrix_matrix_multiply!($(zeros(Np3, Np3)), $(N), $(rand(Float64, Np2, Np3)), $(zeros(Np2)))

I’m not quite sure if my matrix_matrix_multiply! function is the best way to stitch together the matrix-vector multiplications. When I compared to the mul! function, I found that my code is much faster for the matrix-vector multiplication, but slower for the matrix-matrix multiplication. Is there some better way to get a matrix-matrix multiplication from matrix-vector multiplication?

I’d appreciate any other pointers for where I might look to improve this! Thanks!

Maybe take a look at this tutorial on how general matrix-matrix multiplication is optimized for modern CPUs. I haven’t had a good look at your code or the underlying algorithm, but some (probably not all) of the techniques used there might be applicable to your case. In particular, doing matrix-matrix multiply purely in terms of matrix-vector multiplication is not optimal. You may see speedup by doing batches of a few (e.g., 4) vectors at a time, for example.
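
For illustration, the batched version might be structured something like this (a rough sketch; matrix_panel_multiply! is a hypothetical kernel you would still have to write, operating on a panel of B columns at once so that offsets and coefficients are reused across the whole panel, and E would likewise need to grow to a (length, B) scratch buffer):

function matrix_matrix_multiply_batched!(out, N, x, E; B = 4)
    n = size(x, 2)
    for base in 1:B:n
        cols = base:min(base + B - 1, n)
        # hypothetical panel kernel: computes all columns in `cols` together
        matrix_panel_multiply!(view(out, :, cols), N, view(x, :, cols), E)
    end
    return out
end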

const L0 = sprand(36, 36, 0.164) is probably too small and too dense to benefit from a sparse representation. I would guess that a standard Matrix{Float64} would be faster.
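
For example, a one-time conversion along these lines (L0_dense is just an illustrative name):

const L0_dense = Matrix(L0)  # mul!(E, L0_dense, x) then uses the dense BLAS kernel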

You write for loops in several places where broadcasting appears applicable, such as your loop performing out[index1 + i] = l_j_table[j] * E[i]. Though loops are not intrinsically slower than broadcast, there are a couple of performance traps in writing loops that broadcasting might help you avoid (not sure whether you hit any here).
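
For instance, assuming the indices are exactly as in your loop, that line could become a broadcast over views:

@views out[index1+1:index1+diff] .= l_j_table[j] .* E[1:diff]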

It looks like your tri_offset_table values follow a simple arithmetic pattern: tri_offset_table[N][j] == div((j-1) * (2N+4-j), 2), or equivalently tri_offset_table[N][j] == tri_offset_table[N][j-1] + (N+3-j). You can compute this inline (using either form), and it will probably be faster than looking it up in a pre-computed table.
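
That is, something like:

tri_offset(N, j) = div((j - 1) * (2N + 4 - j), 2)  # == tri_offset_table[N][j]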

Your @simd annotation almost certainly does nothing here. Likewise, @fastmath can only affect float operations and cannot penetrate into function calls, which means it’s only actually (potentially) affecting

val = muladd((i+1), x[linear_index1], 
      muladd((j+1), x[linear_index2],
      muladd((k+1), x[linear_index3], 0.0)))

But the only thing @fastmath is likely to do here is apply the muladds that you already wrote. I would either keep your muladds and drop @fastmath (which can have very surprising effects you might not intend) or use @fastmath (preferably just on that line) to have a nice-looking calculation that it can optimize for you. Also, muladd(a, b, 0) is worse than a * b (though it might be optimized to the same thing inside @fastmath).
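
Concretely, the two options would look something like this (a sketch using your variable names):

# Option 1: keep the explicit muladds, no @fastmath
val = muladd(i + 1, x[linear_index1],
      muladd(j + 1, x[linear_index2], (k + 1) * x[linear_index3]))

# Option 2: plain arithmetic, with @fastmath scoped to just this expression
val = @fastmath (i + 1) * x[linear_index1] + (j + 1) * x[linear_index2] + (k + 1) * x[linear_index3]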

Your use of @inbounds is precarious. While you probably haven’t made any mistakes here with your specific inputs, it seems that careless changes (not keeping N and the dimensions of your arrays consistent) could easily result in out-of-bounds access. Consider making more use of functions like axes, eachindex, etc. in your loop ranges, and the compiler can usually add correct inbounds effects for you.

Thanks so much for the reply!

In particular, doing matrix-matrix multiply purely in terms of matrix-vector multiplication is not optimal. You may see speedup by doing batches of a few (e.g., 4) vectors at a time, for example.

My matrix-vector multiplication is more involved than just taking an inner product, and the tutorial, to my understanding, is mostly about how to vectorize those inner products. It looks like it would be pretty difficult to do something similar here?

const L0 = sprand(36, 36, 0.164) is probably too small and too dense to benefit from a sparse representation. I would guess that a standard Matrix{Float64} would be faster.

This seems accurate – I actually realized that multiplying by L0 is equivalent to two matrix-vector multiplications, so I removed this part entirely since the matrix-vector multiplications were faster.

It looks like your tri_offset_table values follow a simple arithmetic pattern: tri_offset_table[N][j] == div((j-1) * (2N+4-j), 2), or equivalently tri_offset_table[N][j] == tri_offset_table[N][j-1] + (N+3-j). You can compute this inline (using either form), and it will probably be faster than looking it up in a pre-computed table.

The lookup and the inline computation seem to be about the same speed, but I didn’t realize there was a nice formula for it. Thanks!

Also, muladd(a, b, 0) is worse than a * b (though might be optimized to the same thing inside @fastmath).

They didn’t optimize to the same thing – a * b is faster than muladd(a, b, 0)! I’ve also removed all the @fastmaths.

On the @inbounds use, I agree that it’s risky. I have all of my arrays generated from N right now to try to keep it consistent. I’m not really iterating over any explicit axes, but over triangular indices, so I don’t think there’s a nice way to make this safer, unfortunately.

See the following discussion on what’s involved in implementing a fast matrix–matrix multiplication routine in pure Julia code (or any compiled language, for that matter), along with some example code: Julia matrix-multiplication performance

Also these course notes: 18335/notes/Memory-and-Matrices-latest.ipynb at spring21 · mitmath/18335 · GitHub

(A fundamental problem with implementing matrix–matrix multiplication as a sequence of matrix–vector products is that it has poor temporal memory locality, so it won’t take full advantage of the caches and you will end up being memory-bound. To do better, you have to divide the matrices into submatrix blocks rather than simply into columns; there are a variety of strategies for this, as discussed in the link above. But to get the last factor of 2–3 in performance, not including multi-threading, is pretty hard; you have to heavily optimize the low-level kernels.)
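
For a rough picture of the blocking idea, a minimal sketch for plain dense C = A*B (nowhere near a tuned kernel, but it shows the submatrix-block loop structure):

function blocked_matmul!(C, A, B; bs = 64)
    fill!(C, zero(eltype(C)))
    m, n = size(C)
    kk = size(A, 2)
    for jb in 1:bs:n, lb in 1:bs:kk, ib in 1:bs:m   # loop over submatrix blocks
        for j in jb:min(jb + bs - 1, n), l in lb:min(lb + bs - 1, kk)
            b = B[l, j]
            @inbounds @simd for i in ib:min(ib + bs - 1, m)   # contiguous column access
                C[i, j] = muladd(A[i, l], b, C[i, j])
            end
        end
    end
    return C
end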

Don’t forget to add BLAS.set_num_threads(1) at the top of your script for fair benchmarking; otherwise, mul! will run multithreaded.

I’ll remark that your reduction_multiply! function can be transformed to a matrix-vector multiply where the matrix depends only on N. By my reckoning, the matrix has size (Np3, Np2) with 3N(N-1)/2 nonzeros (maybe slightly fewer if some of your indices overlap). The non-tiny size and small number of nonzeros make it a good candidate for a sparse matrix (marginal for N<5, perhaps, but at those small sizes it probably hardly matters).

While a sparse matrix representation is technically less efficient than a hand-written kernel for this particular problem, it is probably almost as fast, and all the complicated optimizations have been handled by the library writers. In particular, sparse-times-dense matrix multiplies are handled efficiently, and could be used in your matrix_matrix_multiply! instead of per-vector calculations. Ideally, you would write only the matrix-matrix version and let the vector version piggyback on it as a 1-column matrix.
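
For a single application of reduction_multiply! at order N, the matrix could be assembled once along these lines (a sketch assuming the same index pattern as your loop and the inline offset formula from earlier in the thread; verify the offsets before trusting it):

using SparseArrays

function reduction_matrix(N)
    off(j) = div((j - 1) * (2N + 4 - j), 2)  # == tri_offset_table[N][j]
    rows, cols, vals = Int[], Int[], Float64[]
    r = 1
    for j in 0:N-1, i in 0:N-1-j
        k = N - 1 - i - j
        # the same three contributions as in reduction_multiply!, scaled by 1/N;
        # sparse() sums duplicate entries, which handles any coinciding indices
        append!(rows, (r, r, r))
        append!(cols, (i + 1 + off(j + 1) + 1, i + off(j + 2) + 1, i + off(j + 1) + 1))
        append!(vals, ((i + 1) / N, (j + 1) / N, (k + 1) / N))
        r += 1
    end
    return sparse(rows, cols, vals, div(N * (N + 1), 2), div((N + 1) * (N + 2), 2))
end

A call like reduction_matrix(N) * X then works the same whether X is a vector or a matrix.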

That aside, I did a little reorganizing and wrote this version, which I think is equivalent to yours. I don’t expect it to be noticeably faster (if faster at all), but maybe it’s a tad easier to work from. It’s possible I made some small offset errors in the indexing, so verify it’s correct before you consider using it.

function reduction_multiply2!(out, N, x)
    # recommended to compute N from the dimensions of out and x instead, and check that those dimensions fit
    row = 1
    oj2 = 0
    for j1 in 1:N # == j+1
        oj1 = oj2 # tri_offset_table[N][j+1]
        oj2 = oj1 + N + 2 - j1 # tri_offset_table[N][j+2]; N+2-j1 == N+1-j
        for i1 in 1:N-j1+1 # == i+1
            k1 = N+2-i1-j1 # == k+1

            linear_index1 = i1 + oj1 + 1
            linear_index2 = i1 + oj2 # == i + offset[j+2] + 1
            linear_index3 = i1 + oj1

            # could consider adding @inbounds, which will probably make this a tad faster
            out[row] = @fastmath (i1 * x[linear_index1] +
                j1 * x[linear_index2] +
                k1 * x[linear_index3]) / N
            row += 1
        end
    end
    return out
end
