Linear algebra/algorithm: possible speedup for summing matrix commutators

With the matrix commutator [A,B] \equiv AB - BA, consider the sum \sum_{i=1}^k [M F_i, F_{k-i+1}], where M and the F_j are known matrices. I'm using something like the following, which allocates all buffers in advance:

using LinearAlgebra
using BenchmarkTools


function comu!(r, a, b) # r = a*b - b*a, with r pre-allocated
    idy = one(eltype(r))
    mul!(r, a, b)            # r = a*b
    mul!(r, b, a, -idy, idy) # r = -b*a + r = a*b - b*a
end


function cm_sum!(res_k, k, f, mx, tmpM, mcmu) # res_k .-= sum_{i=1}^k [mx*f[i], f[k-i+1]]
    for ik = 1:k
        mul!(tmpM, mx, f[ik])        # tmpM = mx*f[ik]
        comu!(mcmu, tmpM, f[k-ik+1]) # mcmu = [tmpM, f[k-ik+1]]
        res_k .-= mcmu               # accumulate (with a minus sign) into res_k
    end
end

ndim = 50
k = 60
r = zeros(Float64, ndim, ndim)
mx = rand(ndim, ndim)
t1 = similar(r)
t2 = similar(r)

fs = [rand(ndim, ndim) for j = 1:k]

@btime cm_sum!($r, $k, $fs, $mx, $t1, $t2)

834.737 μs (0 allocations: 0 bytes)
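
The same accumulation can also be folded directly into 5-arg mul! calls, which drops the mcmu buffer and the extra broadcast pass over res_k. A minimal sketch of that variant (cm_sum_fused! is just an illustrative name, not benchmarked above):

using LinearAlgebra

function cm_sum_fused!(res_k, k, f, mx, tmpM) # same result as cm_sum!, one buffer fewer
    α = one(eltype(res_k))
    for ik = 1:k
        fj = f[k-ik+1]
        mul!(tmpM, mx, f[ik])        # tmpM = mx*f[ik]
        mul!(res_k, tmpM, fj, -α, α) # res_k -= tmpM*fj
        mul!(res_k, fj, tmpM, α, α)  # res_k += fj*tmpM
    end
end

It should be callable as cm_sum_fused!(r, k, fs, mx, t1) in place of the call above, so the timings are directly comparable.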

Other than @inbounds and threading/parallelization, are there other Julia-specific tricks or more efficient algorithms for this sum?

Usage case: in real use, mx and fs are not random but given/calculated; res_k is one term in a large differential-equation system where k also loops from 1 up to some cutoff N. Since the differential equation is also solved over a time domain from 0 to t_final, and is iterated over thousands of initial values, this innermost loop is quite important performance-wise.
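
Since k itself runs from 1 to N within the same evaluation, one algorithmic saving that seems possible (assuming mx and the fs stay fixed while k varies) is to precompute gs[i] = mx*f[i] once and reuse it for every k; each term then costs two matrix multiplications instead of three, because [M F_i, F_{k-i+1}] = (M F_i) F_{k-i+1} - F_{k-i+1} (M F_i). A sketch with placeholder names gs and cm_sum_pre!:

using LinearAlgebra

gs = [mx * fi for fi in fs]  # gs[i] = mx*f[i], computed once per evaluation

function cm_sum_pre!(res_k, k, f, g) # res_k .-= sum_{i=1}^k [g[i], f[k-i+1]], with g[i] = mx*f[i]
    α = one(eltype(res_k))
    for ik = 1:k
        fj = f[k-ik+1]
        mul!(res_k, g[ik], fj, -α, α) # res_k -= g[ik]*fj
        mul!(res_k, fj, g[ik], α, α)  # res_k += fj*g[ik]
    end
end

Whether the saved mx*f[i] products actually matter would need benchmarking at the real sizes.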

Thanks!

GPU?