Unclear allocation behaviour with built-in sum()

mul!(B, A', v) seems to be what you’re looking for. Or just B = A' * v if you don’t have B preallocated. You might have to @view to trim off the τ==1 case, since you seem to skip that index, but it looks like your A already has the structure such that trimming it isn’t necessary.
mul!/* are already multithreaded (because they call BLAS for some AbstractArray types like Matrix{Float64}, which you appear to use here and BLAS is multithreaded).

Array slicing allocates new arrays, which is probably hurting you here. Consider using A_previous = @view A[:, end] and B[τ] = @view(A[:, τ]) ⋅ v in situations like this.

I didn’t get deep into the rest of your code because it’s a little complex and it’s early here, but maybe start with some of these pieces.

2 Likes