How to improve the Thomas algorithm for block tridiagonal matrices

Using * instead of .* means that the multiplication is not fused with the .-=: it allocates a separate array, in a separate loop. See the “more dots” performance tip. You might also want to read this article about Julia’s “dot” notation and what it does.
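To make the difference concrete, here is a minimal sketch. It assumes the factor is a scalar (for a matrix block, * is a matrix product and .* is elementwise, so they are not interchangeable). The function names and data are hypothetical, not taken from your code.

```julia
# Hypothetical helpers, for illustration only.
function unfused!(y, a, x)
    y .-= a * x     # `a * x` builds a temporary array, then a second loop subtracts it
    return y
end

function fused!(y, a, x)
    y .-= a .* x    # multiply and subtract fuse into one loop, no temporary array
    return y
end

x  = rand(1000)
y1 = rand(1000)
y2 = copy(y1)

# Warm-up calls so that compilation is not counted below.
unfused!(y1, 2.0, x)
fused!(y2, 2.0, x)

println("unfused: ", @allocated(unfused!(y1, 2.0, x)), " bytes")
println("fused:   ", @allocated(fused!(y2, 2.0, x)), " bytes")
```

The unfused version should report roughly the size of one temporary vector per call, while the fused one should report little or nothing.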

Also, b[n*(i-1)+1:n*i] allocates a new array, as does b[n*i+1:n*(i+1)] on the right-hand side, since you aren’t using @views. See the “consider using views” performance tip. It’s probably best to just put @views in front of the function keyword so that it applies everywhere in the function, since right now you are allocating lots of copies for slices. (Then you can get rid of the individual @view calls.)
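As a rough sketch of what that looks like, the hypothetical helper below subtracts block i+1 of b from block i in place. The name subtract_next_block! and the toy data are assumptions for illustration, not your actual function.

```julia
# `@views` in front of `function` turns every slice in the body into a view,
# so neither the left-hand nor the right-hand slice makes a copy.
@views function subtract_next_block!(b, n, i)
    b[n*(i-1)+1:n*i] .-= b[n*i+1:n*(i+1)]
    return b
end

b = collect(1.0:6.0)            # two blocks of length 3
subtract_next_block!(b, 3, 1)   # block 1 becomes [1,2,3] - [4,5,6]
println(b)
```

Without @views, the same line would copy both slices before the broadcast runs.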

I thought you didn’t want to overwrite the input array?

Why are you using a separate pre-allocated buffer D_buf[i] for every loop iteration, rather than just allocating a single buffer and re-using it?

Why not use mul! with a buffer here? Similarly for the U[i] * operation later.
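Something along these lines is what I have in mind: a minimal sketch with one buffer that is allocated once and reused in every iteration. The function name apply_upper!, the loop direction, and the random test data are all assumptions for illustration; adapt the indices to your actual sweep.

```julia
using LinearAlgebra

# Hypothetical update: block i of b gets U[i] * (block i+1 of b) subtracted from it.
# `buf` is a single work vector of length n, shared by all iterations.
@views function apply_upper!(b, U, buf)
    n = length(buf)
    for i in 1:length(U)
        mul!(buf, U[i], b[n*i+1:n*(i+1)])   # buf = U[i] * block, written in place
        b[n*(i-1)+1:n*i] .-= buf            # fused in-place subtraction
    end
    return b
end

n, m = 3, 4                          # block size and number of blocks (made up)
U = [rand(n, n) for _ in 1:m-1]
b = rand(n * m)

# Allocating reference version of the same update, for comparison.
b_ref = copy(b)
for i in 1:m-1
    b_ref[n*(i-1)+1:n*i] -= U[i] * b_ref[n*i+1:n*(i+1)]
end

apply_upper!(b, U, zeros(n))
println(b ≈ b_ref)
```

On recent Julia versions, the five-argument form mul!(C, A, B, α, β) can fold the subtraction into the product, which would remove the need for the buffer altogether.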