This iteration structure is slow. Julia arrays are stored column-major, so iterate along columns, not rows: the innermost loop should run over the row index.
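As a rough illustration of the loop order, assuming a dense `Matrix` (the function names `sum_colmajor` and `sum_rowmajor` are made up for this sketch and are not from your code):

```julia
# Inner loop walks down a column: consecutive iterations touch adjacent memory.
function sum_colmajor(A::Matrix{Float64})
    s = 0.0
    for j in axes(A, 2)      # outer loop over columns
        for i in axes(A, 1)  # inner loop over rows within one column
            s += A[i, j]
        end
    end
    return s
end

# Same result with the loops swapped: the inner loop now jumps a full
# column length in memory on every iteration, which is the slow pattern.
function sum_rowmajor(A::Matrix{Float64})
    s = 0.0
    for i in axes(A, 1)      # outer loop over rows
        for j in axes(A, 2)  # inner loop strides across a row
            s += A[i, j]
        end
    end
    return s
end

A = collect(reshape(1.0:12.0, 3, 4))
sum_colmajor(A) == sum_rowmajor(A)  # same value, different memory access pattern
```

Both functions compute the same thing; only the memory access pattern differs, and the gap should only become noticeable once the matrix is too large to sit comfortably in cache.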
Note that sparse AD will absolutely murder your hand-written code in performance, though, since it would use a coloring vector with chunked ForwardDiff to SIMD multiple elements along the diagonal at the same time. I wouldn’t even want to show you the equivalent code because it would be nasty to write out by hand. Effectively, you’d clump columns together in chunks of 8 and iterate down those columns, then put another loop on top that blocks it, with the SIMD driven from that outer loop, writing into a densified matrix that matches the Tridiagonal structure.
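To give a feel for the coloring idea without the blocking and SIMD details, here is a minimal sketch. It assumes a toy three-partial dual number (`Dual3`, a hypothetical stand-in for a ForwardDiff chunk, not ForwardDiff's actual type) and a made-up test function `f!` whose Jacobian is tridiagonal. Columns `j`, `j+3`, `j+6`, ... never share a row in a tridiagonal matrix, so three colors are enough and one pass through `f!` recovers every nonzero entry:

```julia
using LinearAlgebra

# Toy dual number carrying 3 partials (hypothetical; stands in for one AD chunk).
struct Dual3
    val::Float64
    eps::NTuple{3,Float64}
end
Base.:+(a::Dual3, b::Dual3) = Dual3(a.val + b.val, a.eps .+ b.eps)
Base.:-(a::Dual3, b::Dual3) = Dual3(a.val - b.val, a.eps .- b.eps)
Base.:*(a::Dual3, b::Dual3) = Dual3(a.val * b.val, a.val .* b.eps .+ b.val .* a.eps)
Base.:*(c::Real, a::Dual3) = Dual3(c * a.val, c .* a.eps)

# Example function: y[i] = x[i]^2 - x[i-1] + 2x[i+1], so the Jacobian is tridiagonal.
function f!(y, x)
    n = length(x)
    for i in 1:n
        y[i] = x[i] * x[i]
        if i > 1
            y[i] = y[i] - x[i-1]
        end
        if i < n
            y[i] = y[i] + 2.0 * x[i+1]
        end
    end
    return y
end

function tridiag_jacobian(f!, x::Vector{Float64})
    n = length(x)
    colors = [mod1(j, 3) for j in 1:n]            # coloring vector: 1,2,3,1,2,3,...
    seed(c) = ntuple(k -> k == c ? 1.0 : 0.0, 3)  # unit partial in slot c
    xd = [Dual3(x[j], seed(colors[j])) for j in 1:n]
    yd = Vector{Dual3}(undef, n)
    f!(yd, xd)                                    # one call gives all 3 compressed columns
    dl = zeros(n - 1); d = zeros(n); du = zeros(n - 1)
    for j in 1:n                                  # decompress, walking column by column
        c = colors[j]
        d[j] = yd[j].eps[c]                       # J[j, j]
        j > 1 && (du[j-1] = yd[j-1].eps[c])       # J[j-1, j]
        j < n && (dl[j] = yd[j+1].eps[c])         # J[j+1, j]
    end
    return Tridiagonal(dl, d, du)
end

x = [1.0, 2.0, 3.0, 4.0, 5.0]
J = tridiag_jacobian(f!, x)
```

The real sparse AD tooling does the same compression with ForwardDiff's own dual numbers and a computed coloring, and adds the blocking on top. This sketch only shows why a single function evaluation is enough for a tridiagonal Jacobian.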