I thought the last section “The importance of higher-order inlining” was very interesting. I’ve been wondering whether there is a tension between loop fusion and SIMD optimizations. It seemed to me that A.*B.+C could end up faster than f(a, b, c) = a*b+c; f.(A, B, C) if the first version used SIMD. I thought the second version wouldn’t be able to do that, but now I see how inlining solves that problem.
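To make the comparison concrete, here is a rough sketch of the two versions, plus the kind of hand-written loop I understand the fused call to behave like once f is inlined. The array sizes and the helper name fused_loop! are my own, just for illustration, and I haven’t checked what the compiler actually emits:

```julia
# Scalar kernel from the comparison above.
f(a, b, c) = a * b + c

A, B, C = rand(8), rand(8), rand(8)

# Version 1: dotted operators written out directly.
fused_ops = A .* B .+ C

# Version 2: a dot call on the user-defined function.
fused_call = f.(A, B, C)

# Hypothetical helper: roughly the single loop the dot call amounts to
# once f is inlined into the loop body.
function fused_loop!(out, A, B, C)
    @inbounds for i in eachindex(out, A, B, C)
        out[i] = f(A[i], B[i], C[i])
    end
    return out
end

by_hand = fused_loop!(similar(A), A, B, C)

# All three should agree elementwise.
@assert fused_ops ≈ fused_call
@assert fused_call ≈ by_hand
```

Whether the loop really gets vectorized presumably depends on the CPU and the Julia version; running @code_llvm on fused_loop! and looking for vector types in the output should be one way to check.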
One thing I’m unclear about: if f contains some operations without SIMD support (erf?), will SIMD occur at all, or is that impossible? I’d think it could operate chunk by chunk, using SIMD operations where it can and handling the rest serially, but this is probably tricky.