Slower @threads than serial for array computations

Many thanks! Using your new approach on my machine, I’m now getting ~2.4 seconds for the calculations. As you said, the improvements are due to inlining and using views. At the moment, I’m not seeing a difference when substituting @inbounds @simd for @avx. Passing a function kf::F instead of K::AbstractKernel doesn’t change the performance either. Your version currently has dotkernel hard-coded in your kernelfn function, but that doesn’t seem to matter, probably because the compiler takes care of it.
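
For anyone following along, here is a minimal sketch of the two ways of passing the kernel that I compared. The names DotKernel, gram! and the exact kernelfn signatures are my own stand-ins for illustration, not the code from the earlier post. As far as I understand it, both variants should compile to the same specialized loop: DotKernel is a singleton type and every Julia function has its own type, so with the type parameters in place the call can be resolved at compile time.

```julia
# Hypothetical stand-ins: the thread's real kernelfn and kernel types may differ.
abstract type AbstractKernel end
struct DotKernel <: AbstractKernel end

# Dot product using the @inbounds @simd annotations mentioned above.
function dotkernel(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

# Variant A: select the kernel by dispatching on a kernel type.
kernelfn(::DotKernel, x, y) = dotkernel(x, y)

# Variant B: pass the function itself; the type parameter F lets Julia
# specialize on the concrete function being passed.
kernelfn(kf::F, x, y) where {F<:Function} = kf(x, y)

# Fill G[i, j] with the kernel of columns i and j of X.
# Views avoid copying columns; `k::K where {K}` forces specialization
# even though k is only passed through to kernelfn.
function gram!(G, k::K, X) where {K}
    for j in axes(X, 2), i in axes(X, 2)
        G[i, j] = kernelfn(k, view(X, :, i), view(X, :, j))
    end
    return G
end

X = rand(3, 4)
GA = gram!(zeros(4, 4), DotKernel(), X)   # kernel chosen by type
GB = gram!(zeros(4, 4), dotkernel, X)     # kernel passed as a function
```

Both calls run the same dotkernel loop, so the results should match and, in my understanding, so should the timings; checking with @code_typed or BenchmarkTools on your own machine is the way to confirm.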