@inbounds needs to go inside @threads. See: What is the cause of this performance difference between Julia and Cython? - #4 by tkf
In general, try minimizing the scope of @inbounds.
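For example, here is a minimal sketch of what I mean. The function name scale_threaded! and its arguments are made up for illustration and are not from the code under discussion. The @inbounds sits on the single indexing expression inside the @threads loop body, and eachindex(ys, xs) checks that the two arrays share the same indices before the loop starts:

```julia
using Base.Threads: @threads

# Hypothetical kernel for illustration: ys[i] = a * xs[i], multithreaded.
function scale_threaded!(ys, xs, a)
    # eachindex(ys, xs) throws a DimensionMismatch if the indices differ,
    # so skipping bounds checks on ys[i] and xs[i] below is safe.
    @threads for i in eachindex(ys, xs)
        # @inbounds goes inside the @threads body and covers only this line.
        @inbounds ys[i] = a * xs[i]
    end
    return ys
end

xs = collect(1.0:10.0)
ys = similar(xs)
scale_threaded!(ys, xs, 2.0)
```

Keeping the annotation on one expression also makes it easy to see which accesses have to be proven in-bounds.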
I think this needs @inbounds for a fair comparison with LoopVectorization.jl and Tullio.jl. The same applies to the @threads and @floop versions.
GC may re-use the same memory region, so a freshly allocated array can still be sitting in cache. Maybe a more robust approach is to allocate many arrays up front, such that their total size in bytes is at least (say) twice the L3 cache size, and use them one by one.
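Something like the following rough sketch, using only Base. The helper names make_pool and time_cold are hypothetical, and the 32 MiB L3 size is an assumption; substitute the value for your machine. Holding all arrays in one pool keeps them alive, so the GC cannot hand the same region back, and cycling through a pool larger than the cache means each call should see mostly cold data:

```julia
# Hypothetical helper: allocate enough length-n Float64 arrays that the
# pool occupies at least min_bytes in total.
function make_pool(n, min_bytes)
    narrays = cld(min_bytes, n * sizeof(Float64))
    return [rand(n) for _ in 1:narrays]
end

# Hypothetical helper: average time of f over the pool, one array per call.
function time_cold(f, pool)
    # Warm up compilation on the last array; the loop reaches it only after
    # touching every other array in the pool.
    f(last(pool))
    total = 0.0
    for xs in pool
        total += @elapsed f(xs)
    end
    return total / length(pool)
end

l3_bytes = 32 * 2^20                  # assumed L3 size, not measured
pool = make_pool(2^16, 2 * l3_bytes)  # total size is at least twice the L3
avg_seconds = time_cold(sum, pool)
```

A single @elapsed per call is noisy for fast kernels, so this only makes sense when one call takes well over a microsecond.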