@inbounds needs to go inside @threads:
What is the cause of this performance difference between Julia and Cython? - #4 by tkf
In general, try minimizing the scope of @inbounds.
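A minimal sketch of what I mean, assuming a made-up kernel (`colsumsq!` is a hypothetical name, not from the code under discussion). As far as I understand, `@threads` wraps the loop body in a closure, so an `@inbounds` placed outside the macro may not reach the indexing inside; putting it on the individual statements also keeps its scope small:

```julia
using Base.Threads: @threads

# Hypothetical kernel: out[j] = sum of squares of column j of A.
function colsumsq!(out::Vector{Float64}, A::Matrix{Float64})
    # Validate the shapes once, outside the hot loop, so skipping
    # bounds checks below is safe.
    length(out) == size(A, 2) || throw(DimensionMismatch("out and A do not match"))
    @threads for j in axes(A, 2)
        acc = 0.0
        for i in axes(A, 1)
            # @inbounds is inside the @threads body and covers one statement only.
            @inbounds acc += A[i, j]^2
        end
        @inbounds out[j] = acc
    end
    return out
end
```

Usage would be something like `colsumsq!(zeros(size(A, 2)), A)`.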
I think this needs @inbounds to compare this properly with LoopVectorization.jl and Tullio.jl. This also applies to @threads and @floop.
GC may re-use the same memory region. Maybe a more robust approach is to allocate many arrays, such that their total size in bytes is at least (say) twice the L3 cache size, and use them one by one.
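Something along these lines, as a rough sketch only. `make_buffers` and `time_cold` are hypothetical helpers, and the 32 MiB default for `l3_bytes` is an assumption; check the actual L3 size of your machine:

```julia
# Hypothetical helper: allocate enough length-n arrays that their total size
# is at least twice the (assumed) L3 cache size. All arrays stay referenced
# by the returned vector, so the GC cannot hand the same memory out again.
function make_buffers(n::Integer; l3_bytes::Integer = 32 * 2^20)
    bytes_per_array = n * sizeof(Float64)
    narrays = max(2, cld(2 * l3_bytes, bytes_per_array))
    return [rand(n) for _ in 1:narrays]
end

# Hypothetical helper: time f on each buffer exactly once, in order.
function time_cold(f, buffers)
    f(first(buffers))  # first call only triggers compilation; not timed
    times = Float64[]
    for x in Iterators.drop(buffers, 1)
        push!(times, @elapsed f(x))  # each timed call sees a different array
    end
    return minimum(times)
end
```

For example, `time_cold(sum, make_buffers(10^6))` times `sum` on arrays that were not touched by the previous call. Whether this fully avoids warm-cache effects depends on the hardware, so treat the numbers with some care.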