Random memory access performance - how to hint compiler/runtime to preload/cache/plan

Hi,

I have some constant-size matrices (so I use the StaticArrays package, which I verified is faster than Array/Vector) that I access randomly many times in a loop.

After the random reads/writes, I need to copy part of one array into another (see code below).
The first copy after each round of random reads/writes is surprisingly slow.
But if I copy the same subset of the array again immediately afterwards, it is much faster.

This happens even with compiler optimizations disabled, and even with GC disabled.

How can I tell Julia/LLVM/C/OS/CPU my planned memory access pattern so that it optimizes for it, or keeps the matrix on the stack instead of moving it to the heap?

Unfortunately I wasn’t able to reproduce it exactly in standalone code, but I will do my best to show the main parts:

using BenchmarkTools
using LoopVectorization
using StaticArrays

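const mem_size = 16
const mem_type = Float32
# Julia arrays are column-major, so each mem_size-element record is one contiguous column
mem_src = MArray{Tuple{mem_size,8000},mem_type,2,mem_size * 8000}(undef)
mem_dst = MArray{Tuple{mem_size,64},mem_type,2,mem_size * 64}(undef)
addr_src = 100   # source column
addr_dst = 5     # destination column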

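# Copy n elements of mem_type from src to dest with libc memcpy.
# The caller must keep the owning arrays alive (GC.@preserve) during the call.
function my_copyto!(dest::Ptr{mem_type}, src::Ptr{mem_type}, n)
    ccall(:memcpy, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
          dest, src, n * sizeof(mem_type))
end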

# high-level method to copy memory efficiently
@inbounds @turbo for i = 1:mem_size
    mem_dst[i, addr_dst] = mem_src[i, addr_src]
end

# low-level method to copy memory efficiently
# (column addr starts mem_size * (addr - 1) elements past the start of the data)
dst_ptr = Base.unsafe_convert(Ptr{mem_type}, pointer_from_objref(mem_dst)) +
    sizeof(mem_type) * mem_size * (addr_dst - 1)
src_ptr = Base.unsafe_convert(Ptr{mem_type}, pointer_from_objref(mem_src)) +
    sizeof(mem_type) * mem_size * (addr_src - 1)
GC.@preserve mem_dst mem_src my_copyto!(dst_ptr, src_ptr, mem_size)

# also tried copyto!(), unsafe_copyto!(), and assignment through '=' and '.='
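
To make the symptom concrete, here is a rough sketch of how the first copy and the immediate repeat could be timed separately. It is not my real code: the helper names (`copy_column!`, `random_touch!`, `first_vs_repeat`) are made up for illustration, it uses a plain `Matrix{Float32}` instead of `MArray` to stay dependency-free, and it may well not reproduce the slowdown.

```julia
using Random

# Hypothetical helper: copy one column of src into one column of dst.
function copy_column!(dst::Matrix{Float32}, src::Matrix{Float32}, col_dst::Int, col_src::Int)
    @inbounds for i in axes(src, 1)
        dst[i, col_dst] = src[i, col_src]
    end
    return dst
end

# Hypothetical helper: read and write n randomly chosen elements of src.
function random_touch!(src::Matrix{Float32}, rng::AbstractRNG, n::Int)
    for _ in 1:n
        k = rand(rng, eachindex(src))
        src[k] = src[k] + 1.0f0
    end
    return nothing
end

# Average time (ns) of the first copy after random accesses vs. the same copy repeated.
function first_vs_repeat(; rows = 16, cols_src = 8000, cols_dst = 64, rounds = 1000)
    rng = MersenneTwister(1)
    src = rand(rng, Float32, rows, cols_src)
    dst = zeros(Float32, rows, cols_dst)
    copy_column!(dst, src, 1, 1)      # warm-up so JIT compilation is not timed
    first_ns = UInt64(0)
    repeat_ns = UInt64(0)
    for _ in 1:rounds
        random_touch!(src, rng, 10_000)
        col_src = rand(rng, 1:cols_src)
        col_dst = rand(rng, 1:cols_dst)
        t0 = time_ns()
        copy_column!(dst, src, col_dst, col_src)   # first copy after random accesses
        t1 = time_ns()
        copy_column!(dst, src, col_dst, col_src)   # same copy again
        t2 = time_ns()
        first_ns += t1 - t0
        repeat_ns += t2 - t1
    end
    return (first_ns = first_ns / rounds, repeat_ns = repeat_ns / rounds)
end
```

Calling `first_vs_repeat()` returns the two averages in nanoseconds. The resolution of `time_ns()` is coarse compared to a 16-element copy, so the numbers are only indicative.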

I’ve gone through everything I could find and ended up with what is supposed to be the most optimized way to copy a block of memory from one location to another.

But there is still something elusive. Is it paging? Is it stack/heap? Is it CPU cache?

Please help with any advice/hint/direction. Thanks!