To follow up: my earlier benchmarks were somewhat off because I did not account for the higher-level caches, which stay hot in the benchmark loop and are large enough to hold a sizable fraction of the table. Just to show you how fast your CPU is compared to its puny memory bus (repeat on your target computer):
julia> using BenchmarkTools
julia> const sqrttable=[sqrt(i) for i=1:10_000_000];
julia> sqrt_tab_ib(x)=(@inbounds r= sqrttable[x]; r);
julia> vals=rand(1:10_000_000, 10_000_000);
julia> function cpx(V)
           res = similar(V)
           @simd for i = 1:length(V)
               @inbounds res[i] = V[i]
           end
           res
       end
julia> @btime copy(vals);
48.847 ms (2 allocations: 76.29 MiB)
julia> @btime cpx(vals);
39.668 ms (2 allocations: 76.29 MiB)
julia> @btime sqrt.(vals);
66.954 ms (26 allocations: 76.30 MiB)
julia> @btime sqrt_tab_ib.(vals);
159.656 ms (26 allocations: 76.30 MiB)
Vectorized sqrt is not much slower than memcopy from/to main memory. I have no idea why the default copy appears to be slow on my system (maybe I borked my 0.6 sysimg?).
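In the table-lookup case, each entry costs 1 write and 1+8 reads (the index, plus a full cache line of 8 Float64 for the table entry). In the memcpy case we need 1 write and 1 read. That gives an expected slowdown factor of (1+1+8)/(1+1) = 5; the measured factor is about 4 (159.7 ms vs. 39.7 ms), which I’d call close enough.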
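The estimate can be written out as a few lines of arithmetic. This is only a rough sketch: it assumes 64-byte cache lines, that every table lookup misses the cache, and Julia 1.x syntax for `round` (on 0.6 it would be `round(x, 2)`). The variable names are made up for illustration.

```julia
# Memory traffic per element, in units of one Float64 (8 bytes).
# Assumption: every table lookup misses and pulls in a full 64-byte cache line.
line_floats = 64 ÷ sizeof(Float64)      # 8 Float64 per cache line
lookup_cost = 1 + 1 + line_floats       # write result + read index + read table line
copy_cost   = 1 + 1                     # write + read
expected    = lookup_cost / copy_cost   # predicted slowdown of lookup vs. plain copy

# Timings quoted above: sqrt_tab_ib.(vals) vs. cpx(vals), in ms.
measured = 159.656 / 39.668

println("expected factor: ", expected)
println("measured factor: ", round(measured, digits=2))
```

The measured factor coming out below the estimate would be consistent with some lookups hitting in cache, but that is a guess, not something these numbers prove.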
Postscript: My system sucks. The problem is not Julia’s slow copy; it is that memcpy is significantly slower than the SIMD copy on my system.
julia> function memcpx(V)
           res = similar(V)
           ccall(:memcpy, Ptr{Void}, (Ptr{Void}, Ptr{Void}, Csize_t), pointer(res), pointer(V), sizeof(V))
           res
       end
julia> @btime memcpx(vals);
47.814 ms (2 allocations: 76.29 MiB)
Postscript 2: Meh, I’m really bad at guessing what I’m benchmarking. The buffers need to be pre-allocated. The copy benchmarks above do not measure memory speed; they probably measure how fast the kernel is at faulting in freshly zeroed memory.
julia> memcpy(dst,src)=ccall(:memcpy, Ptr{Void}, (Ptr{Void}, Ptr{Void},Csize_t), pointer(dst), pointer(src), sizeof(src));
julia> src=zeros(10_000_000); dst=similar(src);
julia> @btime memcpy(dst,src)
8.518 ms (7 allocations: 112 bytes)
julia> @btime zeros(10_000_000);
36.124 ms (2 allocations: 76.29 MiB)
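For anyone who wants to repeat this without `ccall`, here is a minimal sketch of the same idea in plain Julia. It assumes Julia 1.x (`copyto!`; on 0.6 the equivalent is `copy!`) and uses `@elapsed` from Base instead of BenchmarkTools, so the numbers will be noisier than `@btime`. The bandwidth figure counts one read plus one write per byte, which is my own accounting and not something the hardware reports.

```julia
# Time a copy into a pre-allocated destination whose pages were already touched,
# so that page faults and allocation are not part of the measurement.
n   = 10_000_000
src = zeros(n)
dst = similar(src)
fill!(dst, 1.0)          # touch every page of dst before timing
copyto!(dst, src)        # warm-up run (also triggers compilation)

# Take the best of several runs to reduce noise.
t = minimum(@elapsed(copyto!(dst, src)) for _ in 1:10)

# Rough bandwidth estimate: each byte is read once and written once.
gbps = 2 * sizeof(src) / t / 1e9
println("copy: ", round(t * 1e3, digits=3), " ms, roughly ", round(gbps, digits=1), " GB/s")
```

Comparing this against the time for `zeros(n)` on the same machine gives a feel for how much of the earlier numbers was allocation and page faulting rather than copying.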