julia> function copy2(A::Array)
           B = similar(A)
           @inbounds @simd for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copy2 (generic function with 1 method)
julia> A = rand(2000, 2000);
julia> @btime copy($A);
4.162 ms (2 allocations: 30.52 MiB)
julia> @btime copy2($A);
6.213 ms (2 allocations: 30.52 MiB)
julia> A = rand(20, 20);
julia> @btime copy($A);
459.298 ns (1 allocation: 3.25 KiB)
julia> @btime copy2($A);
432.207 ns (1 allocation: 3.25 KiB)
julia> copy(A) == copy2(A)
true
So copy2 is faster for small arrays, where the time doesn’t matter much anyway, but significantly slower for large arrays. How can I get that C-like performance in Julia?
My versioninfo:
julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
JULIA_EDITOR = subl
A possible explanation:
The built-in copy simply calls the builtin function jl_array_copy, which in turn calls a builtin array constructor and memcpy. Your simple implementation cannot beat C’s highly optimized memcpy (memcpy is really smart and can sometimes even use special features offered by the operating system). The small-array case may come down to alignment or per-call overhead in memcpy: since you copy directly and LLVM has more information at the call site, some of memcpy’s internal logic can be skipped.
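To see concretely that this path boils down to memcpy, here is a sketch that calls libc’s memcpy directly from Julia via ccall. This mimics what the built-in copy ultimately does; the function name copy_memcpy and the isbits restriction are illustrative choices, not the actual internals of jl_array_copy.

```julia
# Sketch: copy an Array by calling libc's memcpy directly via ccall.
# Only valid for isbits element types, where elements are stored inline.
function copy_memcpy(A::Array{T}) where {T}
    isbitstype(T) || error("this sketch only handles isbits element types")
    B = similar(A)
    # GC.@preserve keeps A and B alive while we hold raw pointers to them.
    GC.@preserve A B begin
        ccall(:memcpy, Ptr{Cvoid},
              (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              pointer(B), pointer(A), sizeof(A))
    end
    return B
end

A = rand(2000, 2000)
copy_memcpy(A) == copy(A)  # true
```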
I get timing results similar to the OP’s (4 ms copy vs 6 ms copy2), and I also use a Linux system, so the difference may be explained by different implementations of memcpy on different OSes or architectures. Anyway, I think sticking to the built-in implementation is always a good choice: just let the core developers handle these things, as they are more experienced with such micro-optimizations than normal users.
I think this is memory-bandwidth bound, not compute bound. So if every thread can access memory independently, multithreading could greatly improve performance. But all the implementations I know of are single-threaded, and the standard might not specify whether memcpy should be single- or multi-threaded. That is reasonable, because it’s a standard library function that provides basic utilities and deals with the common cases…
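A minimal sketch of such a multithreaded copy, using plain Threads.@threads (this assumes Julia was started with multiple threads, e.g. julia -t 8; whether it actually helps depends on whether one core can already saturate memory bandwidth):

```julia
# Sketch: a multithreaded elementwise copy. The name copy_threaded! is
# illustrative, not a standard-library function.
function copy_threaded!(B, A)
    axes(B) == axes(A) || throw(DimensionMismatch("B and A must match"))
    # eachindex(A, B) is a unit range for plain Arrays, so @threads can
    # split it evenly across the available threads.
    Threads.@threads for i in eachindex(A, B)
        @inbounds B[i] = A[i]
    end
    return B
end

A = rand(2000, 2000); B = similar(A)
copy_threaded!(B, A) == A  # true
```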
julia> function copyto2!(B, A)
           @inbounds @simd for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copyto2! (generic function with 1 method)
julia> using LoopVectorization
julia> copytont!(B, A) = vmapnt!(identity, B, A)
copytont! (generic function with 1 method)
julia> A = rand(2000, 2000); B = similar(A);
julia> @btime copyto!($B, $A);
2.724 ms (0 allocations: 0 bytes)
julia> @btime copyto2!($B, $A);
3.933 ms (0 allocations: 0 bytes)
julia> @btime copytont!($B, $A);
2.952 ms (0 allocations: 0 bytes)
So copyto! still wins, but copytont! is much closer.
If you want multithreaded:
julia> copytontt!(B, A) = vmapntt!(identity, B, A)
copytontt! (generic function with 1 method)
julia> @btime copytontt!($B, $A);
498.435 μs (0 allocations: 0 bytes)
Aside from the fact that copyto! was still faster than vmapnt! at large sizes, it should be faster at small sizes, too. Non-temporal stores must be aligned, and they’ll hurt performance if you need to read the data again soon.
memmove and memcpy should/will do the right thing as a function of size.
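That size-dependent behavior can be sketched in Julia too: use a simple SIMD loop for small arrays (low overhead) and fall back to the built-in copyto! (which reaches memcpy) for large ones. The function name copy_hybrid! and the 1024-element cutoff are illustrative guesses, not tuned constants.

```julia
# Sketch: hybrid copy that picks a strategy based on array size.
function copy_hybrid!(B, A)
    if length(A) <= 1024
        # Small arrays: a plain SIMD loop avoids any call overhead.
        @inbounds @simd for i in eachindex(A, B)
            B[i] = A[i]
        end
        return B
    else
        # Large arrays: delegate to the built-in, memcpy-backed path.
        return copyto!(B, A)
    end
end

A = rand(20, 20); B = similar(A)
copy_hybrid!(B, A) == A  # true
```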