Why is copying using a loop much slower than `copy` for large arrays?

A possible explanation:
The built-in `copy` simply calls a built-in function, `jl_array_copy`, which in turn calls a built-in array constructor and `memcpy`. Your simple implementation cannot beat C's highly optimized `memcpy` (it is really smart and can sometimes even take advantage of special features offered by the operating system). The small-array case may come down to alignment issues or fixed overhead in `memcpy`: since you copy the elements directly and LLVM has more information about them, some of the logic inside `memcpy` can be skipped.
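If you want to check this on your own machine, here is a minimal sketch that times a hand-written loop against `copy`. The helper name `loop_copy` and the array sizes are my own assumptions, not taken from the original code, and the timings will vary with hardware and Julia version, so treat it as a rough comparison rather than a proper benchmark.

```julia
# Hypothetical helper: a plain element-by-element copy, standing in for the
# hand-written loop from the question.
function loop_copy(src::Vector{Float64})
    dst = Vector{Float64}(undef, length(src))
    @inbounds for i in eachindex(src)
        dst[i] = src[i]
    end
    return dst
end

small = rand(8)            # assumed "small" size
large = rand(10_000_000)   # assumed "large" size

# Run both once so compilation time is excluded from the measurements.
loop_copy(small); copy(small)

# Both approaches must produce the same contents.
@assert loop_copy(large) == copy(large)

t_loop = @elapsed loop_copy(large)
t_copy = @elapsed copy(large)
println("large array: loop = ", t_loop, " s, copy = ", t_copy, " s")
```

A single `@elapsed` call is far too noisy to say anything about the 8-element case. For that you would want to repeat each call many times, or use a benchmarking package.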
