Why is copying with a loop so much slower than `copy` for large arrays?

julia> function copy2(A::Array)
           B = similar(A);
           @inbounds @simd for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copy2 (generic function with 1 method)

julia> A = rand(2000, 2000);

julia> @btime copy($A);
  4.162 ms (2 allocations: 30.52 MiB)

julia> @btime copy2($A);
  6.213 ms (2 allocations: 30.52 MiB)

julia> A = rand(20, 20);

julia> @btime copy($A);
  459.298 ns (1 allocation: 3.25 KiB)

julia> @btime copy2($A);
  432.207 ns (1 allocation: 3.25 KiB)

julia> copy(A) == copy2(A)
true

So copy2 is faster for small arrays, where the time doesn’t matter as much, but significantly slower for larger arrays. How can I get that C-like performance in Julia?

My versioninfo:

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = subl

A possible explanation:
The built-in copy simply calls the builtin function jl_array_copy, which in turn calls a builtin array constructor and memcpy. Your simple implementation cannot beat C’s highly optimized memcpy (memcpy is really smart; it can sometimes even use special features offered by the operating system). The small-array case may come down to alignment or overhead in memcpy: since you copy directly and LLVM has more information, some of memcpy’s logic can be skipped.
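For illustration, a minimal sketch of that path, calling libc’s memcpy directly via ccall. The name copy_memcpy is made up for this example, and it assumes a dense `Array{Float64}` (isbits elements, contiguous memory):

```julia
# Hypothetical sketch: allocate, then hand the whole block to libc's memcpy,
# roughly what the builtin copy ends up doing for isbits element types.
function copy_memcpy(A::Array{Float64})
    B = similar(A)
    GC.@preserve A B ccall(:memcpy, Ptr{Cvoid},
        (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
        pointer(B), pointer(A), sizeof(A))
    return B
end
```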


Non-temporal stores. You can try `LoopVectorization.vmapnt!(identity, y, x)`.


Does memcpy use multiple threads internally?

julia> A = rand(2000, 2000);

julia> @btime copy($A);
  12.939 ms (2 allocations: 30.52 MiB)

julia> @btime copy2($A);
  8.595 ms (2 allocations: 30.52 MiB)

julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 48 × Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 1 on 96 virtual cores

On my machine, copy2 is faster. :person_shrugging:


I get timing results similar to the OP’s (4 ms copy vs 6 ms copy2), and I also use Linux, so the difference may be explained by different implementations of memcpy on different OSes or architectures. Anyway, I think sticking to the builtin implementation is generally a good choice; let the core developers handle these things, as they are more practiced at these micro-optimizations than normal users.

I think this is memory-bandwidth bound, not compute bound. So if every thread can access memory independently, performance could improve greatly. But all the implementations I know of are single-threaded, and the standard probably doesn’t specify whether memcpy should be single- or multi-threaded. That is reasonable, because it’s a standard-library function that provides basic utilities and handles common cases.
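For completeness, a sketch of the threaded variant being discussed. This is not how memcpy works; it just splits the copy loop across Julia threads (start Julia with e.g. `julia -t 8`), and whether it helps depends entirely on available memory bandwidth:

```julia
# Hypothetical sketch: split the element-wise copy across Julia threads.
function copy_threaded!(B, A)
    Threads.@threads for i in eachindex(A, B)
        @inbounds B[i] = A[i]
    end
    return B
end
```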

I get:

julia> function copyto2!(B, A)
           @inbounds @simd for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copyto2! (generic function with 1 method)

julia> using LoopVectorization

julia> copytont!(B, A) = vmapnt!(identity, B, A)
copytont! (generic function with 1 method)

julia> A = rand(2000, 2000); B = similar(A);

julia> @btime copyto!($B, $A);
  2.724 ms (0 allocations: 0 bytes)

julia> @btime copyto2!($B, $A);
  3.933 ms (0 allocations: 0 bytes)

julia> @btime copytont!($B, $A);
  2.952 ms (0 allocations: 0 bytes)

So copyto! still wins, but copytont! is much closer.

If you want multithreaded:

julia> copytontt!(B, A) = vmapntt!(identity, B, A)
copytontt! (generic function with 1 method)

julia> @btime copytontt!($B, $A);
  498.435 μs (0 allocations: 0 bytes)

Aside from the fact that copyto! was still faster than vmapnt! at large sizes, it should be faster at small sizes too. Non-temporal stores must be aligned, and they’ll hurt performance if you need to read the data again soon.

memmove and memcpy should/will do the right thing as a function of size.


How to use vmap or similar methods to vectorize or accelerate the following mapping:

function addsub!(z,w,x,y)
    @inbounds for i in eachindex(z,w)
        z[i]=x[i]+y[i]
        w[i]=x[i]-y[i]
    end
end

Thanks a lot.

@inbounds should do a good job optimizing this already.
Do you want non-temporal stores as well?


Yes, that’s my question, because vmapnt! doesn’t seem to be able to update two output arrays at the same time.
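For what it’s worth, one possible sketch for the two-output case uses LoopVectorization’s @turbo macro instead of the vmapnt! API, since @turbo can vectorize a loop body that writes to multiple arrays (note: whether it emits non-temporal stores is not guaranteed here, and the function name addsub_turbo! is made up):

```julia
using LoopVectorization

# Hypothetical sketch: @turbo vectorizes the loop even though each
# iteration stores into two output arrays.
function addsub_turbo!(z, w, x, y)
    @turbo for i in eachindex(z, w)
        z[i] = x[i] + y[i]
        w[i] = x[i] - y[i]
    end
    return z, w
end
```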